Refine: A Robust Approach to Unsupervised Anomaly Detection for Production HPC Systems

Thursday, June 12, 2025 1:00 PM to 1:25 PM · 25 min. (Europe/Berlin)

Hall F - 2nd floor

Research Paper

Engineering · High-Performance Data Analytics · System and Performance Monitoring

Information

High-Performance Computing (HPC) systems are critical for many scientific applications, but they are often subject to performance variations due to "anomalies", which can lead to reduced efficiency and higher operational costs. To address this, machine learning (ML) techniques have been increasingly applied to automatically detect performance anomalies. However, traditional unsupervised anomaly detection methods assume that training datasets are free of anomalies. In real-world HPC systems, though, data is typically contaminated by anomalies caused by factors such as shared resource contention, software bugs, or hardware failures. These anomalies in the training data can significantly undermine the performance of ML models. To overcome this issue, we introduce Refine, a robust anomaly detection framework based on variational autoencoders (VAEs). Refine iteratively removes high-error samples during training in an unsupervised manner. By gradually reducing the proportion of anomalies in the training dataset based on reconstruction error, our approach enhances the model’s robustness and overall performance. We evaluate Refine using data collected from a production HPC system, Eclipse, and demonstrate its effectiveness in handling varying levels of contamination. Even with up to 10% anomalies in the training dataset, Refine achieves an F1-score of 0.88, outperforming state-of-the-art unsupervised anomaly detection methods. Moreover, when applied to real production system data, Refine achieves 100% accuracy in detecting anomalies.
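
The iterative removal of high-error samples described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: to stay self-contained it replaces the VAE with a trivial stand-in model (the per-feature mean of the retained samples), and the function names, the 5% drop fraction, the three rounds, and the synthetic two-dimensional data are all assumptions made for illustration.

```python
import random
import statistics


def fit_mean_model(samples):
    """Stand-in for VAE training (assumption): the 'model' is the per-feature mean."""
    dims = len(samples[0])
    return [statistics.fmean(s[d] for s in samples) for d in range(dims)]


def reconstruction_error(model, sample):
    """Squared distance between a sample and its constant 'reconstruction'."""
    return sum((x - m) ** 2 for x, m in zip(sample, model))


def refine_training_set(samples, drop_fraction=0.05, rounds=3):
    """Refit on the kept samples, then drop the highest-error fraction, each round.

    drop_fraction and rounds are hypothetical values, not taken from the paper.
    """
    kept = list(samples)
    for _ in range(rounds):
        model = fit_mean_model(kept)
        # Sort ascending by error so the highest-error samples sit at the end.
        kept.sort(key=lambda s: reconstruction_error(model, s))
        n_keep = max(1, int(len(kept) * (1.0 - drop_fraction)))
        kept = kept[:n_keep]
    # Final fit on the refined (less contaminated) training set.
    return fit_mean_model(kept), kept


# Synthetic data: 180 "healthy" samples plus 20 anomalies (10% contamination).
rng = random.Random(0)
normal = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(180)]
anomalous = [(rng.gauss(8.0, 1.0), rng.gauss(8.0, 1.0)) for _ in range(20)]

model, kept = refine_training_set(normal + anomalous)
remaining = sum(1 for s in kept if s in anomalous)
print(f"kept {len(kept)} of 200 samples, {remaining} anomalies remain")
```

In this toy setting the anomalies lie far from the healthy cluster, so they dominate the high-error tail and are removed in the first rounds. With a real VAE the same loop would retrain (or continue training) the network on the retained samples and rank them by reconstruction error before each removal step.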

Contributors:

Format

On Demand · On Site