Intelligence Plane: A Framework for Machine Learning Application Life-Cycle Management

Tuesday, June 10, 2025 3:00 PM to Thursday, June 12, 2025 4:00 PM · 2 days 1 hr. (Europe/Berlin)

Foyer D-G - 2nd floor

Research Poster

ML Systems and Tools · Performance and Resource Modeling · Performance Measurement · Resource Management and Scheduling

Information

Poster is on display and will be presented at the poster pitch session.

The growing dependence on high-performance computing (HPC) environments for scientific research in deep learning, genome sequencing, and weather simulation has revealed significant resource provisioning and workflow management challenges. Researchers often encounter complexities related to resource configurations, inconsistent job setups, steep learning curves for profiling tools, constant updates to HPC infrastructure, and limited tools for resource estimation. These difficulties often result in over-provisioning, underutilization of resources, longer queue times, and increased operational costs, ultimately impeding productivity and scientific advancement.
This research introduces an AI-driven framework that addresses these challenges and optimizes resource allocation in HPC environments. The framework combines two strategies: application-specific resource estimation using the HPC Application Resource Predictor (HARP), and intelligent cyberinfrastructure-aware (CI-aware) scheduling, which delivers context-sensitive execution recommendations tailored to the requirements of scientific workflows.
Key innovations of this framework include:
Efficient Training Data Generation: By employing downsized execution campaigns, the framework reduces the cost and time required for generating application-specific training data by a factor of seven, making it feasible to create accurate models across a wide range of applications.
DNN-Estimator for AI Workloads: This specialized tool predicts the resource requirements for deep neural network (DNN) training and inference tasks by analyzing factors such as architecture, dataset characteristics, and hardware configurations.
Cyberinfrastructure Configuration Database: A comprehensive database of HPC system configurations is constructed by combining manually collected information with programmatically generated data. This resource enables efficient validation of resource suitability and execution planning.
Intelligence Plane: This dynamic execution engine orchestrates job scheduling, monitoring, and rescheduling in real time, adapting to changes in HPC environments. It integrates advanced resource estimation and execution strategies to ensure efficient job management.
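To make the idea of validating resource suitability against a configuration database more concrete, here is a minimal Python sketch. It is not the framework's implementation: the `Partition` and `Estimate` records, the `CONFIG_DB` entries, and the tie-breaking rule in `recommend` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Partition:
    """One entry of a hypothetical cyberinfrastructure configuration database."""
    system: str
    name: str
    cores_per_node: int
    mem_gb_per_node: int
    gpus_per_node: int
    max_walltime_h: float


@dataclass(frozen=True)
class Estimate:
    """Hypothetical per-job output of a resource estimator."""
    cores: int
    mem_gb: int
    gpus: int
    walltime_h: float


def suitable(partitions: List[Partition], est: Estimate) -> List[Partition]:
    """Return the partitions where a single node can satisfy the estimate."""
    return [
        p for p in partitions
        if p.cores_per_node >= est.cores
        and p.mem_gb_per_node >= est.mem_gb
        and p.gpus_per_node >= est.gpus
        and p.max_walltime_h >= est.walltime_h
    ]


def recommend(partitions: List[Partition], est: Estimate) -> Optional[Partition]:
    """Pick the suitable partition with the least spare capacity.

    Spare GPUs are compared first, then cores, then memory. This ordering
    is an illustrative assumption meant to limit over-provisioning.
    """
    candidates = suitable(partitions, est)
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda p: (
            p.gpus_per_node - est.gpus,
            p.cores_per_node - est.cores,
            p.mem_gb_per_node - est.mem_gb,
        ),
    )


# Made-up database entries, for illustration only.
CONFIG_DB = [
    Partition("cluster-a", "cpu", 64, 256, 0, 48.0),
    Partition("cluster-a", "gpu", 32, 512, 4, 24.0),
    Partition("cluster-b", "gpu-small", 16, 128, 1, 12.0),
]

if __name__ == "__main__":
    job = Estimate(cores=8, mem_gb=64, gpus=1, walltime_h=6.0)
    print(recommend(CONFIG_DB, job))
```

In this sketch a job that needs one GPU is steered to the smallest node that fits rather than to a four-GPU node, which mirrors the goal of avoiding over-provisioning described above.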
A practical use case in animal ecology highlights the real-world applicability of the framework. AI models analyze data collected from field sensors, triggering automated workflows for model retraining and resource optimization. The process spans edge devices, cloud platforms, and HPC systems. The Intelligence Plane dynamically monitors job progress, adjusts resource allocations, and retrains estimators when needed, ensuring efficient resource utilization. Automated processes such as image ingestion, labeling, model retraining, and job scheduling further demonstrate the framework’s ability to manage complex, multi-stage AI workflows seamlessly.
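The monitor, adjust, and retrain loop described in this use case can be sketched as a simple decision rule. This is an assumption-laden illustration, not the Intelligence Plane's actual logic: the function names, the action labels, and the threshold values (`safety_margin`, `shrink_fraction`, `retrain_threshold`) are hypothetical.

```python
from typing import List


def relative_error(estimated: float, observed: float) -> float:
    """Relative error of an estimate with respect to the observed value."""
    if observed <= 0:
        raise ValueError("observed value must be positive")
    return abs(estimated - observed) / observed


def decide(
    estimated_mem_gb: float,
    observed_peak_mem_gb: float,
    allocated_mem_gb: float,
    safety_margin: float = 0.10,      # hypothetical headroom before rescheduling
    shrink_fraction: float = 0.50,    # hypothetical under-use level for shrinking
    retrain_threshold: float = 0.25,  # hypothetical estimator error tolerance
) -> List[str]:
    """Return the actions a monitor might take for one job observation."""
    actions: List[str] = []

    # Peak usage close to the allocation: the job risks running out of memory.
    if observed_peak_mem_gb > allocated_mem_gb * (1 - safety_margin):
        actions.append("reschedule_with_more_memory")
    # Peak usage far below the allocation: resources are being wasted.
    elif observed_peak_mem_gb < allocated_mem_gb * shrink_fraction:
        actions.append("shrink_allocation")

    # A large gap between estimate and observation suggests a stale estimator.
    if relative_error(estimated_mem_gb, observed_peak_mem_gb) > retrain_threshold:
        actions.append("retrain_estimator")

    return actions


if __name__ == "__main__":
    print(decide(estimated_mem_gb=64, observed_peak_mem_gb=20, allocated_mem_gb=64))
```

Memory is used here only as an example metric; the same pattern would apply to cores, GPUs, or wall time.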
Future directions for this research include improving the DNN estimator's efficiency by leveraging generative AI to dynamically create High-Level Optimization (HLO) graphs for DNN architectures and hardware configurations. This would eliminate the dependency on hardware-specific execution for graph generation, significantly reducing computational overhead. Additionally, the iScheduler framework is planned to be integrated with a chatbot powered by Ollama models, allowing researchers to query models, explore configurations, and receive real-time resource recommendations through natural-language interaction. This approach is intended to simplify resource management in HPC environments.
This framework significantly improves resource efficiency by addressing critical challenges in resource provisioning and workflow automation. It bridges the gap between AI-driven workflows and efficient HPC resource management, providing a scalable and sustainable solution for computationally intensive scientific research.

Contributors:

Format

On Demand · On Site