HARMONY: Large-Scale Architecture Search for Efficient Hybrid Language Models

Tuesday, June 23, 2026 4:40 PM to 5:00 PM · 20 min. (Europe/Berlin)
Hall E - 2nd Floor
Research Paper
AI Applications powered by HPC Technologies · AI Factories · Large Language Models and Generative AI in HPC · ML Model Optimization

Information

As large language models scale to trillions of parameters, their computational and memory requirements present critical challenges for efficient training and deployment. While Mixture of Experts (MoE) architectures enable efficient scaling through sparse parameter activation, and state-space models like Mamba offer linear-time complexity, principled methods for combining these paradigms remain undeveloped. We introduce HARMONY (Hybrid Architecture Research for Mamba, Optimized with Neural efficiencY), a multi-objective evolutionary neural architecture search framework for discovering efficient hybrid language models that integrate Transformer attention mechanisms, Mixture-of-Experts routing, and Mamba state-space components.
Through large-scale distributed search using 16,384 MI250X GPUs on the Frontier supercomputer, HARMONY explores a comprehensive design space encompassing six attention variants (MHA, MQA, GQA, MLA, SWA, and Mamba-2), variable MoE configurations with both routed and shared experts, and extensive Mamba hyperparameters. Our framework discovers heterogeneous architectures that balance training performance with computational efficiency through multi-objective optimization incorporating latency penalties and fitness-based selection.
Analysis of discovered architectures reveals that optimal hybrid designs favor heterogeneous component mixing rather than homogeneous patterns, with Mamba-2 and Multi-Head Latent Attention (MLA) emerging as preferred mechanisms. Discovered architectures demonstrate superior training efficiency: our best configuration achieves a final perplexity of 1.0874 with 2.38B parameters while processing 4,320 tokens/second, outperforming significantly larger manually designed models. Full-scale evaluation shows that HARMONY's top architectures achieve better loss trajectories than equivalently sized models using state-of-the-art configurations, including Mixtral, Jamba, and Samba. Additionally, we demonstrate 91% weak scaling efficiency when training discovered 36B-parameter models across 1,024 GPUs.
HARMONY is released as an open framework with comprehensive tools for building and training hybrid models using expert-data-pipeline parallelism, democratizing access to automated architecture design for next-generation language models.
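The abstract mentions multi-objective optimization with latency penalties and fitness-based selection but does not give the formulas. As a rough illustration only, the Python sketch below shows one common way such a scheme can be set up: a scalar fitness that subtracts a penalty when a candidate exceeds a latency budget, followed by tournament selection. The names `Candidate`, `fitness`, and `tournament_select`, the penalty form, and all default values are assumptions made for this example and are not taken from HARMONY.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one evaluated architecture."""
    name: str          # illustrative label, not a HARMONY identifier
    loss: float        # validation loss after a short training run (lower is better)
    latency_ms: float  # measured step latency in milliseconds


def fitness(c: Candidate, latency_budget_ms: float = 100.0,
            penalty_weight: float = 0.5) -> float:
    """Higher is better. Assumed form: negative loss minus a penalty that
    grows linearly with the relative overshoot of the latency budget."""
    overshoot = max(0.0, c.latency_ms / latency_budget_ms - 1.0)
    return -c.loss - penalty_weight * overshoot


def tournament_select(population: list[Candidate], k: int,
                      rng: random.Random, num_parents: int) -> list[Candidate]:
    """Pick parents by repeatedly sampling k candidates without replacement
    and keeping the fittest of each sample."""
    parents = []
    for _ in range(num_parents):
        contestants = rng.sample(population, k)
        parents.append(max(contestants, key=fitness))
    return parents


if __name__ == "__main__":
    # Made-up numbers purely to exercise the functions.
    pop = [
        Candidate("fast", loss=2.0, latency_ms=80.0),
        Candidate("slow", loss=1.9, latency_ms=150.0),
    ]
    for c in pop:
        print(c.name, fitness(c))
    print([p.name for p in tournament_select(pop, k=2, rng=random.Random(0), num_parents=3)])
```

With these assumed settings, the slower candidate loses despite its lower loss because the penalty outweighs the loss difference. A real multi-objective search might instead keep a Pareto front rather than collapse the objectives into a single scalar.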
Contributors:
Format
on-site
