“Alps” – one year on from the first GH200 rollout

Wednesday, June 11, 2025 11:29 AM to 11:40 AM · 11 min. (Europe/Berlin)

Hall Z - 3rd floor

HPC Around the World

AI Applications powered by HPC TechnologiesExtreme-scale SystemsHeterogeneous System ArchitecturesHPC in the Cloud and HPC Containers

Information

Alps is CSCS’s cloud-native flagship, combining a low-latency Slingshot 11 fabric with a heterogeneous fleet of CPU-only, AMD MI250X, AMD MI300A, NVIDIA A100, plus the world’s first production-scale partition of NVDIA GH200 Grace-Hopper superchips. A thin virtualization layer carves the machine into versatile software-defined clusters (vClusters) that can expand or shrink on demand, while shared vServices give multiple tenants true “HPC-as-a-service” without compromising performance or security.
Since entering production and year ago it has:

• run MeteoSwiss’s operational 1 km NWP suite on A100s alongside traditional HPC, large-scale data analytics, and ML workloads;

• scaled from the first tests with 128 GH200 nodes in late 2023 to nearly 11 000 superchips by June 2024, with full acceptance in July;

• powered most 2024 Gordon Bell finalists, including the climate-model winner.

Today, Alps underpins the Swiss AI Initiative. Sciensts from the AI Centers of ETH Zürich and EPFL are pre-training a 70-billion-parameter LLM on ~40 % of the GH200 partition. Single jobs at scale can train on up to 500 billion tokes in a single run, reaching up to 48h of stable runtime, with average performance within 94% of peak. The underlying infrastructure was engineered by CSCS in partneship with NVDIA and HPE. The pre-praining on 15-trillion-tokens will finish in time for public release this summer.

Format

On DemandOn Site