

“Alps” – one year on from the first GH200 rollout
Wednesday, June 11, 2025 11:29 AM to 11:40 AM · 11 min. (Europe/Berlin)
Hall Z - 3rd floor
HPC Around the World
AI Applications powered by HPC Technologies · Extreme-scale Systems · Heterogeneous System Architectures · HPC in the Cloud and HPC Containers
Information
Alps is CSCS’s cloud-native flagship, combining a low-latency Slingshot 11 fabric with a heterogeneous fleet of CPU-only, AMD MI250X, AMD MI300A, and NVIDIA A100 nodes, plus the world’s first production-scale partition of NVIDIA GH200 Grace Hopper superchips. A thin virtualization layer carves the machine into versatile software-defined clusters (vClusters) that can expand or shrink on demand, while shared vServices give multiple tenants true “HPC-as-a-service” without compromising performance or security.
Since entering production a year ago, it has:
• run MeteoSwiss’s operational 1 km NWP suite on A100s alongside traditional HPC, large-scale data analytics, and ML workloads;
• scaled from the first tests with 128 GH200 nodes in late 2023 to nearly 11 000 superchips by June 2024, with full acceptance in July;
• powered most 2024 Gordon Bell finalists, including the climate-model winner.
Today, Alps underpins the Swiss AI Initiative. Scientists from the AI Centers of ETH Zürich and EPFL are pre-training a 70-billion-parameter LLM on ~40 % of the GH200 partition. A single job at scale can train on up to 500 billion tokens per run, reaching up to 48 h of stable runtime, with average performance within 94% of peak. The underlying infrastructure was engineered by CSCS in partnership with NVIDIA and HPE. The pre-training on 15 trillion tokens will finish in time for public release this summer.
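As a rough back-of-the-envelope check on the figures quoted above, the short Python sketch below works out how many superchips ~40 % of the GH200 partition corresponds to, and the minimum number of runs needed to cover 15 trillion tokens at up to 500 billion tokens per run. It assumes the partition holds roughly 11 000 superchips (the "nearly 11 000" figure above) and that every run reaches the quoted maximum; the function and constant names are illustrative only and do not come from CSCS.

```python
# Approximate figures taken from the abstract; names are illustrative.
GH200_SUPERCHIPS = 11_000            # "nearly 11 000 superchips" (assumed round figure)
PARTITION_SHARE = 0.40               # "~40 % of the GH200 partition"
TOTAL_TOKENS = 15 * 10**12           # 15 trillion pre-training tokens
MAX_TOKENS_PER_RUN = 500 * 10**9     # up to 500 billion tokens in a single run


def superchips_in_use(total_superchips: int, share: float) -> int:
    """Approximate number of superchips covered by a fractional share."""
    return round(total_superchips * share)


def min_runs(total_tokens: int, tokens_per_run: int) -> int:
    """Lower bound on the number of runs, using integer ceiling division."""
    return -(-total_tokens // tokens_per_run)


if __name__ == "__main__":
    chips = superchips_in_use(GH200_SUPERCHIPS, PARTITION_SHARE)
    runs = min_runs(TOTAL_TOKENS, MAX_TOKENS_PER_RUN)
    print(f"Superchips used for pre-training (approx.): {chips}")
    print(f"Minimum number of runs for the full corpus: {runs}")
```

Under these assumptions the pre-training occupies on the order of 4 400 superchips and needs at least 30 runs. Both numbers are estimates: the first rests on a rounded partition size, and the second is a lower bound because not every run will reach the 500-billion-token maximum.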
Format
On Demand · On Site
Registered attendees
Adrian Marszalik
Manager Mass Storage Department, Academic Computer Centre CYFRONET of the AGH University of Krakow
Christie Alappat
Researcher, University of Erlangen-Nuremberg, Erlangen National High Performance Computing Center
Daniel Večerka
IT specialist, Czech Technical University in Prague, Faculty of Electrical Engineering