Distinguished Lecture Series

Prof. Torsten Hoefler

ETH Zürich, Switzerland

20 April 2023, 16:15

“Designing network support for High-Performance Deep Learning Systems”

via Zoom

Abstract:

The talk will cover principles for training large-scale AI models efficiently and how to design AI systems for this purpose. It will focus on networking support for distributed large-scale AI training, based on the fact that much of the progress in modern artificial intelligence is made by scaling to larger and larger deep learning models trained with more data. One such example is the GPT-3 model with 175 billion parameters, which forms the basis for many services like Microsoft’s Copilot and ChatGPT. The talk will also explain how networking support can help with distributed training of these models. We will discuss communication patterns in AI and derive a design for a specialized network topology that improves cost per bandwidth by nearly a factor of 15.

Bio:

Torsten is a Full Professor of Computer Science at ETH Zürich, Switzerland. He held visiting positions at Argonne National Laboratory, Sandia National Laboratories, Microsoft's Quantum Computing team, and Microsoft's Azure Hardware Architecture group. Before joining ETH, he led the performance modeling and simulation efforts of parallel petascale applications for the NSF-funded Blue Waters project at NCSA/UIUC. He received his PhD from Indiana University under the supervision of Prof. Andrew Lumsdaine. For his MSc thesis on large-scale fast barrier synchronization, he received the best student award at the Technical University of Chemnitz.

He is also a key member of the Message Passing Interface (MPI) Forum, where he chairs the “Collective Operations and Topologies” working group. Torsten was elected to the first steering committee of ACM's SIGHPC in 2013 and was re-elected in 2016 and 2019. He chaired EuroMPI 2019 and the Hot Interconnects conference in 2013. He (co-)organized Supercomputing (SC18), ACM PASC'16, and IEEE Hot Interconnects 2012 as program chair and IPDPS 2017 as track chair. He is currently on the Supercomputing SCXY steering committee.

Torsten has published more than 200 peer-reviewed scientific conference and journal articles and authored chapters of the MPI-2.2 and MPI-3.0 standards. He won best paper awards at the ACM/IEEE Supercomputing Conference 2010 (SC10), EuroMPI 2013, SC13, SC14, SC19, IPDPS'15, ACM HPDC'15 and HPDC'16, ACM OOPSLA'16, and other conferences. According to Google Scholar, his work has been cited more than 8,200 times and his h-index is 48. He has been invited to present keynotes at significant conferences and workshops such as DISC'20, HPC China, MLHPC, PPAM, PARCO, EuroMPI, and many others.

Torsten led the team that received the ACM Gordon Bell Prize in 2019 in two categories. For his work, he received the IEEE TCSC Award of Excellence (MCR) in 2019, ETH Zurich's Latsis Prize in 2015, the SIAM SIAG/Supercomputing Junior Scientist Prize in 2012, and the IEEE TCSC Young Achievers in Scalable Computing Award in 2013. He was also awarded the BenchCouncil Rising Star Award in 2020. Following his PhD, he received the Young Alumni Award 2014 from Indiana University. He was the first European to receive many of those honors.

His research interests revolve around the central topic of “Performance-centric System Design” and include scalable networks, parallel programming techniques, and performance modeling. Currently, he is (co-)driving three major projects: (1) sPIN: designing portable network-offload engines (published at SC17 and SC19) to develop a data-movement-centric acceleration scheme, similar to CUDA, in network cards and switches; (2) DAPP: data-centric programming schemes that separate the concerns of the domain scientist and the performance engineer to enable performance-portable parallel programming in a Python-based high-performance framework (published at SC19, winner of the Gordon Bell Prize 2019); and (3) EPI: as co-lead with Luca Benini of ETH Zurich's activities in the European Processor Initiative to develop a high-performance accelerator (published in IEEE TOC, DAC).

Additional information about Torsten can be found on his homepage.