Abstract
Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need for fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf™ is a community-driven standard for benchmarking machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons™ Association. We present results from the first submission round, which includes a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations across different subsystems, such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, that together enable overall end-to-end performance improvements of more than $10\times$ through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior in order to parameterize extended roofline performance models in future rounds.
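The "extended roofline performance models" referenced above build on the classic roofline bound, which caps a kernel's attainable throughput by either peak compute or memory bandwidth times arithmetic intensity. As a reference point (the paper's extensions toward I/O and network ceilings are not detailed in this abstract):

```latex
% Classic roofline: attainable performance P for a kernel with
% arithmetic intensity I (FLOP/byte), given peak compute P_peak
% (FLOP/s) and peak memory bandwidth \beta (bytes/s).
P(I) = \min\left( P_{\text{peak}},\; \beta \cdot I \right)
```

Extended variants typically add one bandwidth ceiling per subsystem (e.g., filesystem I/O or interconnect), which is why the per-benchmark memory, I/O, and network characterization mentioned above supplies exactly the parameters such models need.

The abstract does not name the specific large-batch learning techniques it discusses; a widely used example in this space is layer-wise adaptive rate scaling (LARS). Below is a minimal sketch of a LARS-style update, assuming plain SGD without momentum; the function and parameter names are illustrative, not the paper's.

```python
import numpy as np

def lars_step(w, grad, base_lr=0.1, eta=0.001, weight_decay=1e-4):
    """One LARS-style SGD step for a single layer's weight tensor.

    The layer-wise "trust ratio" eta * ||w|| / ||g|| keeps each layer's
    update magnitude proportional to its weight magnitude, one of the
    standard remedies for divergence at very large global batch sizes.
    """
    g = grad + weight_decay * w          # L2-regularized gradient
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    trust = eta * w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
    return w - base_lr * trust * g       # layer-wise scaled update

# Toy usage: update a 4x3 weight matrix with a random gradient.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
w = lars_step(w, rng.standard_normal((4, 3)))
```

In data-parallel training this per-layer scaling is applied after the gradient allreduce, so every worker computes the same trust ratio from the same averaged gradient.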
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings of MLHPC 2021 |
| Subtitle of host publication | Workshop on Machine Learning in High Performance Computing Environments, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 33-45 |
| Number of pages | 13 |
| ISBN (Electronic) | 9781665411240 |
| DOIs | |
| State | Published - 2021 |
| Event | 7th IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, MLHPC 2021 - St. Louis, United States. Duration: Nov 15 2021 → … |
Publication series
| Name | Proceedings of MLHPC 2021: Workshop on Machine Learning in High Performance Computing Environments, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis |
| --- | --- |
Conference
| Conference | 7th IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, MLHPC 2021 |
| --- | --- |
| Country/Territory | United States |
| City | St. Louis |
| Period | 11/15/21 → … |
Keywords
- Benchmarks
- Deep Learning
- Scientific Applications
- Supercomputers