Abstract
We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor Cores, a half-precision version of the DeepLabv3+ network achieves a peak and sustained throughput of 1.13 EF/s and 999.0 PF/s respectively.
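The scaling figures above can be put on a per-GPU footing with a little arithmetic. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and the baseline calculation assumes parallel efficiency is defined as sustained throughput divided by (GPU count × single-GPU throughput), which the abstract does not state. The half-precision run is left out because the abstract does not give its GPU count.

```python
# Figures quoted in the abstract: (GPU count, sustained PF/s, parallel efficiency).
RUNS = {
    "Tiramisu (P100)": (5300, 21.0, 0.790),
    "DeepLabv3+ single precision (V100)": (27360, 325.8, 0.907),
}


def per_gpu_tflops(n_gpus, sustained_pflops):
    """Average sustained throughput per GPU in TF/s (1 PF/s = 1000 TF/s)."""
    return sustained_pflops * 1000.0 / n_gpus


def implied_baseline_tflops(n_gpus, sustained_pflops, efficiency):
    """Per-GPU rate that ideal linear scaling would start from.

    ASSUMPTION (not stated in the abstract):
    efficiency = sustained / (n_gpus * baseline).
    """
    return per_gpu_tflops(n_gpus, sustained_pflops) / efficiency


if __name__ == "__main__":
    for name, (n, pf, eff) in RUNS.items():
        print(
            f"{name}: {per_gpu_tflops(n, pf):.2f} TF/s per GPU, "
            f"implied baseline {implied_baseline_tflops(n, pf, eff):.2f} TF/s"
        )
```

Under that assumed definition, the gap between the two printed numbers for each run is the per-GPU throughput lost to scaling overheads.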
| Original language | English |
|---|---|
| Title of host publication | Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 649-660 |
| Number of pages | 12 |
| ISBN (Electronic) | 9781538683842 |
| State | Published - Jul 2 2018 |
| Event | 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 - Dallas, United States; Duration: Nov 11 2018 → Nov 16 2018 |
Publication series
| Name | Proceedings - International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 |
|---|---|
Conference
| Conference | 2018 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 |
|---|---|
| Country/Territory | United States |
| City | Dallas |
| Period | 11/11/18 → 11/16/18 |
Funding
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This work was supported by a grant from the Swiss National Supercomputing Centre (CSCS) under Project ID g107. We thank Nicholas Cardo, Andreas Joksch, Miguel Gila, and the CSCS staff for assistance in using Piz Daint. We thank Paul Tucker and Rajat Monga from Google for helpful discussions pertaining to TensorFlow. Michael Wehner, Karthik Kashinath, Burlen Loring, Travis O’Brien, and Bill Collins from LBNL were instrumental in motivating the climate science problem and providing datasets. This research used the Summit system at the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. We are very grateful to OLCF staff Veronica Melesse Vergara, Don Maxwell, and Matthew Ezell for their assistance with the runs, and to Arjun Shankar, Ashley Barker, Tjerk Straatsma, and Jack Wells for programmatic support.