Abstract
Cloud computing platforms can provide the computational resources required for training large machine learning models such as deep neural networks. While the pay-as-you-go nature of cloud virtual machines (VMs) makes it easy to spin up large clusters for training models, it can also lead to ballooning costs. The hundreds of virtual machine sizes provided by cloud platforms also make it extremely challenging to select the 'right' cloud cluster configuration for training. Furthermore, the training time and cost of distributed model training are highly sensitive to the cluster configuration, and present a large and complex tradeoff space. In this paper, we develop principled and practical techniques for optimizing the training time and cost of distributed ML model training on the cloud. Our key insight is that both parallel and statistical efficiency must be considered when selecting the optimum job configuration parameters, such as the number of workers and the batch size. By combining conventional parallel scaling concepts with new insights into SGD noise, we develop models for estimating the time and cost on different cluster configurations. Using the repetitive nature of training and our performance models, our Scavenger cloud service can search for optimum cloud configurations in a black-box, online manner. Our approach reduces training times by 2× and costs by more than 50%. Our performance models are accurate to within 2%, and our search imposes only a 10% overhead compared to an ideal oracle-based approach.
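The abstract's core idea, combining parallel efficiency (how iteration time scales with workers) with statistical efficiency (how many iterations SGD needs at a given batch size), can be sketched as a simple search over candidate configurations. This is an illustrative toy model, not Scavenger's actual performance model: all function names, constants, and cost formulas below (e.g., the square-root communication term and the gradient-noise-style diminishing-returns term) are hypothetical placeholders.

```python
def time_per_iteration(workers, batch_size, compute_rate=1e4, comm_overhead=0.02):
    """Seconds per SGD iteration: compute time shrinks with more workers,
    while communication (e.g., all-reduce) overhead grows with cluster size.
    Constants are illustrative, not measured."""
    compute = batch_size / (workers * compute_rate)
    comm = comm_overhead * workers ** 0.5
    return compute + comm

def iterations_to_converge(batch_size, base_iters=50_000, noise_scale=512):
    """Statistical efficiency: larger batches need fewer iterations, but with
    diminishing returns beyond a gradient-noise-scale-like threshold.
    Normalized so a batch size of 32 needs base_iters iterations."""
    return base_iters * (1 + noise_scale / batch_size) / (1 + noise_scale / 32)

def estimate(workers, batch_size, price_per_worker_hour=0.10):
    """Estimated (training time in seconds, dollar cost) for one configuration."""
    iters = iterations_to_converge(batch_size)
    seconds = iters * time_per_iteration(workers, batch_size)
    cost = seconds / 3600 * workers * price_per_worker_hour
    return seconds, cost

# Exhaustive search over a small candidate grid for the cheapest configuration.
configs = [(w, b) for w in (1, 2, 4, 8, 16) for b in (32, 128, 512, 2048)]
best = min(configs, key=lambda c: estimate(*c)[1])
print("cheapest (workers, batch_size):", best)
```

Even this toy model exhibits the tradeoff the paper highlights: adding workers lowers per-iteration compute time but raises communication cost, and larger batches cut iteration counts only up to a point, so neither time nor cost is monotone in the configuration.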
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 |
| Editors | Yogesh Simmhan, Ilkay Altintas, Ana-Lucia Varbanescu, Pavan Balaji, Abhinandan S. Prasad, Lorenzo Carnevale |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 403-413 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798350301199 |
| DOIs | |
| State | Published - 2023 |
| Externally published | Yes |
| Event | 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 - Bangalore, India; Duration: May 1 2023 → May 4 2023 |
Publication series
| Name | Proceedings - 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 |
|---|
Conference
| Conference | 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2023 |
|---|---|
| Country/Territory | India |
| City | Bangalore |
| Period | 05/1/23 → 05/4/23 |
Funding
This research was supported by NSF grant OAC-2112606.