Abstract
Machine Learning (ML) has become a critical tool enabling new methods of analysis and driving deeper understanding of phenomena across scientific disciplines. There is a growing need for “learning systems” to support various phases in the ML lifecycle. While others have focused on supporting model development, training, and inference, few have focused on the unique challenges inherent in science, such as the need to publish and share models and to serve them on a range of available computing resources. In this paper, we present the Data and Learning Hub for science (DLHub), a learning system designed to support these use cases. Specifically, DLHub enables publication of models with descriptive metadata, persistent identifiers, and flexible access control. It packages arbitrary models into portable servable containers and enables low-latency, distributed serving of these models on heterogeneous compute resources. We show that DLHub delivers low-latency model inference comparable to that of other model-serving systems, including TensorFlow Serving, SageMaker, and Clipper, and that enabling batching and memoization improves performance by up to 95%. We also show that DLHub can scale to serve models concurrently on 500 containers. Finally, we describe five case studies that highlight the use of DLHub for scientific applications.
| Original language | English |
|---|---|
| Pages (from-to) | 64-76 |
| Number of pages | 13 |
| Journal | Journal of Parallel and Distributed Computing |
| Volume | 147 |
| DOIs | |
| State | Published - Jan 2021 |
| Externally published | Yes |
Funding
This work was supported in part by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory and the RAMSES project, both from the U.S. Department of Energy under Contract DE-AC02-06CH11357, the Defense Advanced Research Projects Agency under Grant Number HR00111820006, and NSF under Grant Numbers 1550588, 1931298, and 2004894. We thank Amazon Web Services for research credits and Argonne's Leadership Computing Facility and Joint Laboratory for System Evaluation for computing resources.
Keywords
- DLHub
- Learning systems
- Machine learning
- Model serving