DLHub: Simplifying publication, discovery, and use of machine learning models in science

Zhuozhao Li, Ryan Chard, Logan Ward, Kyle Chard, Tyler J. Skluzacek, Yadu Babuji, Anna Woodard, Steven Tuecke, Ben Blaiszik, Michael J. Franklin, Ian Foster

Research output: Contribution to journal, Article, peer-review

25 Scopus citations

Abstract

Machine Learning (ML) has become a critical tool enabling new methods of analysis and driving deeper understanding of phenomena across scientific disciplines. There is a growing need for “learning systems” to support various phases in the ML lifecycle. While others have focused on supporting model development, training, and inference, few have focused on the unique challenges inherent in science, such as the need to publish and share models and to serve them on a range of available computing resources. In this paper, we present the Data and Learning Hub for science (DLHub), a learning system designed to support these use cases. Specifically, DLHub enables publication of models, with descriptive metadata, persistent identifiers, and flexible access control. It packages arbitrary models into portable servable containers, and enables low-latency, distributed serving of these models on heterogeneous compute resources. We show that DLHub supports low-latency model inference comparable to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, and improved performance, by up to 95%, with batching and memoization enabled. We also show that DLHub can scale to concurrently serve models on 500 containers. Finally, we describe five case studies that highlight the use of DLHub for scientific applications.
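The abstract credits batching and memoization for the performance gains. As a rough illustration of those two ideas only, the following minimal Python sketch caches results per input and sends all uncached inputs to the model in a single batched call. The names `MemoizedBatchServer`, `infer`, and `toy_model` are hypothetical and are not part of DLHub's API; the paper's actual implementation is not shown on this page.

```python
from typing import Any, Callable, Dict, List, Sequence


class MemoizedBatchServer:
    """Hypothetical wrapper illustrating memoization plus batching.

    This is an assumption-laden sketch, not DLHub's implementation.
    """

    def __init__(self, model_fn: Callable[[List[Any]], List[Any]]):
        self.model_fn = model_fn          # takes a batch, returns one output per input
        self.cache: Dict[Any, Any] = {}   # memoized results keyed by (hashable) input
        self.model_calls = 0              # number of batched model invocations

    def infer(self, inputs: Sequence[Any]) -> List[Any]:
        # Deduplicate the request and keep only inputs with no cached result.
        pending = [x for x in dict.fromkeys(inputs) if x not in self.cache]
        if pending:
            # One batched call covers every uncached input in this request.
            self.model_calls += 1
            outputs = self.model_fn(pending)
            self.cache.update(zip(pending, outputs))
        # Answer every input, including duplicates, from the cache.
        return [self.cache[x] for x in inputs]


def toy_model(batch: List[int]) -> List[float]:
    """Stand-in for a real model: squares each input."""
    return [float(x * x) for x in batch]


server = MemoizedBatchServer(toy_model)
first = server.infer([1, 2, 2, 3])   # one batched call for the inputs 1, 2, 3
second = server.infer([3, 1])        # served entirely from the cache
```

In this sketch the second request triggers no model call at all, which is the kind of saving memoization targets; batching instead amortizes per-call overhead across the inputs of a single request.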

Original language: English
Pages (from-to): 64-76
Number of pages: 13
Journal: Journal of Parallel and Distributed Computing
Volume: 147
State: Published - Jan 2021
Externally published: Yes

Funding

This work was supported in part by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory and the RAMSES project, both from the U.S. Department of Energy under Contract DE-AC02-06CH11357, the Defense Advanced Research Projects Agency under Grant Number HR00111820006, and NSF under Grant Numbers 1550588, 1931298, and 2004894. We thank Amazon Web Services for research credits and Argonne's Leadership Computing Facility and Joint Laboratory for System Evaluation for computing resources.

Keywords

  • DLHub
  • Learning systems
  • Machine learning
  • Model serving
