Runtime concurrency control and operation scheduling for high performance neural network training

Jiawen Liu, Dong Li, Gokcen Kestor, Jeffrey Vetter

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review
Abstract

Training a neural network (NN) typically relies on a machine learning framework such as TensorFlow or Caffe2. These frameworks employ a dataflow model in which NN training is represented as a directed graph composed of a set of nodes. Operations in NN training are implemented by the frameworks as primitives and represented as nodes in the dataflow graph. Training an NN model in a dataflow-based machine learning framework involves a large number of fine-grained operations, which exhibit diverse memory access patterns and computation intensities. Managing and scheduling these operations is challenging, because we must decide how many threads to run each operation with (concurrency control) and schedule the operations for good hardware utilization and system throughput. In this paper, we extend an existing runtime system (the TensorFlow runtime) to enable automatic concurrency control and scheduling of operations. We explore performance modeling to predict the performance of operations under various degrees of thread-level parallelism. Our performance model is highly accurate and lightweight. Leveraging the performance model, our runtime system employs a set of scheduling strategies that co-run operations to improve hardware utilization and system throughput. Our runtime system demonstrates a significant performance benefit. Compared with the recommended configurations for concurrency control and operation scheduling in TensorFlow, our approach achieves a 36% performance (execution time) improvement on average (up to 49%) across four neural network models, and achieves performance close to the optimum obtained manually by the user.
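
The runtime knobs the paper automates correspond to TensorFlow's standard threading configuration. The sketch below is not the paper's implementation; it only illustrates, for TensorFlow 1.x (current at publication time), the intra-op thread count that concurrency control decides and the inter-op thread count that governs how many operations co-run. The pick_threads helper and its toy cost model are hypothetical stand-ins for the paper's lightweight performance model.

```python
# Minimal sketch, assuming a TensorFlow 1.x session-based workflow.
import tensorflow as tf

def pick_threads(op_cost_model, max_threads):
    # Hypothetical: choose the thread count that a per-operation
    # performance model predicts to minimize execution time.
    return min(range(1, max_threads + 1), key=op_cost_model)

# Toy cost model: speedup saturates past 8 threads, then overhead grows.
cost_model = lambda t: 100.0 / min(t, 8) + 0.5 * t
intra_op = pick_threads(cost_model, max_threads=16)

config = tf.ConfigProto(
    intra_op_parallelism_threads=intra_op,  # threads used inside each operation
    inter_op_parallelism_threads=2,         # operations allowed to co-run
)
with tf.Session(config=config) as sess:
    pass  # build and train the NN model here
```

Note that stock TensorFlow applies one such configuration globally; the paper's runtime instead makes these decisions per operation and combines them with co-run scheduling.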

Original language: English
Title of host publication: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 188-199
Number of pages: 12
ISBN (Electronic): 9781728112466
DOIs
State: Published - May 2019
Event: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019 - Rio de Janeiro, Brazil
Duration: May 20, 2019 - May 24, 2019

Publication series

Name: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019

Conference

Conference: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019
Country/Territory: Brazil
City: Rio de Janeiro
Period: 05/20/19 - 05/24/19

Funding

This work was partially supported by U.S. National Science Foundation (CNS-1617967, CCF-1553645 and CCF-1718194) and Chameleon Cloud. This work was also partially supported by the U.S. Department of Energy, Office for Advanced Scientific Computing (ASCR) under Award No. 66150: "CENATE: The Center for Advanced Technology Evaluation". Pacific Northwest National Laboratory (PNNL) is a multiprogram national laboratory operated for DOE by Battelle Memorial Institute under Contract DE-AC05-76RL01830.
