Infrastructure-Aware TensorFlow for Heterogeneous Datacenters

Moiz Arif, M. Mustafa Rafique, Seung Hwan Lim, Zaki Malik

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

5 Scopus citations

Abstract

Heterogeneous datacenters, with a variety of compute, memory, and network resources, are becoming increasingly popular to address the resource requirements of time-sensitive applications. One such application framework is the TensorFlow platform, which has become a platform of choice for running machine learning workloads. The state-of-the-art TensorFlow platform is oblivious to the availability and performance profiles of the underlying datacenter resources and does not incorporate resource requirements of the given workloads for distributed training. This leads to executing the training tasks on busy and resource-constrained worker nodes, which results in a significant increase in the overall training time. In this paper, we address this challenge and propose architectural improvements and new software modules in the default TensorFlow platform to make it aware of the availability and capabilities of the underlying datacenter resources. The proposed Infrastructure-Aware TensorFlow efficiently schedules the training tasks on the best possible resources for execution and reduces the overall training time. Our evaluation using the worker nodes with varying availability and performance profiles shows that the proposed enhancements yield up to 54% reduced training time as compared to the default TensorFlow platform.
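
The abstract does not describe the scheduling policy itself. As a rough illustration of what "infrastructure-aware" placement could look like, the sketch below filters out worker nodes that cannot meet a workload's memory requirement and ranks the rest by a simple availability score. Everything here is an assumption made for illustration, not the authors' method: the `Worker` fields, the score (idle CPU fraction times network bandwidth), and the `pick_workers` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical snapshot of one worker node's current state."""
    name: str
    cpu_free: float     # fraction of CPU currently idle, 0.0 to 1.0
    mem_free_gb: float  # free memory in GB
    net_gbps: float     # network bandwidth in Gbit/s


def pick_workers(workers, k, mem_needed_gb):
    """Return the names of the k best workers for a training task.

    Workers without enough free memory are excluded; the rest are
    ranked by an assumed score: idle CPU fraction * network bandwidth.
    """
    eligible = [w for w in workers if w.mem_free_gb >= mem_needed_gb]
    ranked = sorted(eligible, key=lambda w: w.cpu_free * w.net_gbps, reverse=True)
    return [w.name for w in ranked[:k]]


if __name__ == "__main__":
    cluster = [
        Worker("a", cpu_free=0.9, mem_free_gb=16, net_gbps=10),
        Worker("b", cpu_free=0.2, mem_free_gb=64, net_gbps=10),  # busy node
        Worker("c", cpu_free=0.8, mem_free_gb=4, net_gbps=25),   # too little memory
        Worker("d", cpu_free=0.5, mem_free_gb=32, net_gbps=25),
    ]
    print(pick_workers(cluster, k=2, mem_needed_gb=8))  # ['d', 'a']
```

A default, infrastructure-oblivious placement would correspond to taking the first `k` entries of `cluster` regardless of load, which in this toy cluster would select the busy node `b`.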

Original language: English
Title of host publication: Proceedings - 2020 IEEE 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2020
Publisher: IEEE Computer Society
ISBN (Electronic): 9781728192383
State: Published - Nov 17 2020
Event: 28th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2020 - Nice, France
Duration: Nov 17 2020 → Nov 18 2020

Publication series

Name: Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS
Volume: 2020-November
ISSN (Print): 1526-7539

Conference

Conference: 28th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2020
Country/Territory: France
City: Nice
Period: 11/17/20 → 11/18/20

Funding

Results presented in this paper are obtained using the Chameleon and CloudLab testbeds supported by the National Science Foundation. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy (DOE). The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes. The DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (https://www.energy.gov/downloads/doe-public-access-plan).

Funders                        Funder number
National Science Foundation    DE-AC05-00OR22725
U.S. Department of Energy

    Keywords

    • Distributed TensorFlow
    • datacenter resource management
    • datacenter utilization
    • heterogeneous datacenters
