Performance analysis and optimization for scalable deployment of deep learning models for country-scale settlement mapping on Titan supercomputer

Kuldeep Kurte, Jibonananda Sanyal, Anne Berres, Dalton Lunga, Mark Coletti, Hsiuhan Lexie Yang, Daniel Graves, Benjamin Liebersohn, Amy Rose

Research output: Contribution to journal › Article › peer-review

9 Scopus citations

Abstract

This paper presents a scalable object detection workflow for detecting objects, such as settlements, in remotely sensed (RS) imagery. We have successfully deployed this workflow on the Titan supercomputer and used it to map human settlements at country scale. The performance of the various workflow stages was analyzed before the workflow was made operational. The workflow implements several strategies to address suboptimal resource utilization and long-tail effects caused by an unbalanced image workload, data loss caused by runtime failures, and the maximum wall-time constraints imposed by Titan's job scheduling policy. A mean shift clustering-based static load balancing strategy partitions the image load such that each partition contains similar-sized images. Furthermore, a checkpoint-restart strategy was added to the workflow as a fault-tolerance mechanism to prevent data loss due to unforeseen runtime failures. The performance of these strategies was observed in various scenarios, such as node failure, exceeding the wall time, and successful completion. Using this workflow, we processed an RS data set with a spatial resolution of 0.31 m, covering 685 675 km² of the Republic of Zambia, in under six hours using 5426 nodes of the Titan supercomputer.
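The static load balancing idea in the abstract can be illustrated with a minimal sketch. It assumes a single scalar size per image (for example, file size or pixel count) as the clustering feature and uses a simple flat-kernel mean shift in one dimension; the feature, the bandwidth, and the function names `mean_shift_1d` and `partition_by_size` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift on 1-D data; returns one converged mode per value."""
    modes = []
    for v in values:
        x = float(v)
        for _ in range(max_iter):
            # Points inside the bandwidth window around the current estimate.
            window = [u for u in values if abs(u - x) <= bandwidth]
            if not window:  # defensive guard against rounding edge cases
                break
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes


def partition_by_size(image_sizes, bandwidth):
    """Group image names so that each partition holds similar-sized images.

    image_sizes: dict mapping an image name to its size (assumed feature).
    """
    names = list(image_sizes)
    sizes = [image_sizes[n] for n in names]
    modes = mean_shift_1d(sizes, bandwidth)

    centers = []                    # one representative mode per partition
    partitions = defaultdict(list)  # partition index -> image names
    for name, mode in zip(names, modes):
        for i, center in enumerate(centers):
            # Modes that converged close together belong to the same cluster.
            if abs(center - mode) <= bandwidth / 2:
                partitions[i].append(name)
                break
        else:
            centers.append(mode)
            partitions[len(centers) - 1].append(name)
    return [partitions[i] for i in range(len(centers))]


if __name__ == "__main__":
    # Hypothetical image sizes in arbitrary units.
    sizes = {"a": 100, "b": 110, "c": 105, "d": 900, "e": 950, "f": 5000}
    for part in partition_by_size(sizes, bandwidth=100):
        print(part)
```

Each resulting partition could then be submitted as its own batch, so that workers within a batch finish at roughly the same time and the long tail caused by a few very large images is confined to a separate partition. How the paper maps partitions to nodes is not stated in the abstract.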

Original language: English
Article number: e5305
Journal: Concurrency and Computation: Practice and Experience
Volume: 31
Issue number: 20
State: Published - Oct 25 2019

Funding

This research was conducted under an ALCC allocation at the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility operated by the Oak Ridge National Laboratory. It was supported by US federal government funding. The authors would like to thank OLCF for their support of this work and the organization of the 2017 GPU Hackathon. This research also used resources of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Keywords

  • HPC
  • convolutional neural network
  • deep learning
  • fault tolerance
  • human settlement mapping
  • load balancing
