Characterizing the Performance of Executing Many-tasks on Summit

Matteo Turilli, Andre Merzky, Thomas Naughton, Wael Elwasif, Shantenu Jha

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Scopus citations

Abstract

Many scientific workloads are comprised of many tasks, where each task is an independent simulation or analysis of data. The execution of millions of tasks on heterogeneous HPC platforms requires scalable dynamic resource management and multi-level scheduling. RADICAL-Pilot (RP), an implementation of the Pilot abstraction, addresses these challenges and serves as an effective runtime system to execute workloads comprised of many tasks. In this paper, we characterize the performance of executing many tasks using RP when interfaced with JSM and PRRTE on Summit: RP is responsible for resource management and task scheduling on acquired resources; JSM or PRRTE enact the placement and launching of scheduled tasks. Our experiments provide lower bounds on the performance of RP when integrated with JSM and PRRTE. Specifically, for workloads comprised of homogeneous single-core, 15-minute-long tasks we find that: PRRTE scales better than JSM for > O(1000) tasks; PRRTE overheads are negligible; and PRRTE supports optimizations that lower the impact of overheads and enable resource utilization of 63% when executing O(16K), 1-core tasks over 404 compute nodes.

Original language: English
Title of host publication: Proceedings of IPDRM 2019
Subtitle of host publication: 3rd Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 18-25
Number of pages: 8
ISBN (Electronic): 9781728159935
State: Published - Nov 2019
Event: 3rd IEEE/ACM Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, IPDRM 2019 - Denver, United States
Duration: Nov 22 2019 → …

Publication series

Name: Proceedings of IPDRM 2019: 3rd Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 3rd IEEE/ACM Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, IPDRM 2019
Country/Territory: United States
City: Denver
Period: 11/22/19 → …

Funding

We would like to thank other members of the PMIx community, and Ralph Castain in particular, for the excellent work that we build upon. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. At Rutgers, this work was also supported by NSF “CAREER” ACI-1253644, RADICAL-Cybertools NSF 1440677 and 1931512, and DOE Award DE-SC0016280. We also acknowledge DOE INCITE awards for allocations on Summit.

Keywords

  • Data Vortex
  • irregular application
  • high-performance computing
