Abstract
In this paper, we develop a predictive model of output performance for supercomputer file systems under production load. Our target environment is Titan, the third-fastest supercomputer in the world, and its Lustre-based multi-stage write path. We observe on Titan that although output performance is highly variable at small time scales, mean performance is stable and consistent over typical application run times. Moreover, we find that output performance is related non-linearly to its correlated parameters because of interference and saturation at individual stages of the write path. These observations enable us to build a model that predicts expected write times for given output patterns and I/O configurations, using feature transformations to capture the non-linear relationships. We identify candidate features from the structure of the Lustre/Titan write path and apply feature transformation functions to produce a model space of 135,000 candidate models. By searching this space for the model with the minimal mean squared error, we identify a good model and show that it is effective.
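The model search summarized above can be pictured as an exhaustive loop: apply one candidate transformation to each feature, fit an ordinary least-squares regression on the transformed features, and keep the candidate with the lowest mean squared error. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the transformation set (`identity`, `log`, `sqrt`, `square`), the use of two generic positive-valued features, and the helper names `solve`, `fit_ols`, `mse`, and `search` are all hypothetical, and because the abstract does not say whether error is measured on training or held-out data, the sketch simply uses training error.

```python
import itertools
import math

# Hypothetical transformation set; the abstract does not list the paper's actual functions.
# All of these assume strictly positive feature values (log requires x > 0).
TRANSFORMS = {
    "identity": lambda x: x,
    "log": lambda x: math.log(x),
    "sqrt": lambda x: math.sqrt(x),
    "square": lambda x: x * x,
}


def solve(a, b):
    """Solve the square system a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]  # raises ZeroDivisionError if singular
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least squares with an intercept, via the normal equations X^T X w = X^T y."""
    x = [[1.0] + list(r) for r in rows]
    p = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(p)] for i in range(p)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)


def mse(rows, y, w):
    """Mean squared error of the linear model w (intercept first) on the given rows."""
    preds = [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def search(samples, y):
    """Try every assignment of one transform per feature and return (mse, combo, weights)
    for the lowest-error model. With k features and T transforms this visits T**k models."""
    k = len(samples[0])
    best = None
    for combo in itertools.product(TRANSFORMS, repeat=k):
        rows = [[TRANSFORMS[t](v) for t, v in zip(combo, s)] for s in samples]
        try:
            w = fit_ols(rows, y)
        except ZeroDivisionError:
            continue  # skip candidates whose design matrix is singular
        err = mse(rows, y, w)
        if best is None or err < best[0]:
            best = (err, combo, w)
    return best
```

For example, with synthetic write times generated as `y = 2 + 3*log(x1) + 0.5*x2**2` over positive feature values, `search` should select `("log", "square")` with near-zero error. How the paper's feature and transformation choices yield exactly 135,000 candidates is not detailed in the abstract. A practical version would more likely use a library solver such as `numpy.linalg.lstsq` and score candidates on held-out runs to avoid favoring models that overfit the training data.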
Original language | English |
---|---|
Title of host publication | HPDC 2017 - Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing |
Publisher | Association for Computing Machinery, Inc |
Pages | 181-192 |
Number of pages | 12 |
ISBN (Electronic) | 9781450346993 |
DOIs | |
State | Published - Jun 26 2017 |
Event | 26th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2017 - Washington, United States. Duration: Jun 26 2017 → Jun 30 2017 |
Publication series
Name | HPDC 2017 - Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing |
---|---|
Conference
Conference | 26th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2017 |
---|---|
Country/Territory | United States |
City | Washington |
Period | 06/26/17 → 06/30/17 |
Funding
This work was supported by the U.S. Department of Energy, under FWP 16-018666, program manager Lucy Nowell. The work used resources of the Oak Ridge Leadership Computing Facility, located in the National Center for Computational Sciences at the Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Keywords
- Linear regression
- Output performance
- Petascale supercomputer