Abstract
In high-performance computing, there is a perpetual hunt for performance and scalability. Supercomputers grow larger offering improved computational science throughput. Nevertheless, with an increase in the number of systems’ components and their interactions, the number of failures and the power consumption will increase rapidly. Energy and reliability are among the most challenging issues that need to be addressed for extreme scale computing. We develop analytical models for run time and energy usage for multilevel fault-tolerance schemes. We use these models to study the tradeoff between run time and energy in FTI, a recently developed multilevel checkpoint library, on an IBM Blue Gene/Q. Our results show that energy consumed by FTI is low and the tradeoff between the run time and energy is small. Using the analytical models, we explore the impact of various system-level parameters on run time and energy tradeoffs.
Original language | English |
---|---|
Title of host publication | High Performance Computing Systems |
Subtitle of host publication | Performance Modeling, Benchmarking, and Simulation - 5th International Workshop, PMBS 2014, Revised Selected Papers |
Editors | Simon D. Hammond, Stephen A. Jarvis, Steven A. Wright |
Publisher | Springer Verlag |
Pages | 249-263 |
Number of pages | 15 |
ISBN (Print) | 9783319172477 |
DOIs | |
State | Published - 2015 |
Externally published | Yes |
Event | 5th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems, PMBS 2014 - New Orleans, United States Duration: Nov 16 2014 → Nov 16 2014 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 8966 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 5th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems, PMBS 2014 |
---|---|
Country/Territory | United States |
City | New Orleans |
Period | 11/16/14 → 11/16/14 |
Funding
This work was supported by the SciDAC and X-Stack activities within the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program under contract number DE-AC02-06CH11357.