Abstract
Memory failures in future extreme-scale applications are a significant concern in the high-performance computing community and have attracted much research attention. We contend in this paper that using application checkpoint data to detect memory failures has potential benefits and is preferable to examining application memory. To support this contention, we describe the application of machine learning techniques to evaluate the veracity of checkpoint data. Our preliminary results indicate that supervised decision tree machine learning approaches can effectively detect corruption in restart files, suggesting that future extreme-scale applications and systems may benefit from incorporating such approaches to cope with memory failures.
Original language | English |
---|---|
Title of host publication | Euro-Par 2015 |
Subtitle of host publication | Parallel Processing Workshops - Euro-Par 2015 International Workshops, Revised Selected Papers |
Editors | Sascha Hunold, Josef Weidendorfer, Domingo Gimenez, Laura Ricci, Stefan Lankes, Alexandru Costan, Ana Lucia Varbanescu, Stephen L. Scott, María Engracia Gómez Requena, Vittorio Scarano, Alexandru Iosup, Michael Alexander |
Publisher | Springer Verlag |
Pages | 669-681 |
Number of pages | 13 |
ISBN (Print) | 9783319273075 |
State | Published - 2015 |
Event | International Workshops on Parallel Processing Workshops, Euro-Par 2015 - Vienna, Austria; Duration: Aug 24 2015 → Aug 25 2015 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 9523 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | International Workshops on Parallel Processing Workshops, Euro-Par 2015 |
---|---|
Country/Territory | Austria |
City | Vienna |
Period | 08/24/15 → 08/25/15 |
Funding
P. M. Widener, K. B. Ferreira, N. Fabian: Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly-owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.