Canaries in a coal mine: Using application-level checkpoints to detect memory failures

Patrick M. Widener, Kurt B. Ferreira, Scott Levy, Nathan Fabian

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Memory failures in future extreme-scale applications are a significant concern in the high-performance computing community and have attracted much research attention. We contend in this paper that using application checkpoint data to detect memory failures has potential benefits and is preferable to examining application memory. To support this contention, we describe the application of machine learning techniques to evaluate the veracity of checkpoint data. Our preliminary results indicate that supervised decision tree machine learning approaches can effectively detect corruption in restart files, suggesting that future extreme-scale applications and systems may benefit from incorporating such approaches in order to cope with memory failures.

Original language: English
Title of host publication: Euro-Par 2015
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2015 International Workshops, Revised Selected Papers
Editors: Sascha Hunold, Josef Weidendorfer, Domingo Gimenez, Laura Ricci, Stefan Lankes, Alexandru Costan, Ana Lucia Varbanescu, Stephen L. Scott, María Engracia Gómez Requena, Vittorio Scarano, Alexandru Iosup, Michael Alexander
Publisher: Springer Verlag
Pages: 669-681
Number of pages: 13
ISBN (Print): 9783319273075
State: Published - 2015
Externally published: Yes
Event: International Workshops on Parallel Processing Workshops, Euro-Par 2015 - Vienna, Austria
Duration: Aug 24 2015 → Aug 25 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9523
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: International Workshops on Parallel Processing Workshops, Euro-Par 2015
Country/Territory: Austria
City: Vienna
Period: 08/24/15 → 08/25/15

Funding

P. M. Widener, K. B. Ferreira, N. Fabian: Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly-owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Funders                                      Funder number
U.S. Department of Energy
National Nuclear Security Administration     DE-AC04-94AL85000
