The malthusian catastrophe is upon us! Are the largest HPC machines ever up?

Patricia Kovatch, Matthew Ezell, Ryan Braby

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

    6 Scopus citations

    Abstract

    Thomas Malthus, an English political economist who lived from 1766 to 1834, predicted that the earth's population would be limited by starvation since population growth increases geometrically and the food supply only grows linearly. He said, "the power of population is indefinitely greater than the power in the earth to provide subsistence for man," thus defining the Malthusian Catastrophe. There is a parallel between this prediction and the conventional wisdom regarding super-large machines: application problem size and machine complexity are growing geometrically, yet mitigation techniques are only improving linearly. To examine whether the largest machines are usable, the authors collected and examined component failure rates and Mean Time Between System Failure data from the world's largest production machines, including Oak Ridge National Laboratory's Jaguar and the University of Tennessee's Kraken. The authors also collected MTBF data for a variety of Cray XT series machines from around the world, representing over 6 Petaflops of compute power. An analysis of the data is provided as well as plans for future work. High performance computing's Malthusian Catastrophe hasn't happened yet, and advances in system resiliency should keep this problem at bay for many years to come.

    Original language: English
    Title of host publication: Euro-Par 2011
    Subtitle of host publication: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Revised Selected Papers
    Publisher: Springer Verlag
    Pages: 211-220
    Number of pages: 10
    Edition: PART 2
    ISBN (Print): 9783642297397
    State: Published - 2012
    Event: 17th Parallel Processing Workshops, Euro-Par 2011: CCPI 2011, CGWS 2011, HeteroPar 2011, HiBB 2011, HPCVirt 2011, HPPC 2011, HPSS 2011, MDGS 2011, ProPer 2011, Resilience 2011, UCHPC 2011, VHPC 2011 - Bordeaux, France
    Duration: Aug 29 2011 to Sep 2 2011

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Number: PART 2
    Volume: 7156 LNCS
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 17th Parallel Processing Workshops, Euro-Par 2011: CCPI 2011, CGWS 2011, HeteroPar 2011, HiBB 2011, HPCVirt 2011, HPPC 2011, HPSS 2011, MDGS 2011, ProPer 2011, Resilience 2011, UCHPC 2011, VHPC 2011
    Country/Territory: France
    City: Bordeaux
    Period: 08/29/11 to 09/2/11

    Keywords

    • MTBF
    • failures
    • high performance computing
    • resiliency
    • scalability
