Super-scalable algorithms for computing on 100,000 processors

Research output: Contribution to journal, Conference article, peer-review

42 Scopus citations

Abstract

In the next five years, the number of processors in high-end systems for scientific computing is expected to rise to tens and even hundreds of thousands. For example, the IBM BlueGene/L can have up to 128,000 processors, and delivery of the first system is scheduled for 2005. Existing deficiencies in the scalability and fault tolerance of scientific applications need to be addressed soon. If the number of processors grows by an order of magnitude and efficiency drops by an order of magnitude, the overall effective computing performance stays the same. Furthermore, the mean time to interrupt of high-end computer systems decreases with scale and complexity. In a 100,000-processor system, failures may occur every couple of minutes, and traditional checkpointing may no longer be feasible. In this paper, we summarize our recent research in super-scalable algorithms for computing on 100,000 processors. We introduce the algorithm properties of scale invariance and natural fault tolerance, and discuss how they can be applied to two different classes of algorithms. We also describe a super-scalable diskless checkpointing algorithm for problems that cannot be transformed into a super-scalable variant, or where other solutions are more efficient. Finally, a 100,000-processor simulator is presented as a platform for testing and experimentation.
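The two scaling arguments in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names, the efficiency values, and the per-node mean time to failure are all assumed figures, and the interrupt estimate further assumes that node failures are independent and exponentially distributed.

```python
def effective_performance(processors: int, efficiency: float) -> float:
    """Effective performance in processor equivalents: count times parallel efficiency."""
    return processors * efficiency


def system_mtti_minutes(node_mttf_minutes: float, processors: int) -> float:
    """System mean time to interrupt, assuming independent, exponentially
    distributed node failures (an assumption made for this sketch)."""
    return node_mttf_minutes / processors


if __name__ == "__main__":
    # Hypothetical efficiencies: ten times the processors at one tenth the efficiency.
    small = effective_performance(10_000, 0.5)
    large = effective_performance(100_000, 0.05)
    print(f"effective performance: {small:.0f} vs {large:.0f} processor equivalents")

    # Hypothetical per-node mean time to failure of 200,000 minutes (about 139 days).
    mtti = system_mtti_minutes(200_000, 100_000)
    print(f"system mean time to interrupt: {mtti:.1f} minutes")
```

Under these assumed numbers, the larger machine delivers no more useful work than the smaller one, and an interrupt is expected every few minutes, which is shorter than the time a conventional disk-based checkpoint of a large application would typically need.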

Original language: English
Pages (from-to): 313-321
Number of pages: 9
Journal: Lecture Notes in Computer Science
Volume: 3514
Issue number: I
State: Published - 2005
Event: 5th International Conference on Computational Science - ICCS 2005 - Atlanta, GA, United States
Duration: May 22 2005 to May 25 2005

Funding

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Funders (funder number)
U.S. Department of Energy (DE-AC05-00OR22725)
Oak Ridge National Laboratory
Laboratory Directed Research and Development
UT-Battelle
