Proposal of MPI operation level checkpoint/rollback and one implementation

Tang Yuan, Graham E. Fagg, Jack J. Dongarra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

As the number of processors in modern HPC (High Performance Computing) systems grows, two problems emerge: scalability and fault tolerance. In our previous work, we extended the MPI specification for fault tolerance by specifying a systematic framework of recovery methods, communicator and message modes, etc. that defines the behavior of MPI when an error occurs. These extensions not only specify how the MPI library implementation and the RTE (Run Time Environment) handle failures at the system level, but also give HPC application developers a range of recovery choices with varying performance and cost. In this paper, we continue extending MPI's capability in this direction. First, we propose an MPI operation level checkpoint/rollback library to recover the user's data. More importantly, we argue that the programming model for future fault tolerant MPI applications should be recover-and-continue rather than the more traditional stop-and-restart. Recover-and-continue means that when an error occurs, only the failed processes are re-spawned; all surviving processes keep their original processor mapping and remain in memory. The main benefits of recover-and-continue are a much lower system recovery cost and the opportunity to employ in-memory checkpoint/rollback techniques. Compared with stable storage or local disk techniques, which are the only choices for stop-and-restart, the in-memory approach significantly reduces the performance penalty of checkpoint/rollback. Additionally, it makes it possible to establish a concurrent multiple level checkpoint/rollback framework. As our work progresses, a picture of the hierarchy of future generation fault tolerant HPC systems will gradually be unveiled.

Original language: English
Title of host publication: Sixth IEEE International Symposium on Cluster Computing and the Grid
Subtitle of host publication: Spanning the World and Beyond, 2006. CCGRID 06
Pages: 27-34
Number of pages: 8
State: Published - 2006
Externally published: Yes
Event: 6th IEEE International Symposium on Cluster Computing and the Grid, 2006. CCGRID 06 - Singapore
Duration: May 16 2006 → May 19 2006

Publication series

Name: Sixth IEEE International Symposium on Cluster Computing and the Grid, 2006. CCGRID 06

Conference

Conference: 6th IEEE International Symposium on Cluster Computing and the Grid, 2006. CCGRID 06
Country/Territory: Singapore
Period: 05/16/06 → 05/19/06
