TY - GEN
T1 - Proposal of MPI operation level checkpoint/rollback and one implementation
AU - Tang, Yuan
AU - Fagg, Graham E.
AU - Dongarra, Jack J.
PY - 2006
Y1 - 2006
AB - With the increasing number of processors in modern HPC (High Performance Computing) systems, there are two pressing problems to solve: one is scalability, the other is fault tolerance. In our previous work, we extended the MPI specification's handling of fault tolerance by specifying a systematic framework for the recovery methods, communicator and message modes, etc., that define the behavior of MPI in case an error occurs. These extensions not only specify how the MPI library implementation and the RTE (Run Time Environment) handle failures at the system level, but also provide HPC application developers with various recovery choices of varying performance and cost. In this paper, we continue extending MPI's capability in this direction. First, we propose an MPI operation level checkpoint/rollback library to recover the user's data. More importantly, we argue that the programming model of a future generation fault tolerant MPI application should be recover-and-continue rather than the more traditional stop-and-restart model. Recover-and-continue means that when an error occurs, only the failed processes are re-spawned; all the remaining live processes stay on their original processors with their data in memory. The main benefits of recover-and-continue are a much lower cost of system recovery and the opportunity to employ in-memory checkpoint/rollback techniques. Compared with stable-storage or local-disk techniques, which are the only choices for stop-and-restart, the in-memory approach significantly reduces the performance penalty of checkpoint/rollback. Additionally, it makes it possible to establish a concurrent, multiple-level checkpoint/rollback framework. As our work progresses, a picture of the hierarchy of a future generation fault tolerant HPC system will gradually be unveiled.
UR - http://www.scopus.com/inward/record.url?scp=33751116572&partnerID=8YFLogxK
U2 - 10.1109/CCGRID.2006.81
DO - 10.1109/CCGRID.2006.81
M3 - Conference contribution
AN - SCOPUS:33751116572
SN - 0769525857
SN - 9780769525853
T3 - Sixth IEEE International Symposium on Cluster Computing and the Grid, 2006. CCGRID 06
SP - 27
EP - 34
BT - Sixth IEEE International Symposium on Cluster Computing and the Grid
T2 - 6th IEEE International Symposium on Cluster Computing and the Grid, 2006. CCGRID 06
Y2 - 16 May 2006 through 19 May 2006
ER -