TY - GEN
T1 - Fault tolerance for OpenSHMEM
AU - Hao, Pengfei
AU - Pophale, Swaroop
AU - Shamis, Pavel
AU - Welch, Aaron
AU - Chapman, Barbara
AU - Venkata, Manjunath Gorentla
AU - Poole, Stephen
PY - 2014/10/6
Y1 - 2014/10/6
N2 - On today's supercomputing systems, faults are becoming the norm rather than the exception. Given the complexity required to achieve the expected scalability and performance on future systems, this situation is expected to worsen. These systems are expected to function in a nearly constant presence of faults. To be productive on these systems, programming models will require both hardware and software to be resilient to faults. With the growing importance of the PGAS programming model and OpenSHMEM as part of the HPC software stack, the lack of a fault-tolerance model may become a liability for its users. To this end, we discuss the viability of using checkpoint/restart as a fault-tolerance method for OpenSHMEM, propose a selective checkpoint/restart fault-tolerance model, and discuss the challenges associated with implementing the proposed model. Copyright is held by the owner/author(s). Publication rights licensed to ACM.
UR - http://www.scopus.com/inward/record.url?scp=84939246833&partnerID=8YFLogxK
U2 - 10.1145/2676870.2676894
DO - 10.1145/2676870.2676894
M3 - Conference contribution
AN - SCOPUS:84939246833
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, PGAS 2014
A2 - Couture, Nadine
A2 - Broman, David
A2 - Bastien, Christian
A2 - Dorta, Tomas
A2 - Pepper, Peter
PB - Association for Computing Machinery
T2 - 8th International Conference on Partitioned Global Address Space Programming Models, PGAS 2014
Y2 - 6 October 2014 through 10 October 2014
ER -