TY - GEN
T1 - Modeling the impact of checkpoints on next-generation systems
AU - Oldfield, Ron A.
AU - Teller, Patricia J.
AU - Varela, Maria Ruiz
AU - Arunagiri, Sarala
AU - Seelam, Seetharami
AU - Riesen, Rolf
AU - Roth, Philip C.
PY - 2007
Y1 - 2007
N2 - The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state of the art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound on the impact of these approaches on the performance of applications executed on three massive-scale, in-production, DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to fault tolerance that allow continuous computing with minimal impact on application scalability.
AB - The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state of the art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound on the impact of these approaches on the performance of applications executed on three massive-scale, in-production, DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to fault tolerance that allow continuous computing with minimal impact on application scalability.
UR - http://www.scopus.com/inward/record.url?scp=47249142074&partnerID=8YFLogxK
U2 - 10.1109/MSST.2007.4367962
DO - 10.1109/MSST.2007.4367962
M3 - Conference contribution
AN - SCOPUS:47249142074
SN - 0769530257
SN - 9780769530253
T3 - Proceedings - 24th IEEE Conference on Mass Storage Systems and Technologies, MSST 2007
SP - 30
EP - 43
BT - Proceedings - 24th IEEE Conference on Mass Storage Systems and Technologies, MSST 2007
T2 - 24th IEEE Conference on Mass Storage Systems and Technologies, MSST 2007
Y2 - 24 September 2007 through 27 September 2007
ER -