TY - GEN
T1 - The case for modular redundancy in large-scale high performance computing systems
AU - Engelmann, Christian
AU - Ong, Hong
AU - Scott, Stephen L.
PY - 2009
Y1 - 2009
AB - Recent investigations into the resilience of large-scale high-performance computing (HPC) systems have shown a continuous trend of decreasing reliability and availability. Newly installed systems have a lower mean time to failure (MTTF) and a higher mean time to recover (MTTR) than their predecessors. Modular redundancy is used today in many mission-critical systems, such as aerospace and command-and-control systems, to provide resilience. The primary argument against modular redundancy for resilience in HPC has always been that the capability of an HPC system, and the corresponding return on investment, would be significantly reduced. We argue that modular redundancy can significantly increase compute node availability because it removes the impact of scale from single compute node MTTR. We further argue that single compute nodes can be much less reliable, and therefore less expensive, and still be highly available, if their MTTR/MTTF ratio is maintained.
KW - Fault tolerance
KW - High availability
KW - High-performance computing
KW - Modular redundancy
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=74549140832&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:74549140832
SN - 9780889867840
T3 - Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2009
SP - 189
EP - 194
BT - Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2009
T2 - IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2009
Y2 - 16 February 2009 through 18 February 2009
ER -