TY - GEN
T1 - Evaluation of fault-tolerant policies using simulation
AU - Tikotekar, Anand
AU - Vallée, Geoffroy
AU - Naughton, Thomas
AU - Scott, Stephen L.
AU - Leangsuksun, Chokchai
PY - 2007
Y1 - 2007
N2 - Various fault-tolerance (FT) mechanisms are used today to reduce the impact of failures on application execution. In the case of system failure, the standard FT mechanisms are checkpoint/restart (for reactive FT) and migration (for pro-active FT). However, each of these mechanisms creates overhead on application execution, and this overhead becomes critical on large-scale systems, where previous studies have shown that applications may spend more time checkpointing state than performing useful work. To decrease this overhead, researchers try both to optimize existing FT mechanisms and to implement new FT policies, for instance by combining reactive and pro-active approaches to decrease the number of checkpoints that must be performed during the application's execution. However, no solutions currently exist that enable the evaluation of these FT approaches through simulation; instead, experiments must be done on real platforms. This increases complexity and limits experimentation with alternative solutions. This paper presents a simulation framework that evaluates different FT mechanisms and policies. The framework uses system failure logs for the simulation, with a default behavior based on logs taken from the ASCI White at Lawrence Livermore National Laboratory. We evaluate the accuracy of our simulator by comparing simulated results with those taken from experiments done on a 32-node compute cluster. Such a simulator can therefore be used to develop new FT policies and/or to tune existing policies.
AB - Various fault-tolerance (FT) mechanisms are used today to reduce the impact of failures on application execution. In the case of system failure, the standard FT mechanisms are checkpoint/restart (for reactive FT) and migration (for pro-active FT). However, each of these mechanisms creates overhead on application execution, and this overhead becomes critical on large-scale systems, where previous studies have shown that applications may spend more time checkpointing state than performing useful work. To decrease this overhead, researchers try both to optimize existing FT mechanisms and to implement new FT policies, for instance by combining reactive and pro-active approaches to decrease the number of checkpoints that must be performed during the application's execution. However, no solutions currently exist that enable the evaluation of these FT approaches through simulation; instead, experiments must be done on real platforms. This increases complexity and limits experimentation with alternative solutions. This paper presents a simulation framework that evaluates different FT mechanisms and policies. The framework uses system failure logs for the simulation, with a default behavior based on logs taken from the ASCI White at Lawrence Livermore National Laboratory. We evaluate the accuracy of our simulator by comparing simulated results with those taken from experiments done on a 32-node compute cluster. Such a simulator can therefore be used to develop new FT policies and/or to tune existing policies.
UR - http://www.scopus.com/inward/record.url?scp=53349098075&partnerID=8YFLogxK
U2 - 10.1109/CLUSTR.2007.4629244
DO - 10.1109/CLUSTR.2007.4629244
M3 - Conference contribution
AN - SCOPUS:53349098075
SN - 1424413885
SN - 9781424413881
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 303
EP - 311
BT - Proceedings - 2007 IEEE International Conference on Cluster Computing, CLUSTER 2007
T2 - 2007 IEEE International Conference on Cluster Computing, CLUSTER 2007
Y2 - 19 September 2007 through 20 September 2007
ER -