TY - GEN
T1 - Self adaptive application level fault tolerance for parallel and distributed computing
AU - Chen, Zizhong
AU - Yang, Ming
AU - Francia, Guillermo
AU - Dongarra, Jack
PY - 2007
Y1 - 2007
N2 - Most application level fault tolerance schemes in the literature are non-adaptive in the sense that the fault tolerance schemes incorporated in applications are usually designed without incorporating information from system environments such as the amount of available memory and the local or network I/O bandwidth. However, from an application point of view, it is often desirable for fault tolerant high performance applications to be able to achieve high performance under whatever system environment they execute in, with as low a fault tolerance overhead as possible. In this paper, we demonstrate that, in order to achieve high reliability with as low a performance penalty as possible, fault tolerant schemes in applications need to be able to adapt themselves to different system environments. We propose a framework under which different fault tolerant schemes can be incorporated in applications using an adaptive method. Under this framework, applications are able to choose near optimal fault tolerance schemes at run time according to the specific characteristics of the platform on which the application is executing.
AB - Most application level fault tolerance schemes in the literature are non-adaptive in the sense that the fault tolerance schemes incorporated in applications are usually designed without incorporating information from system environments such as the amount of available memory and the local or network I/O bandwidth. However, from an application point of view, it is often desirable for fault tolerant high performance applications to be able to achieve high performance under whatever system environment they execute in, with as low a fault tolerance overhead as possible. In this paper, we demonstrate that, in order to achieve high reliability with as low a performance penalty as possible, fault tolerant schemes in applications need to be able to adapt themselves to different system environments. We propose a framework under which different fault tolerant schemes can be incorporated in applications using an adaptive method. Under this framework, applications are able to choose near optimal fault tolerance schemes at run time according to the specific characteristics of the platform on which the application is executing.
UR - http://www.scopus.com/inward/record.url?scp=34548785754&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2007.370604
DO - 10.1109/IPDPS.2007.370604
M3 - Conference contribution
AN - SCOPUS:34548785754
SN - 1424409101
SN - 9781424409105
T3 - Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM
BT - Proceedings - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007; Abstracts and CD-ROM
T2 - 21st International Parallel and Distributed Processing Symposium, IPDPS 2007
Y2 - 26 March 2007 through 30 March 2007
ER -