TY - GEN
T1 - Job-site level fault tolerance for cluster and grid environments
AU - Limaye, Kshitij
AU - Leangsuksun, Box
AU - Greenwood, Zeno
AU - Scott, Stephen L.
AU - Engelmann, Christian
AU - Libby, Richard
AU - Chanchio, Kasidit
PY - 2005
Y1 - 2005
AB - Fault tolerance is a necessity if high performance clusters and grid computing are to be adopted for mission critical applications. In distributed systems, fault tolerance is commonly achieved through checkpoint-recovery or through job replication on alternative resources in case of a system outage. The former approach depends on the system's mean time to repair (MTTR), while the latter depends on the availability of alternative sites to run replicas. There is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. This paper discusses a novel fault tolerance technique that enables job-site recovery in Beowulf cluster-based grid environments, whereas existing techniques give up on a failed system and seek alternative resources. Results from an implementation of our method in Globus-enabled HA-OSCAR suggest a sizable aggregate performance improvement. The technique, called "Smart Failover", provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically and upon critical system events. Thus, whenever a failover occurs, the backup server is able to restart the jobs from their last saved state.
UR - http://www.scopus.com/inward/record.url?scp=50249144002&partnerID=8YFLogxK
U2 - 10.1109/CLUSTR.2005.347043
DO - 10.1109/CLUSTR.2005.347043
M3 - Conference contribution
AN - SCOPUS:50249144002
SN - 0780394852
SN - 9780780394858
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
BT - 2005 IEEE International Conference on Cluster Computing, CLUSTER
T2 - 2005 IEEE International Conference on Cluster Computing, CLUSTER
Y2 - 27 September 2005 through 30 September 2005
ER -