TY - GEN
T1 - Active/active replication for highly available HPC system services
AU - Engelmann, C.
AU - Scott, S. L.
AU - Leangsuksun, C.
AU - He, X.
PY - 2006
Y1 - 2006
N2 - Today's high performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system, as a failure renders the system inaccessible and unmanageable until repair, causing significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It presents a comparison of both methods in terms of expected correctness, ease of use, and performance based on early results from ongoing work in providing symmetric active/active high availability for two HPC system services (TORQUE and PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. This paper concludes with an overview of past and ongoing work, and a short summary of the presented research.
AB - Today's high performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system, as a failure renders the system inaccessible and unmanageable until repair, causing significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It presents a comparison of both methods in terms of expected correctness, ease of use, and performance based on early results from ongoing work in providing symmetric active/active high availability for two HPC system services (TORQUE and PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. This paper concludes with an overview of past and ongoing work, and a short summary of the presented research.
UR - http://www.scopus.com/inward/record.url?scp=33750954729&partnerID=8YFLogxK
U2 - 10.1109/ARES.2006.23
DO - 10.1109/ARES.2006.23
M3 - Conference contribution
AN - SCOPUS:33750954729
SN - 0769525679
SN - 9780769525679
T3 - Proceedings - First International Conference on Availability, Reliability and Security, ARES 2006
SP - 639
EP - 645
BT - Proceedings - First International Conference on Availability, Reliability and Security, ARES 2006
T2 - 1st International Conference on Availability, Reliability and Security, ARES 2006
Y2 - 20 April 2006 through 22 April 2006
ER -