TY - JOUR
T1 - Symmetric active/active high availability for high-performance computing system services
AU - Engelmann, Christian
AU - Scott, Stephen L.
AU - Leangsuksun, Chokchai
AU - He, Xubin
PY - 2006
Y1 - 2006
N2 - This work aims to pave the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes represent single points of failure and control for an entire HPC system as they render it inaccessible and unmanageable in case of a failure until repair. The presented approach introduces two distinct replication methods, internal and external, for providing symmetric active/active high availability for multiple redundant head and service nodes running in virtual synchrony utilizing an existing process group communication system for service group membership management and reliable, totally ordered message delivery. Presented results of a prototype implementation that offers symmetric active/active replication for HPC job and resource management using external replication show that the highest level of availability can be provided with an acceptable performance trade-off.
AB - This work aims to pave the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes represent single points of failure and control for an entire HPC system as they render it inaccessible and unmanageable in case of a failure until repair. The presented approach introduces two distinct replication methods, internal and external, for providing symmetric active/active high availability for multiple redundant head and service nodes running in virtual synchrony utilizing an existing process group communication system for service group membership management and reliable, totally ordered message delivery. Presented results of a prototype implementation that offers symmetric active/active replication for HPC job and resource management using external replication show that the highest level of availability can be provided with an acceptable performance trade-off.
KW - Group communication
KW - High availability
KW - High-performance computing
KW - Virtual synchrony
UR - http://www.scopus.com/inward/record.url?scp=34548190800&partnerID=8YFLogxK
U2 - 10.4304/jcp.1.8.43-54
DO - 10.4304/jcp.1.8.43-54
M3 - Article
AN - SCOPUS:34548190800
SN - 1796-203X
VL - 1
SP - 43
EP - 54
JO - Journal of Computers
JF - Journal of Computers
IS - 8
ER -