TY - GEN
T1 - JOSHUA
T2 - 2006 IEEE International Conference on Cluster Computing, Cluster 2006
AU - Uhlemann, K.
AU - Engelmann, C.
AU - Scott, S. L.
PY - 2006
Y1 - 2006
N2 - Most of today's HPC systems employ a single head node for control, which represents a single point of failure as it interrupts an entire HPC system upon failure. Furthermore, it is also a single point of control as it disables an entire HPC system until repair. One of the most important HPC system services running on the head node is job and resource management. If it goes down, all currently running jobs lose the service they report back to. They have to be restarted once the head node is up and running again. In this paper, we present a generic approach for providing symmetric active/active replication for highly available HPC job and resource management. The JOSHUA solution provides a virtually synchronous environment for continuous availability without any interruption of service and without any loss of state. Replication is performed externally via the PBS service interface without the need to modify any service code. Test results as well as an availability analysis of our proof-of-concept prototype implementation show that continuous availability can be provided by JOSHUA with an acceptable performance trade-off.
AB - Most of today's HPC systems employ a single head node for control, which represents a single point of failure as it interrupts an entire HPC system upon failure. Furthermore, it is also a single point of control as it disables an entire HPC system until repair. One of the most important HPC system services running on the head node is job and resource management. If it goes down, all currently running jobs lose the service they report back to. They have to be restarted once the head node is up and running again. In this paper, we present a generic approach for providing symmetric active/active replication for highly available HPC job and resource management. The JOSHUA solution provides a virtually synchronous environment for continuous availability without any interruption of service and without any loss of state. Replication is performed externally via the PBS service interface without the need to modify any service code. Test results as well as an availability analysis of our proof-of-concept prototype implementation show that continuous availability can be provided by JOSHUA with an acceptable performance trade-off.
UR - http://www.scopus.com/inward/record.url?scp=46049083585&partnerID=8YFLogxK
U2 - 10.1109/CLUSTR.2006.311855
DO - 10.1109/CLUSTR.2006.311855
M3 - Conference contribution
AN - SCOPUS:46049083585
SN - 1424403286
SN - 9781424403288
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
BT - 2006 IEEE International Conference on Cluster Computing, Cluster 2006
Y2 - 25 September 2006 through 28 September 2006
ER -