Active/active replication for highly available HPC system services

C. Engelmann, S. L. Scott, C. Leangsuksun, X. He

Research output: Chapter in Book/Report/Conference proceeding | Conference contribution | peer-review

13 Scopus citations

Abstract

Today's high-performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system: when one fails, the system is inaccessible and unmanageable until repair, causing significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It compares both methods in terms of expected correctness, ease of use, and performance, based on early results from ongoing work in providing symmetric active/active high availability for two HPC system services (TORQUE and the PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. The paper concludes with an overview of past and ongoing work and a short summary of the presented research.

Original language: English
Title of host publication: Proceedings - First International Conference on Availability, Reliability and Security, ARES 2006
Pages: 639-645
Number of pages: 7
State: Published - 2006
Event: 1st International Conference on Availability, Reliability and Security, ARES 2006 - Vienna, Austria
Duration: Apr 20 2006 to Apr 22 2006

Publication series

Name: Proceedings - First International Conference on Availability, Reliability and Security, ARES 2006
Volume: 2006

Conference

Conference: 1st International Conference on Availability, Reliability and Security, ARES 2006
Country/Territory: Austria
City: Vienna
Period: 04/20/06 to 04/22/06
