TY - GEN
T1 - Scalable and fault tolerant failure detection and consensus
AU - Katti, Amogh
AU - Di Fatta, Giuseppe
AU - Naughton, Thomas
AU - Engelmann, Christian
N1 - Publisher Copyright:
© 2015 ACM.
PY - 2015/9/21
Y1 - 2015/9/21
N2 - Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault-tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and perfect synchronization in achieving global consensus.
AB - Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault-tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and perfect synchronization in achieving global consensus.
KW - Consensus
KW - Failure detection
KW - Fault-tolerant MPI
KW - Gossip protocols
KW - User-level failure mitigation
UR - http://www.scopus.com/inward/record.url?scp=84983398159&partnerID=8YFLogxK
U2 - 10.1145/2802658.2802660
DO - 10.1145/2802658.2802660
M3 - Conference contribution
AN - SCOPUS:84983398159
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI 2015
PB - Association for Computing Machinery
T2 - 22nd European MPI Users' Group Meeting, EuroMPI 2015
Y2 - 21 September 2015 through 23 September 2015
ER -