Abstract
The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm, and expose its logarithmic behavior, which is an extremely desirable property for any algorithm which targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault tolerant scientific applications.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of SC 2015 |
| Subtitle of host publication | The International Conference for High Performance Computing, Networking, Storage and Analysis |
| Publisher | IEEE Computer Society |
| ISBN (Electronic) | 9781450337236 |
| State | Published - Nov 15 2015 |
| Event | International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 - Austin, United States. Duration: Nov 15 2015 → Nov 20 2015 |
Publication series
| Name | International Conference for High Performance Computing, Networking, Storage and Analysis, SC |
|---|---|
| Volume | 15-20-November-2015 |
| ISSN (Print) | 2167-4329 |
| ISSN (Electronic) | 2167-4337 |
Conference
| Conference | International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015 |
|---|---|
| Country/Territory | United States |
| City | Austin |
| Period | 11/15/15 → 11/20/15 |
Funding
The authors would like to thank Robert Clay, Michael Heroux and Josep Gamell for interesting discussions related to this work. This work is partially supported by the NSF (award #1339820), and the CREST project of the Japan Science and Technology Agency (JST). This work is also partially supported by the U.S. Department of Energy (DOE) National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Keywords
- MPI
- agreement
- fault-tolerance
Fingerprint
Dive into the research topics of 'Practical scalable consensus for pseudo-synchronous distributed systems'. Together they form a unique fingerprint.