Abstract
As supercomputers enter an era of massive parallelism in which the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequences of failures for MPI communications. Advanced fault-tolerance techniques have the potential to prevent full-scale application restart and therefore lower the cost incurred for each failure, but they demand from MPI the capability to detect failures and resume communications afterward. In this paper, we present a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed. The motivations behind the design choices are weighed against alternatives, a task that requires simultaneously considering MPI from the viewpoint of both the user and the implementor. The usability of the interfaces for expressing advanced recovery techniques is then discussed, including the difficult issue of enabling separate software layers to coordinate their recovery.
Original language | English |
---|---|
Pages (from-to) | 244-254 |
Number of pages | 11 |
Journal | International Journal of High Performance Computing Applications |
Volume | 27 |
Issue number | 3 |
State | Published - Aug 2013 |
Externally published | Yes |
Funding
This document describes combined research conducted under the following contracts: NSF-0904952 and NSF-1144042 between the U.S. National Science Foundation and the University of Tennessee, and DE-FC02-11ER26059 supported by the U.S. Department of Energy.
Funders | Funder number |
---|---|
National Science Foundation | 0904952, 1144042 |
U.S. Department of Energy | DE-FC02-11ER26059 |
University of Tennessee | |
Keywords
- Fault tolerance
- Message Passing Interface
- User-level failure mitigation