TY - JOUR
T1 - Addressing failures in exascale computing
AU - Snir, Marc
AU - Wisniewski, Robert W.
AU - Abraham, Jacob A.
AU - Adve, Sarita V.
AU - Bagchi, Saurabh
AU - Balaji, Pavan
AU - Belak, Jim
AU - Bose, Pradip
AU - Cappello, Franck
AU - Carlson, Bill
AU - Chien, Andrew A.
AU - Coteus, Paul
AU - DeBardeleben, Nathan A.
AU - Diniz, Pedro C.
AU - Engelmann, Christian
AU - Erez, Mattan
AU - Fazzari, Saverio
AU - Geist, Al
AU - Gupta, Rinku
AU - Johnson, Fred
AU - Krishnamoorthy, Sriram
AU - Leyffer, Sven
AU - Liberty, Dean
AU - Mitra, Subhasish
AU - Munson, Todd
AU - Schreiber, Rob
AU - Stearley, Jon
AU - Van Hensbergen, Eric
PY - 2014/5
Y1 - 2014/5
N2 - We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.
AB - We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.
KW - Resilience
KW - exascale
KW - extreme-scale computing
KW - fault-tolerance
KW - high-performance computing
UR - http://www.scopus.com/inward/record.url?scp=84900560822&partnerID=8YFLogxK
U2 - 10.1177/1094342014522573
DO - 10.1177/1094342014522573
M3 - Article
AN - SCOPUS:84900560822
SN - 1094-3420
VL - 28
SP - 129
EP - 173
JO - International Journal of High Performance Computing Applications
JF - International Journal of High Performance Computing Applications
IS - 2
ER -