Abstract
As a child, were you ever afraid that a monster lurking in your bedroom would leap out of the dark and get you? My job at Oak Ridge National Laboratory is to worry about a similar monster, hiding in the steel cabinets of the supercomputers and threatening to crash the largest computing machines on the planet. The monster is something supercomputer specialists call resilience, or rather the lack of resilience. It has bitten several supercomputers in the past. A high-profile example affected what was the second-fastest supercomputer in the world in 2002, a machine called ASCI Q at Los Alamos National Laboratory. When it was first installed at the New Mexico lab, this computer couldn't run more than an hour or so without crashing.
Original language | English |
---|---|
Article number | 7420396 |
Pages (from-to) | 30-35 |
Number of pages | 6 |
Journal | IEEE Spectrum |
Volume | 53 |
Issue number | 3 |
State | Published - Mar 2016 |