Abstract
Exascale systems built using multi-core processors are expected to experience several component faults during code executions lasting for hours. It is important to detect faults in processor cores so that faulty cores can be removed from scheduler pools, nodes with high failure rates can be swapped out, applications can be migrated, and checkpoint recoveries can be initiated. We propose lightweight codes that use chaotic computations and customized threads to detect component faults in multi-core processors. These codes concurrently execute dedicated threads that implement Poincaré and identity maps, which are customized to isolate faults in arithmetic operations, memory elements, and interconnects. Instruction execution errors and local memory errors are detected by threads dedicated to individual processor cores, and errors in inter-processor cross-connects are detected by global-local memory movements. We present preliminary implementation results on 4- and 48-core HP workstations under simulated faults.
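The idea behind using chaotic computations for fault detection can be illustrated with a minimal sketch. It is not the authors' code: it assumes the logistic map as a stand-in for the Poincaré maps named in the abstract, simulates a fault as a single bit flip in a double-precision value, and uses hypothetical helper names (`logistic_step`, `flip_bit`, `trajectory`, `first_divergence`). Because a chaotic map amplifies small differences, a tiny corruption that is initially below the comparison tolerance grows until the faulty trajectory visibly departs from a fault-free reference.

```python
import struct


def logistic_step(x, r=4.0):
    """One iteration of the logistic map x -> r*x*(1-x); chaotic at r = 4.
    Assumption: used here only as a simple stand-in for a Poincare map."""
    return r * x * (1.0 - x)


def flip_bit(x, bit):
    """Simulate a fault by flipping one bit of the IEEE-754 double x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


def trajectory(x0, steps, fault_at=None, fault_bit=20):
    """Iterate the map from x0; optionally corrupt the state at step fault_at."""
    x = x0
    out = []
    for i in range(steps):
        x = logistic_step(x)
        if i == fault_at:
            x = flip_bit(x, fault_bit)  # hypothetical injected fault
        out.append(x)
    return out


def first_divergence(reference, observed, tol=1e-6):
    """Index of the first step where the trajectories differ by more than tol,
    or None if they agree everywhere."""
    for i, (a, b) in enumerate(zip(reference, observed)):
        if abs(a - b) > tol:
            return i
    return None


if __name__ == "__main__":
    reference = trajectory(0.3, 200)
    faulty = trajectory(0.3, 200, fault_at=10)
    print("fault injected at step 10, detected at step",
          first_divergence(reference, faulty))
```

In this sketch the reference trajectory plays the role of a known-good result, for example one computed on a trusted core. A flip in a low-order mantissa bit changes the value by far less than the tolerance, so detection relies entirely on the map's amplification over subsequent iterations. The identity maps and memory-movement checks described in the abstract are not covered here.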
Original language | English |
---|---|
Pages | 27-32 |
Number of pages | 6 |
State | Published - 2013 |
Event | 3rd ACM Workshop on Fault-Tolerance for HPC at eXtreme Scale, FTXS 2013 - New York, NY, United States; Duration: Jun 18 2013 → Jun 18 2013 |
Conference
Conference | 3rd ACM Workshop on Fault-Tolerance for HPC at eXtreme Scale, FTXS 2013 |
---|---|
Country/Territory | United States |
City | New York, NY |
Period | 06/18/13 → 06/18/13 |
Keywords
- chaotic maps
- exascale systems
- fault detection
- multi-core processors
- resilience