Abstract
Today's petascale supercomputers are composed of tens of thousands of compute nodes. Failures on these massive machines are a growing problem as the time for a single compute node to fail is shrinking. Ideally, the job scheduler would be able to predict node failures ahead of time in order to minimize their impact on overall job throughput. However, due to the tight power constraints of future systems, the online modeling of real-time error data must be accomplished using as little power as possible. To this end, the IBM TrueNorth Neurosynaptic System is used to create a Spiking Neural Network (SNN) model of supercomputer failure data, and the classification accuracy of this model is compared to that of other Machine Learning (ML) and Deep Learning (DL) techniques. The TrueNorth failure classification model yields a training accuracy of 99.41%, a validation accuracy of 98.12%, and a testing accuracy of 99.80%, and it outperforms the other ML and DL approaches. Moreover, the TrueNorth SNN consumes five orders of magnitude less power than the other ML/DL approaches during the testing phase. Additionally, all ML/DL approaches investigated as part of this study are able to produce accurate models of the supercomputer system failure data.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence, SSCI 2018 |
| Editors | Suresh Sundaram |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 242-249 |
| Number of pages | 8 |
| ISBN (Electronic) | 9781538692769 |
| State | Published - Jul 2 2018 |
| Externally published | Yes |
| Event | 8th IEEE Symposium Series on Computational Intelligence, SSCI 2018 - Bangalore, India; Duration: Nov 18 2018 → Nov 21 2018 |
Publication series
| Name | Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence, SSCI 2018 |
|---|---|
Conference
| Conference | 8th IEEE Symposium Series on Computational Intelligence, SSCI 2018 |
|---|---|
| Country/Territory | India |
| City | Bangalore |
| Period | 11/18/18 → 11/21/18 |
Funding
This work was funded by the Air Force Research Lab (AFRL), USA (Award Number: FA8750-15-2-0078). The authors are grateful to IBM Research for lending the TrueNorth chip and development kit and for organizing the TrueNorth Boot-camp, and would specifically like to thank Dr. Ben Shaw for getting our paper reviewed by IBM. Dr. Catherine Schumann of the Oak Ridge National Laboratory generously provided feedback on the paper. The authors would further like to thank Rensselaer Polytechnic Institute (RPI) and the Center for Computational Innovation (CCI) at RPI.
Keywords
- Deep Learning
- Machine Learning
- Neuromorphic Computing
- Supercomputer Failures