Efficient Classification of Supercomputer Failures Using Neuromorphic Computing

  • Prasanna Date
  • Christopher D. Carothers
  • James A. Hendler
  • Malik Magdon-Ismail

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

16 Scopus citations

Abstract

Today's petascale supercomputers comprise tens of thousands of compute nodes. Failures on these massive machines are a growing problem as the time for a single compute node to fail is shrinking. Ideally, the job scheduler would be able to predict node failures ahead of time in order to minimize their impact on overall job throughput. However, due to the tight power constraints of future systems, the online modeling of real-time error data must be accomplished using as little power as possible. To this end, the IBM TrueNorth Neurosynaptic System is used to create a Spiking Neural Network (SNN) model of supercomputer failure data, and the classification accuracy of this model is compared to other Machine Learning (ML) and Deep Learning (DL) techniques. The TrueNorth failure classification model yields a training accuracy of 99.41%, a validation accuracy of 98.12%, and a testing accuracy of 99.80%, and it outperforms the other machine learning and deep learning approaches. Moreover, the TrueNorth SNN consumes five orders of magnitude less power than the other ML/DL approaches during the testing phase. Additionally, all ML/DL approaches investigated as part of this study are able to produce accurate models of the supercomputer system failure data.

Original language: English
Title of host publication: Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence, SSCI 2018
Editors: Suresh Sundaram
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 242-249
Number of pages: 8
ISBN (Electronic): 9781538692769
State: Published - Jul 2 2018
Externally published: Yes
Event: 8th IEEE Symposium Series on Computational Intelligence, SSCI 2018 - Bangalore, India
Duration: Nov 18 2018 to Nov 21 2018

Publication series

Name: Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence, SSCI 2018

Conference

Conference: 8th IEEE Symposium Series on Computational Intelligence, SSCI 2018
Country/Territory: India
City: Bangalore
Period: 11/18/18 to 11/21/18

Funding

This work was funded by the Air Force Research Lab (AFRL), USA (Award Number: FA8750-15-2-0078). The authors are grateful to IBM Research for lending the TrueNorth chip and development kit and for organizing the TrueNorth Boot-camp, and would specifically like to thank Dr. Ben Shaw for getting our paper reviewed by IBM. Dr. Catherine Schumann of the Oak Ridge National Laboratory generously provided feedback on the paper. The authors would further like to thank Rensselaer Polytechnic Institute (RPI) and the Center for Computational Innovation (CCI) at RPI.

Keywords

  • Deep Learning
  • Machine Learning
  • Neuromorphic Computing
  • Supercomputer Failures
