Design and implementation of a scalable membership service for supercomputer resiliency-aware runtime

Yoav Tock, Benjamin Mandler, José Moreira, Terry Jones

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

3 Scopus citations

Abstract

As HPC systems and applications grow larger and more complex, we are approaching an era in which resiliency and run-time elasticity concerns become paramount. We offer a building block for an alternative resiliency approach in which computations can make progress while components fail, and in which the set of nodes can change dynamically throughout a computation's lifetime. The core of our solution is a hierarchical, scalable membership service providing eventual consistency semantics. An attribute replication service is used for hierarchy organization and is exposed to external applications. Our solution is based on P2P technologies and provides resiliency and elastic runtime support at ultra-large scales. The resulting middleware is general purpose while exploiting the unique features and architecture of HPC platforms. We have implemented and tested this system on BlueGene/P with Linux and, using worst-case analysis, evaluated the service's scalability as effective for up to 1M nodes.

Original language: English
Title of host publication: Euro-Par 2013 Parallel Processing - 19th International Conference, Proceedings
Pages: 354-366
Number of pages: 13
State: Published - 2013
Event: 19th International Conference on Parallel Processing, Euro-Par 2013 - Aachen, Germany
Duration: Aug 26 2013 to Aug 30 2013

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8097 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 19th International Conference on Parallel Processing, Euro-Par 2013
Country/Territory: Germany
City: Aachen
Period: 08/26/13 to 08/30/13

Funding

This research received funding from the U.S. DoE under award No. DE-SC0002107; the European Community’s FP7/2007-2013 Programme under grant agreement No. 317862; and used resources of the Oak Ridge Leadership Computing Facility at ORNL, which is supported by the U.S. DoE Office of Science under Contract No. DE-AC05-00OR22725.
