Coordinated Fault Tolerance for High Performance Computing

    Project: Research

    Project Details

    Description

    This project will create a "Fault Tolerance Backplane" (FTB) and build the infrastructure necessary to enable systems to adapt to faults in a holistic manner. The approach has three parts: design a reference implementation of a fault awareness and notification backplane that provides common, uniform event-handling and notification mechanisms for fault-aware libraries and middleware; create an interface specification that allows libraries, run-time systems, and applications to connect to and use the FTB; and extend key libraries and applications to validate the interface choices and to form the critical mass necessary for adoption in the community. The FTB will be designed and built to provide lightweight coordination and rudimentary prediction capabilities, and it will allow applications to survive many types of errors. The project will initially work with chemistry and fusion applications and then extend the adaptive fault capabilities to other Scientific Discovery through Advanced Computing (SciDAC) applications.
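    The idea of a common notification backplane can be illustrated with a small publish/subscribe sketch. This is not the FTB interface: the class names, method names, and event categories below (Backplane, FaultEvent, subscribe, publish, "node_failure") are hypothetical and chosen only to show how independent components might share fault information through one uniform mechanism.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class FaultEvent:
    """A fault notification (all field names are illustrative assumptions)."""
    source: str    # component that observed the fault
    category: str  # kind of fault, e.g. "node_failure"
    detail: str    # free-form description


Handler = Callable[[FaultEvent], None]


class Backplane:
    """Toy in-process backplane: routes events to handlers by category."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, category: str, handler: Handler) -> None:
        # Register a handler for one fault category.
        self._subscribers[category].append(handler)

    def publish(self, event: FaultEvent) -> int:
        # Deliver the event to every handler registered for its category
        # and return how many handlers were notified.
        handlers = list(self._subscribers.get(event.category, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


# Usage: a library subscribes, and another component reports a fault.
received: List[FaultEvent] = []
backplane = Backplane()
backplane.subscribe("node_failure", received.append)

delivered = backplane.publish(
    FaultEvent("job_scheduler", "node_failure", "node 17 unreachable")
)
ignored = backplane.publish(
    FaultEvent("file_system", "disk_warning", "latency above threshold")
)
```

    In this sketch the publisher does not need to know who is listening, which is the property that lets separate libraries and applications coordinate their responses to a fault. A real backplane would also have to work across processes and nodes, which this single-process example does not attempt.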

    Status: Finished
    Effective start/end date: 09/30/06 to 09/30/11

    Funding

    • U.S. Department of Energy
