Coordinated Fault Tolerance for High Performance Computing

    Project: Research

    Project Details

    Description

    New FY2008 funds were received under B&R category KJ0101020. The new FWP is ERKJD66. This project will create a "Fault Tolerance Backplane" (FTB) and build the infrastructure necessary to enable systems to adapt to faults in a holistic manner. The approach will be to design a reference implementation of a fault awareness and notification backplane to provide common, uniform, event-handling and notification mechanisms for fault-aware libraries and middleware; create an interface specification that allows libraries, run-time systems, and applications to connect to and use the fault-tolerant backplane; and extend key libraries and applications to validate the interface choices, and to form the critical mass necessary for adoption in the community. The FTB will be designed and built to provide lightweight coordination and rudimentary prediction capabilities. The FTB will allow applications to survive many types of errors. The project will initially work with chemistry and fusion applications and then extend the adaptive fault capabilities to other Scientific Discovery through Advanced Computing applications.
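
    The notification idea described above can be illustrated with a minimal in-process publish/subscribe sketch. This is not the FTB interface: the class names (Backplane, FaultEvent), the method names (subscribe, publish), and the event fields are all hypothetical, chosen only to show how a fault reported by one component could reach every other component that registered interest in that kind of fault.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class FaultEvent:
    """Hypothetical fault record; the field names are assumptions."""
    source: str       # component that detected the fault
    category: str     # kind of fault, e.g. "node_failure"
    detail: str = ""  # free-form description


Handler = Callable[[FaultEvent], None]


class Backplane:
    """Toy, single-process stand-in for a fault notification backplane."""

    def __init__(self) -> None:
        # Maps a fault category to the handlers registered for it.
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, category: str, handler: Handler) -> None:
        """Register a handler to be called for every event in a category."""
        self._handlers[category].append(handler)

    def publish(self, event: FaultEvent) -> int:
        """Deliver an event to its subscribers; return how many were notified."""
        handlers = list(self._handlers.get(event.category, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


backplane = Backplane()
received: List[FaultEvent] = []

# A library or application registers interest in one fault category.
backplane.subscribe("node_failure", received.append)

# Another component reports faults through the same backplane.
delivered = backplane.publish(
    FaultEvent("mpi_library", "node_failure", "node 17 unreachable")
)
ignored = backplane.publish(FaultEvent("file_system", "disk_error"))
```

    In this sketch the first event reaches the single registered handler, while the second is dropped because nothing subscribed to its category. A real backplane would also have to deliver events across processes and nodes, which this example does not attempt.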

    Status: Finished
    Effective start/end date: 09/30/06 to 09/30/09
