TY - GEN
T1 - Realization of user level fault tolerant policy management through a holistic approach for fault correlation
AU - Park, Byung H.
AU - Naughton, Thomas J.
AU - Agarwal, Pratul
AU - Bernholdt, David E.
AU - Geist, Al
AU - Tippens, Jennifer L.
PY - 2011
Y1 - 2011
AB - Many modern scientific applications, which are designed to utilize high-performance parallel computers, occupy hundreds of thousands of computational cores and run for days or even weeks. Since many scientists compete for resources, most supercomputing centers practice strict scheduling policies and perform meticulous accounting of usage. Thus the computing resources and time assigned to a user are considered invaluable. However, most applications are not well prepared for unforeseeable faults and still rely on primitive fault tolerance techniques. Considering that the ever-plunging mean time to interrupt (MTTI) is making scientific applications more vulnerable to faults, it is increasingly important to provide users not only with an improved fault-tolerant environment, but also with a framework to support their own fault tolerance policies so that their allocation times can be best utilized. This paper addresses user-level fault tolerance policy management based on a holistic approach to digesting and correlating fault-related information. It introduces simple semantics with which users express their policies on faults, and illustrates how event correlation techniques can be applied to manage and determine the most preferable user policies. The paper also discusses an implementation of the framework using open source software, and demonstrates, as an example, how a molecular dynamics simulation application running on the institutional cluster at Oak Ridge National Laboratory benefits from it.
UR - https://www.scopus.com/pages/publications/80052392611
U2 - 10.1109/POLICY.2011.34
DO - 10.1109/POLICY.2011.34
M3 - Conference contribution
AN - SCOPUS:80052392611
SN - 9780769543307
T3 - Proceedings - 2011 IEEE International Symposium on Policies for Distributed Systems and Networks, POLICY 2011
SP - 17
EP - 24
BT - Proceedings - 2011 IEEE International Symposium on Policies for Distributed Systems and Networks, POLICY 2011
T2 - 2011 IEEE International Symposium on Policies for Distributed Systems and Networks, POLICY 2011
Y2 - 6 June 2011 through 8 June 2011
ER -