TY - GEN
T1 - Toward an end-to-end framework for modeling, monitoring and anomaly detection for scientific workflows
AU - Mandal, Anirban
AU - Ruth, Paul
AU - Baldin, Ilya
AU - Król, Dariusz
AU - Juve, Gideon
AU - Mayani, Rajiv
AU - Da Silva, Rafael Ferreira
AU - Deelman, Ewa
AU - Meredith, Jeremy
AU - Vetter, Jeffrey
AU - Lynch, Vickie
AU - Mayer, Ben
AU - Wynne, James
AU - Blanco, Mark
AU - Carothers, Chris
AU - Lapre, Justin
AU - Tierney, Brian
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/7/18
Y1 - 2016/7/18
AB - Modern science is often conducted on large-scale, distributed, heterogeneous and high-performance computing infrastructures. Increasingly, the scale and complexity of both the applications and the underlying execution platforms have been growing. Scientific workflows have emerged as a flexible representation to declaratively express complex applications with data and control dependences. However, it is extremely challenging for scientists to execute their science workflows in a reliable and scalable way due to a lack of understanding of expected and realistic behavior of complex scientific workflows on large-scale and distributed HPC systems. This is exacerbated by failures and anomalies in large-scale systems and applications, which makes detecting, analyzing and acting on anomaly events challenging. In this work, we present a prototype of an end-to-end system for modeling and diagnosing the runtime performance of complex scientific workflows. We interfaced the Pegasus workflow management system, Aspen performance modeling, monitoring and anomaly detection into an integrated framework that not only improves the understanding of complex scientific applications on large-scale complex infrastructure, but also detects anomalies and supports adaptivity. We present a black box modeling tool, a comprehensive online monitoring system, and anomaly detection algorithms that employ the models and monitoring data to detect anomaly events. We present an evaluation of the system with a Spallation Neutron Source workflow as a driving use case.
KW - Anomaly detection
KW - Monitoring
KW - Performance modeling
KW - Scientific workflows
UR - http://www.scopus.com/inward/record.url?scp=84991632903&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2016.202
DO - 10.1109/IPDPSW.2016.202
M3 - Conference contribution
AN - SCOPUS:84991632903
T3 - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
SP - 1370
EP - 1379
BT - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
Y2 - 23 May 2016 through 27 May 2016
ER -