TY - GEN
T1 - Practical resource monitoring for robust high throughput computing
AU - Juve, Gideon
AU - Tovar, Benjamin
AU - Da Silva, Rafael Ferreira
AU - Król, Dariusz
AU - Thain, Douglas
AU - Deelman, Ewa
AU - Allcock, William
AU - Livny, Miron
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/10/26
Y1 - 2015/10/26
N2 - Robust high throughput computing requires effective monitoring and enforcement of a variety of resources including CPU cores, memory, disk, and network traffic. Without effective monitoring and enforcement, it is easy to overload machines, causing failures and slowdowns, or underutilize machines, which results in wasted opportunities. This paper explores how to describe, measure, and enforce resources used by computational tasks. We focus on tasks running in distributed execution systems, in which a task requests the resources it needs, and the execution system ensures the availability of such resources. This presents two non-trivial problems: how to measure the resources consumed by a task, and how to monitor and report resource exhaustion in a robust and timely manner. For both of these tasks, operating systems have a variety of mechanisms with different degrees of availability, accuracy, overhead, and intrusiveness. We describe various forms of monitoring and the available mechanisms in contemporary operating systems. We then present two specific monitoring tools that choose different tradeoffs in overhead and accuracy, and evaluate them on a selection of benchmarks.
AB - Robust high throughput computing requires effective monitoring and enforcement of a variety of resources including CPU cores, memory, disk, and network traffic. Without effective monitoring and enforcement, it is easy to overload machines, causing failures and slowdowns, or underutilize machines, which results in wasted opportunities. This paper explores how to describe, measure, and enforce resources used by computational tasks. We focus on tasks running in distributed execution systems, in which a task requests the resources it needs, and the execution system ensures the availability of such resources. This presents two non-trivial problems: how to measure the resources consumed by a task, and how to monitor and report resource exhaustion in a robust and timely manner. For both of these tasks, operating systems have a variety of mechanisms with different degrees of availability, accuracy, overhead, and intrusiveness. We describe various forms of monitoring and the available mechanisms in contemporary operating systems. We then present two specific monitoring tools that choose different tradeoffs in overhead and accuracy, and evaluate them on a selection of benchmarks.
KW - High-Throughput Computing
KW - Monitoring
KW - Profiling
UR - http://www.scopus.com/inward/record.url?scp=84959282843&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2015.115
DO - 10.1109/CLUSTER.2015.115
M3 - Conference contribution
AN - SCOPUS:84959282843
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 650
EP - 657
BT - Proceedings - 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE International Conference on Cluster Computing, CLUSTER 2015
Y2 - 8 September 2015 through 11 September 2015
ER -