Practical resource monitoring for robust high throughput computing

Gideon Juve, Benjamin Tovar, Rafael Ferreira Da Silva, Dariusz Król, Douglas Thain, Ewa Deelman, William Allcock, Miron Livny

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

22 Scopus citations

Abstract

Robust high throughput computing requires effective monitoring and enforcement of a variety of resources including CPU cores, memory, disk, and network traffic. Without effective monitoring and enforcement, it is easy to overload machines, causing failures and slowdowns, or underutilize machines, which results in wasted opportunities. This paper explores how to describe, measure, and enforce resources used by computational tasks. We focus on tasks running in distributed execution systems, in which a task requests the resources it needs, and the execution system ensures the availability of such resources. This presents two non-trivial problems: how to measure the resources consumed by a task, and how to monitor and report resource exhaustion in a robust and timely manner. For both of these tasks, operating systems have a variety of mechanisms with different degrees of availability, accuracy, overhead, and intrusiveness. We describe various forms of monitoring and the available mechanisms in contemporary operating systems. We then present two specific monitoring tools that choose different tradeoffs in overhead and accuracy, and evaluate them on a selection of benchmarks.
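As a rough illustration of the measurement problem the abstract describes, the sketch below runs one task as a child process and reports what the operating system recorded for it. It is not either of the monitoring tools presented in the paper. It assumes a Unix-like system with Python's standard `resource` module, and the names `run_and_measure` and `exceeds` are hypothetical helpers introduced only for this example.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(argv):
    """Run a command to completion and report the resources it consumed.

    Assumption: usage is read once, after the task exits, so this can only
    report exhaustion after the fact; it cannot enforce a limit in flight.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    returncode = subprocess.run(argv, check=False).returncode
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": returncode,
        "wall_seconds": wall,
        # CPU time is cumulative over waited-for children, so take the difference.
        "cpu_seconds": (after.ru_utime - before.ru_utime)
        + (after.ru_stime - before.ru_stime),
        # High-water mark over all waited-for children so far
        # (kilobytes on Linux, bytes on macOS).
        "max_rss": after.ru_maxrss,
    }


def exceeds(usage, limits):
    """Return the names of resources whose measured use is above the declared limit."""
    return [name for name, limit in limits.items() if usage.get(name, 0) > limit]


if __name__ == "__main__":
    usage = run_and_measure([sys.executable, "-c", "sum(range(10**6))"])
    print(usage)
    print(exceeds(usage, {"wall_seconds": 60.0, "cpu_seconds": 60.0}))
```

Reading usage only at exit keeps overhead close to zero but misses disk and network activity and cannot stop a task that is overloading a machine while it runs, which is the kind of overhead-versus-accuracy tradeoff the abstract refers to.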

Original language: English
Title of host publication: Proceedings - 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 650-657
Number of pages: 8
ISBN (Electronic): 9781467365987
State: Published - Oct 26 2015
Event: IEEE International Conference on Cluster Computing, CLUSTER 2015 - Chicago, United States
Duration: Sep 8 2015 → Sep 11 2015

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2015-October
ISSN (Print): 1552-5244

Conference

Conference: IEEE International Conference on Cluster Computing, CLUSTER 2015
Country/Territory: United States
City: Chicago
Period: 09/8/15 → 09/11/15

Keywords

  • High-Throughput Computing
  • Monitoring
  • Profiling
