TY - JOUR
T1 - A Job Sizing Strategy for High-Throughput Scientific Workflows
AU - Tovar, Benjamin
AU - Ferreira da Silva, Rafael
AU - Juve, Gideon
AU - Deelman, Ewa
AU - Allcock, William
AU - Thain, Douglas
AU - Livny, Miron
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2018/2/1
Y1 - 2018/2/1
N2 - The user of a computing facility must make a critical decision when submitting jobs for execution: How many resources (such as cores, memory, and disk) should be requested for each job? If the request is too small, the job may fail due to resource exhaustion; if the request is too large, the job may succeed, but resources will be wasted. This decision is especially important when running hundreds of thousands of jobs in a high-throughput workflow, which may exhibit complex, long-tailed distributions of resource consumption. In this paper, we present a strategy for solving the job sizing problem: (1) applications are monitored and measured in user space as they run; (2) the resource usage is collected into an online archive; and (3) jobs are automatically sized according to historical data in order to maximize throughput or minimize waste. We evaluate the solution analytically and present case studies of applying the technique to high-throughput physics and bioinformatics workflows consisting of hundreds of thousands of jobs, demonstrating an increase in throughput of 10-400 percent compared to naive approaches.
AB - The user of a computing facility must make a critical decision when submitting jobs for execution: How many resources (such as cores, memory, and disk) should be requested for each job? If the request is too small, the job may fail due to resource exhaustion; if the request is too large, the job may succeed, but resources will be wasted. This decision is especially important when running hundreds of thousands of jobs in a high-throughput workflow, which may exhibit complex, long-tailed distributions of resource consumption. In this paper, we present a strategy for solving the job sizing problem: (1) applications are monitored and measured in user space as they run; (2) the resource usage is collected into an online archive; and (3) jobs are automatically sized according to historical data in order to maximize throughput or minimize waste. We evaluate the solution analytically and present case studies of applying the technique to high-throughput physics and bioinformatics workflows consisting of hundreds of thousands of jobs, demonstrating an increase in throughput of 10-400 percent compared to naive approaches.
KW - High throughput computing (HTC)
KW - Automatic job sizing
KW - Automatic provisioning of resources
KW - Resource monitoring and enforcement
KW - Throughput and waste optimization
UR - https://www.scopus.com/pages/publications/85040665550
U2 - 10.1109/TPDS.2017.2762310
DO - 10.1109/TPDS.2017.2762310
M3 - Article
AN - SCOPUS:85040665550
SN - 1045-9219
VL - 29
SP - 240
EP - 253
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 2
M1 - 8066333
ER -