TY - GEN
T1 - A Multi-faceted Approach to Job Placement for Improved Performance on Extreme-Scale Systems
AU - Zimmer, Christopher
AU - Gupta, Saurabh
AU - Atchley, Scott
AU - Vazhkudai, Sudharshan S.
AU - Albing, Carl
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/7/2
Y1 - 2016/7/2
N2 - Job placement plays a pivotal role in application performance on supercomputers. We present a multi-faceted exploration to influence placement in extreme-scale systems, to improve network performance and decrease variability. In our first exploration, Scores, we developed a machine learning model that extracts features from a job's node-allocation and grades performance. This identified several important node-metrics that led to Dual-Ended scheduling, a means of reducing network contention without impacting utilization. In evaluations on the Titan supercomputer, we observed reductions in average hop-count by up to 50%. We also developed an improved node-layout strategy that targets a better balance between network latency and bandwidth, replacing the default ALPS layout on Titan that resulted in an average of 10% runtime improvement. Both of these efforts underscore the importance of a job placement strategy that is cognizant of workload mixture and network topology.
AB - Job placement plays a pivotal role in application performance on supercomputers. We present a multi-faceted exploration to influence placement in extreme-scale systems, to improve network performance and decrease variability. In our first exploration, Scores, we developed a machine learning model that extracts features from a job's node-allocation and grades performance. This identified several important node-metrics that led to Dual-Ended scheduling, a means of reducing network contention without impacting utilization. In evaluations on the Titan supercomputer, we observed reductions in average hop-count by up to 50%. We also developed an improved node-layout strategy that targets a better balance between network latency and bandwidth, replacing the default ALPS layout on Titan that resulted in an average of 10% runtime improvement. Both of these efforts underscore the importance of a job placement strategy that is cognizant of workload mixture and network topology.
UR - http://www.scopus.com/inward/record.url?scp=85017194017&partnerID=8YFLogxK
U2 - 10.1109/SC.2016.86
DO - 10.1109/SC.2016.86
M3 - Conference contribution
AN - SCOPUS:85017194017
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
SP - 1015
EP - 1025
BT - Proceedings of SC 2016
PB - IEEE Computer Society
T2 - 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016
Y2 - 13 November 2016 through 18 November 2016
ER -