TY - GEN
T1 - Task-graph scheduling extensions for efficient synchronization and communication
AU - Bak, Seonmyeong
AU - Hernandez, Oscar
AU - Gates, Mark
AU - Luszczek, Piotr
AU - Sarkar, Vivek
N1 - Publisher Copyright:
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2021/6/3
Y1 - 2021/6/3
N2 - Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs.
AB - Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs.
KW - Gang scheduling
KW - OpenMP
KW - Runtime system
KW - Task graph
KW - Work stealing
UR - http://www.scopus.com/inward/record.url?scp=85107481624&partnerID=8YFLogxK
U2 - 10.1145/3447818.3461616
DO - 10.1145/3447818.3461616
M3 - Conference contribution
AN - SCOPUS:85107481624
T3 - Proceedings of the International Conference on Supercomputing
SP - 88
EP - 101
BT - ICS 2021 - Proceedings of the 2021 ACM International Conference on Supercomputing
PB - Association for Computing Machinery
T2 - 35th ACM International Conference on Supercomputing, ICS 2021
Y2 - 14 June 2021 through 17 June 2021
ER -