TY - GEN
T1 - Improving concurrency and asynchrony in multithreaded MPI applications using software offloading
AU - Vaidyanathan, Karthikeyan
AU - Kalamkar, Dhiraj D.
AU - Pamnany, Kiran
AU - Hammond, Jeff R.
AU - Balaji, Pavan
AU - Das, Dipankar
AU - Park, Jongsoo
AU - Joó, Bálint
N1 - Publisher Copyright:
© 2015 ACM.
PY - 2015/11/15
Y1 - 2015/11/15
N2 - We present a new approach for multithreaded communication and asynchronous progress in MPI applications, wherein we offload communication processing to a dedicated thread. The central premise is that given the rapidly increasing core counts on modern systems, the improvements in MPI performance arising from dedicating a thread to drive communication outweigh the small loss of resources for application computation, particularly when overlap of communication and computation can be exploited. Our approach allows application threads to make MPI calls concurrently, enqueuing these as communication tasks to be processed by a dedicated communication thread. This not only guarantees progress for such communication operations, but also reduces load imbalance. Our implementation additionally significantly reduces the overhead of mutual exclusion seen in existing implementations for applications using MPI_THREAD_MULTIPLE. Our technique requires no modification to the application, and we demonstrate significant performance improvement (up to 2X) for QCD, 1-D FFT, and deep learning CNN applications.
AB - We present a new approach for multithreaded communication and asynchronous progress in MPI applications, wherein we offload communication processing to a dedicated thread. The central premise is that given the rapidly increasing core counts on modern systems, the improvements in MPI performance arising from dedicating a thread to drive communication outweigh the small loss of resources for application computation, particularly when overlap of communication and computation can be exploited. Our approach allows application threads to make MPI calls concurrently, enqueuing these as communication tasks to be processed by a dedicated communication thread. This not only guarantees progress for such communication operations, but also reduces load imbalance. Our implementation additionally significantly reduces the overhead of mutual exclusion seen in existing implementations for applications using MPI_THREAD_MULTIPLE. Our technique requires no modification to the application, and we demonstrate significant performance improvement (up to 2X) for QCD, 1-D FFT, and deep learning CNN applications.
UR - http://www.scopus.com/inward/record.url?scp=84966639520&partnerID=8YFLogxK
U2 - 10.1145/2807591.2807602
DO - 10.1145/2807591.2807602
M3 - Conference contribution
AN - SCOPUS:84966639520
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2015
PB - IEEE Computer Society
T2 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015
Y2 - 15 November 2015 through 20 November 2015
ER -