Abstract
Multicore clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issues exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topologies, as well as by using single-copy, kernel-assisted mechanisms. However, in distributed environments, a single-level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or to operate multiple concurrent copies. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating the intra- and inter-node communications. In this work, we present HierKNEM, a kernel-assisted, topology-aware collective framework, and the mechanisms deployed by this framework to orchestrate the collaboration between multiple layers of collective algorithms. The resulting scheme maximizes the overlap of intra- and inter-node communications. We demonstrate experimentally, by considering three of the most widely used collective operations (Broadcast, Allgather and Reduction), that (1) this approach is immune to modifications of the underlying process-core binding; (2) it outperforms state-of-the-art MPI libraries (Open MPI, MPICH2 and MVAPICH2), demonstrating up to a 30x speedup for synthetic benchmarks and up to a 3x acceleration for a parallel graph application (ASP); and (3) it demonstrates a linear speedup as the number of cores per compute node increases, a paramount requirement for scalability on future many-core hardware.
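The layering described in the abstract can be illustrated with a small model. The sketch below is not the HierKNEM implementation: it is a toy, pure-Python illustration that assumes one leader per node (the root on its own node, the lowest rank elsewhere), a flat inter-node tree, and a simple two-stage pipeline cost formula. The function names `build_two_level_broadcast` and `pipeline_time` and the parameters `t_inter` and `t_intra` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def build_two_level_broadcast(rank_to_node, root=0):
    """Return (leaders, inter_edges, intra_edges) for a leader-based broadcast.

    rank_to_node maps each rank to an integer node id. The root leads its own
    node; every other node is led by its lowest rank (an assumed convention).
    """
    node_ranks = defaultdict(list)
    for rank in sorted(rank_to_node):
        node_ranks[rank_to_node[rank]].append(rank)

    root_node = rank_to_node[root]
    leaders = {
        node: (root if node == root_node else ranks[0])
        for node, ranks in node_ranks.items()
    }

    # Inter-node layer: the root sends to every other leader
    # (a flat tree, chosen for brevity rather than performance).
    inter_edges = [
        (root, leader)
        for node, leader in sorted(leaders.items())
        if node != root_node
    ]
    # Intra-node layer: each leader forwards to the other ranks on its node.
    intra_edges = [
        (leaders[node], rank)
        for node, ranks in sorted(node_ranks.items())
        for rank in ranks
        if rank != leaders[node]
    ]
    return leaders, inter_edges, intra_edges


def pipeline_time(segments, t_inter, t_intra, overlap):
    """Model the time for one leader to receive and redistribute a message.

    The message is cut into `segments` pieces; each piece costs t_inter to
    arrive over the network and t_intra to copy inside the node.
    """
    if not overlap:
        # Each segment is received, then copied, with no concurrency.
        return segments * (t_inter + t_intra)
    # Two-stage pipeline: the slower stage dominates in steady state.
    return t_inter + (segments - 1) * max(t_inter, t_intra) + t_intra
```

For four ranks on two nodes (`{0: 0, 1: 0, 2: 1, 3: 1}`), the plan has one inter-node edge between the leaders and one intra-node edge per node. In the cost model, overlapping the two layers hides the cheaper stage behind the more expensive one, which is the effect the abstract refers to as maximizing the overlap of intra- and inter-node communications.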
Original language | English |
---|---|
Pages (from-to) | 1000-1010 |
Number of pages | 11 |
Journal | Journal of Parallel and Distributed Computing |
Volume | 73 |
Issue number | 7 |
State | Published - 2013 |
Funding
Experiments presented in this paper were carried out using the Grid’5000 experimental testbed, which is being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).
Teng Ma is a Ph.D. student in the EECS Department of the University of Tennessee, Knoxville. His research interests focus on the design, implementation, and modeling of parallel communication libraries; parallel computer architectures; cluster and grid computing; modeling of parallel benchmarks; clusters with hardware accelerators; parallel programming models; and parallel I/O and distributed file systems.
George Bosilca is a Research Associate Professor of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville. He obtained his Ph.D. from the University of Paris XI with a background in computer architecture and parallel computing. He is an active member of several large-scale projects (Open MPI, MPICH-V, CCI, DAGuE), and his research encompasses a large area of the distributed computing world. From low-level network protocols to algorithm-based fault tolerance, Dr. Bosilca's research aims to reduce the gap between peak and sustained performance on large-scale execution environments, effectively taking advantage of the heterogeneous capabilities of current and future computing platforms.
Aurelien Bouteiller received his Ph.D. from the University of Paris in 2006, under the direction of Franck Cappello. His research is focused on improving the performance and reliability of distributed memory systems. Toward that goal, he has investigated automatic (message-logging-based) checkpointing approaches in MPI, algorithm-based fault tolerance approaches and their runtime support, mechanisms to improve communication speed and balance within nodes of many-core clusters, and the use of emerging data flow programming models to negate the rise of jitter on large-scale systems (DAGuE project). These works resulted in over twenty-five publications in international conferences and journals and three distinguished paper awards from IPDPS and EuroPar. He is also a contributor to Open MPI and participates in the MPI-3 Forum.
Jack J. Dongarra holds an appointment at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, the use of advanced computer architectures, programming methodology, and tools for parallel computers. He was awarded the IEEE Sid Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches; in 2008 he was the recipient of the first IEEE Medal of Excellence in Scalable Computing; in 2010 he was the first recipient of the SIAM Special Interest Group on Supercomputing’s award for Career Achievement; and in 2011 he was the recipient of the IEEE IPDPS 2011 Charles Babbage Award. He is a Fellow of the AAAS, ACM, IEEE, and SIAM and a member of the National Academy of Engineering.
Funders | Funder number |
---|---|
IEEE Foundation | |
Keywords
- Cluster
- Collective communication
- HPC
- Hierarchical
- MPI
- Multicore