Abstract
This paper introduces a generic and flexible matrix-matrix multiplication algorithm, C = A × B, for state-of-the-art computing platforms. Typically, these platforms are distributed-memory machines whose nodes are each equipped with several accelerators. To the best of our knowledge, SLATE [9] is the only library that provides a publicly available implementation for such platforms, and it is currently limited to problem instances where the C matrix fits entirely in the memory of the GPU accelerators. Our algorithm relies on the classical tile-based outer-product algorithm, but enhances it with several control dependencies to increase data reuse and to optimize the communication flow to and from the accelerators within each node. The algorithm is written with the PaRSEC runtime system, which allows for a fast and generic implementation while achieving close-to-peak performance.
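For readers unfamiliar with the baseline, the classical tile-based outer-product formulation that the abstract builds on can be sketched as follows. This is a minimal, sequential, single-node illustration under stated assumptions: matrices are plain Python lists of lists, `nb` is a hypothetical square tile size, and the function names `outer_product_gemm` and `tile_gemm_update` are illustrative only. It omits everything the paper actually contributes: no distribution across nodes, no accelerators, no PaRSEC tasks, and no extra control dependencies. At each outer step k, tile column k of A and tile row k of B update every tile of C.

```python
from typing import List

Matrix = List[List[float]]


def zeros(rows: int, cols: int) -> Matrix:
    """Return a rows x cols matrix filled with 0.0."""
    return [[0.0] * cols for _ in range(rows)]


def tile_gemm_update(C: Matrix, A: Matrix, B: Matrix,
                     i0: int, j0: int, k0: int, nb: int) -> None:
    """In place: C[i0:i0+nb, j0:j0+nb] += A[i0:i0+nb, k0:k0+nb] @ B[k0:k0+nb, j0:j0+nb].

    Tiles on the matrix border may be smaller than nb; the min() bounds handle that.
    """
    i1 = min(i0 + nb, len(C))
    j1 = min(j0 + nb, len(C[0]))
    k1 = min(k0 + nb, len(B))
    for i in range(i0, i1):
        for k in range(k0, k1):
            a_ik = A[i][k]
            for j in range(j0, j1):
                C[i][j] += a_ik * B[k][j]


def outer_product_gemm(A: Matrix, B: Matrix, nb: int) -> Matrix:
    """Compute C = A x B as a sequence of tile-level outer-product (rank-nb) updates.

    Hypothetical sketch: the outer loop runs over tile columns of A / tile rows of B;
    within one step, every tile of C receives one independent update.
    """
    m, kdim, n = len(A), len(B), len(B[0])
    assert all(len(row) == kdim for row in A), "inner dimensions must agree"
    C = zeros(m, n)
    for k0 in range(0, kdim, nb):          # outer-product step k
        for i0 in range(0, m, nb):         # tile rows of C
            for j0 in range(0, n, nb):     # tile columns of C
                tile_gemm_update(C, A, B, i0, j0, k0, nb)
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(outer_product_gemm(A, B, nb=2))
```

Because all tile updates within one step k are independent, a runtime system can schedule them concurrently; the order in which steps and tiles are visited is exactly what additional dependencies can constrain to favor data reuse.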
Original language | English |
---|---|
Title of host publication | Proceedings of ScalA 2019 |
Subtitle of host publication | 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 33-41 |
Number of pages | 9 |
ISBN (Electronic) | 9781728159898 |
DOIs | |
State | Published - Nov 2019 |
Externally published | Yes |
Event | 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019 - Denver, United States Duration: Nov 18 2019 → … |
Publication series
Name | Proceedings of ScalA 2019: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis |
---|---|
Conference
Conference | 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2019 |
---|---|
Country/Territory | United States |
City | Denver |
Period | 11/18/19 → … |
Funding
Acknowledgement: This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. It used resources of the Oak Ridge Leadership Computing Facility at ORNL, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Keywords
- Accelerator architectures
- Linear Algebra
- Runtime environment