TY - GEN
T1 - Design and analysis of CXL performance models for tightly-coupled heterogeneous computing
AU - Cabrera, Anthony M.
AU - Young, Aaron R.
AU - Vetter, Jeffrey S.
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/4/2
Y1 - 2022/4/2
AB - Truly heterogeneous systems enable partitioned workloads to be mapped to the hardware that nets the best performance. However, current practice requires that inter-device communication between different vendors' hardware use host memory as an intermediary step. To date, there are no widely adopted solutions that allow accelerators to directly transfer data. A new cache-coherent protocol, CXL, aims to facilitate easier, fine-grained sharing between accelerators. In this work, we analyze existing methods for designing heterogeneous applications that target GPUs and FPGAs working collaboratively, followed by an exploration of the benefits of a CXL-enabled system. Specifically, we develop a test application that utilizes both an NVIDIA P100 GPU and a Xilinx U250 FPGA to show current communication limitations. From this application, we capture overall execution time and throughput measurements on the FPGA and GPU. We use these measurements as inputs to novel CXL performance models to show that using CXL caching instead of host memory results in a 1.31X speedup, while a more tightly-coupled pipelined implementation using CXL-enabled hardware would result in a speedup of 1.45X.
KW - CXL
KW - FPGA
KW - GPU
KW - GPU-FPGA collaboration
KW - heterogeneous computing
UR - http://www.scopus.com/inward/record.url?scp=85135386198&partnerID=8YFLogxK
U2 - 10.1145/3529336.3530817
DO - 10.1145/3529336.3530817
M3 - Conference contribution
AN - SCOPUS:85135386198
T3 - Proceedings of 2022 1st International Workshop on Extreme Heterogeneity Solutions, ExHET 2022
BT - Proceedings of 2022 1st International Workshop on Extreme Heterogeneity Solutions, ExHET 2022
PB - Association for Computing Machinery, Inc.
T2 - 1st International Workshop on Extreme Heterogeneity Solutions, ExHET 2022
Y2 - 2 April 2022
ER -