TY - GEN
T1 - Fast and small short vector SIMD matrix multiplication kernels for the synergistic processing element of the CELL processor
AU - Alvaro, Wesley
AU - Kurzak, Jakub
AU - Dongarra, Jack
PY - 2008
Y1 - 2008
N2 - Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single precision floating point performance. In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix multiplication kernels are presented implementing the C = C - A × B^T operation and the C = C - A × B operation for matrices of size 64 × 64 elements. For the latter case, the performance of 25.55 Gflop/s is reported, or 99.80 percent of the peak, using as little as 5.9 KB of storage for code and auxiliary data structures.
AB - Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single precision floating point performance. In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix multiplication kernels are presented implementing the C = C - A × B^T operation and the C = C - A × B operation for matrices of size 64 × 64 elements. For the latter case, the performance of 25.55 Gflop/s is reported, or 99.80 percent of the peak, using as little as 5.9 KB of storage for code and auxiliary data structures.
UR - http://www.scopus.com/inward/record.url?scp=47749147387&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-69384-0_98
DO - 10.1007/978-3-540-69384-0_98
M3 - Conference contribution
AN - SCOPUS:47749147387
SN - 3540693831
SN - 9783540693833
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 935
EP - 944
BT - Computational Science - ICCS 2008 - 8th International Conference, Proceedings
T2 - 8th International Conference on Computational Science, ICCS 2008
Y2 - 23 June 2008 through 25 June 2008
ER -