Abstract
The paper describes Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non‐transposed matrix multiplication routine C = A ⋅ B, but also the transposed multiplication routines C = Aᵀ ⋅ B, C = A ⋅ Bᵀ, and C = Aᵀ ⋅ Bᵀ, for a block cyclic data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. Together, the PUMMA routines provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta computer.
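The four products listed above mirror the transpose options of the Level 3 BLAS xGEMM interface. As a minimal illustrative sketch only (serial NumPy as a stand-in, not PUMMA's distributed block-cyclic implementation; `gemm_like` is a hypothetical helper name), the variants can be expressed as:

```python
import numpy as np

def gemm_like(A, B, transa=False, transb=False):
    """Serial stand-in mirroring xGEMM's transpose options.

    PUMMA computes the same four products, but over matrices
    distributed block-cyclically across a processor grid.
    """
    opA = A.T if transa else A
    opB = B.T if transb else B
    return opA @ opB  # C = op(A) * op(B)

A = np.arange(6.0).reshape(2, 3)  # 2x3
B = np.arange(6.0).reshape(2, 3)  # 2x3

C1 = gemm_like(A, B, transb=True)              # C = A  * Bᵀ -> 2x2
C2 = gemm_like(A, B, transa=True)              # C = Aᵀ * B  -> 3x3
C3 = gemm_like(A.T, B, transa=True, transb=True)  # C = Aᵀᵀ * Bᵀ = A * Bᵀ
```

Note that the operand shapes must be conformable after the transposes are applied, which is why A and B are both 2x3 here: A ⋅ Bᵀ and Aᵀ ⋅ B are defined, while A ⋅ B is not.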
| Original language | English |
|---|---|
| Pages (from-to) | 543-570 |
| Number of pages | 28 |
| Journal | Concurrency Practice and Experience |
| Volume | 6 |
| Issue number | 7 |
| DOIs | |
| State | Published - Oct 1994 |
| Externally published | Yes |