Abstract
This paper presents a GPU implementation of an asynchronous iterative algorithm for computing incomplete factorizations. Asynchronous algorithms, with their ability to tolerate memory latency, form an important class of algorithms for modern computer architectures. Our GPU implementation considers several non-traditional techniques that can be important for asynchronous algorithms to optimize convergence and data locality. These techniques include controlling the order in which variables are updated by controlling the order of execution of thread blocks, taking advantage of cache reuse between thread blocks, and managing the amount of parallelism to control the convergence of the algorithm.
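The abstract does not spell out the iteration itself, so the following is only a minimal sketch. It assumes a fixed-point formulation of ILU(0) in which each nonzero of L and U is recomputed from the current values of the others, and it runs serially on a dense NumPy array rather than on a GPU. The function name `ilu0_sweeps` and its `order` and `sweeps` parameters are hypothetical, introduced here to illustrate how the update order can affect convergence.

```python
import numpy as np

def ilu0_sweeps(A, order=None, sweeps=5):
    """Fixed-point sweeps toward an ILU(0) factorization (illustrative only).

    A      : dense square array; its nonzero positions define the pattern.
    order  : optional permutation of the pattern entries (hypothetical knob
             standing in for the order in which variables are updated).
    sweeps : number of passes over all entries in the pattern.
    """
    n = A.shape[0]
    # Sparsity pattern in row-major order.
    pattern = [(i, j) for i in range(n) for j in range(n) if A[i, j] != 0.0]
    if order is not None:
        pattern = [pattern[p] for p in order]
    # Initial guess (an assumption): unit lower and upper parts of A.
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A).astype(float)
    for _ in range(sweeps):
        for i, j in pattern:
            m = min(i, j)
            # Updates are in place, so later entries see the newest values.
            s = A[i, j] - L[i, :m] @ U[:m, j]
            if i > j:
                L[i, j] = s / U[j, j]
            else:
                U[i, j] = s
    return L, U
```

In this sketch, a single sweep in row-major order reproduces the sequential ILU(0) factors, because every entry then depends only on entries already updated. Other orders generally need more sweeps, which is one way to see why controlling the update order matters for convergence.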
Original language | English |
---|---|
Article number | A1 |
Pages (from-to) | 1-16 |
Number of pages | 16 |
Journal | Lecture Notes in Computer Science |
Volume | 9137 LNCS |
State | Published - 2015 |
Externally published | Yes |
Event | 30th International Conference on High Performance Computing, ISC 2015 - Frankfurt, Germany; Duration: Jul 12 2015 → Jul 16 2015 |
Funding
This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Numbers DE-SC-0012538 and DE-SC-0010042. Support from NVIDIA is also acknowledged.
Funders | Funder number |
---|---|
U.S. Department of Energy | |
Advanced Scientific Computing Research | DE-SC-0010042, DE-SC-0012538 |