A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction

Azzam Haidar, Hatem Ltaief, Piotr Luszczek, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

22 Scopus citations

Abstract

We present new high performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing overheads associated with their high-level scheduling during the so-called bulge chasing procedure that is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve numerical correctness of the computation. We show our annihilation strategy based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: the memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy and, as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound which renders them amenable to efficient execution on multicore architectures. The performance for the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvement (∼130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 x 24000. Last but not least, we provide a comprehensive study on the impact of the coalescing group size in terms of cache utilization and power consumption in the context of this new two-stage bidiagonal reduction.

Original languageEnglish
Title of host publicationProceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
Pages25-35
Number of pages11
DOIs
StatePublished - 2012
Externally publishedYes
Event2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012 - Shanghai, China
Duration: May 21 2012May 25 2012

Publication series

NameProceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012

Conference

Conference2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
Country/TerritoryChina
CityShanghai
Period05/21/1205/25/12

Keywords

  • Bidiagonal Reduction
  • Bulge Chasing
  • Dynamic Scheduling
  • Granularity Analysis
  • Power Profiling
  • Tile Algorithms
  • Two-Stage Approach

Fingerprint

Dive into the research topics of 'A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction'. Together they form a unique fingerprint.

Cite this