Scaling up matrix computations on shared-memory manycore systems with 1000 CPU cores

Fengguang Song, Jack Dongarra

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

10 Scopus citations

Abstract

While the growing number of cores per chip allows researchers to solve larger scientific and engineering problems, the parallel efficiency of the deployed parallel software starts to decrease. This unscalability problem affects both vendor-provided and open-source software, and it wastes CPU cycles and energy. Expecting CPUs with hundreds of cores to be imminent, we have designed a new framework to perform matrix computations on massively many cores. Our performance analysis on manycore systems shows that the unscalability bottleneck is related to Non-Uniform Memory Access (NUMA), specifically memory bus contention and remote memory access latency. To overcome the bottleneck, we have designed NUMA-aware tile algorithms with the help of a dynamic scheduling runtime system to minimize NUMA memory accesses. The main idea is to identify the data that is either read a number of times or written once by a thread resident on a remote NUMA node, and then use the runtime system to conduct data caching and movement between different NUMA nodes. Based on experiments with QR factorizations, we demonstrate that our framework achieves great scalability on a 48-core AMD Opteron system (e.g., parallel efficiency drops only 3% from one core to 48 cores). We also deploy our framework on an extreme-scale shared-memory SGI machine that has 1024 CPU cores and runs a single Linux operating system image. Our framework continues to scale well, and can outperform the vendor-optimized Intel MKL library by up to 750%.
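The caching idea in the abstract can be illustrated with a toy model. The sketch below is not the authors' framework or runtime system. It is a minimal illustration under stated assumptions: tiles have a fixed home NUMA node, a trace lists which node reads which tile, and a cached copy never needs invalidation (the write-once case from the abstract is not modeled). The names `remote_accesses`, `home_node`, and `trace` are hypothetical.

```python
def remote_accesses(trace, home_node, use_cache):
    """Count remote NUMA reads in a toy access trace.

    trace:      list of (reader_node, tile_id) read events (hypothetical format).
    home_node:  dict mapping tile_id -> NUMA node whose memory holds the tile.
    use_cache:  if True, the first remote read of a tile by a node copies the
                tile to that node, so later reads from that node are local.
    """
    cached = set()  # (node, tile) pairs that already have a local copy
    remote = 0
    for node, tile in trace:
        if home_node[tile] == node or (node, tile) in cached:
            continue  # local read, no remote traffic
        remote += 1  # one remote read (or one copy transfer when caching)
        if use_cache:
            cached.add((node, tile))
    return remote


# Two tiles, each homed on a different NUMA node (assumed layout).
home_node = {"A00": 0, "A01": 1}

# Node 1 reads A00 four times, node 0 reads A01 three times;
# the remaining reads are local to the tile's home node.
trace = [(1, "A00")] * 4 + [(0, "A00")] * 2 + [(0, "A01")] * 3 + [(1, "A01")]

print(remote_accesses(trace, home_node, use_cache=False))  # prints 7
print(remote_accesses(trace, home_node, use_cache=True))   # prints 2
```

In this toy trace, caching pays off because each remotely read tile is read more than once by the same node, which matches the "read a number of times" condition the abstract describes.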

Original language: English
Title of host publication: ICS 2014 - Proceedings of the 28th ACM International Conference on Supercomputing
Publisher: Association for Computing Machinery
Pages: 333-342
Number of pages: 10
ISBN (Print): 9781450326421
DOIs
State: Published - 2014
Event: 28th ACM International Conference on Supercomputing, ICS 2014 - Munich, Germany
Duration: Jun 10 2014 to Jun 13 2014

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 28th ACM International Conference on Supercomputing, ICS 2014
Country/Territory: Germany
City: Munich
Period: 06/10/14 to 06/13/14

Keywords

  • manycore systems
  • numa
  • performance analysis
  • runtime system
