TY - GEN
T1 - Optimizing large-scale graph analysis on a multi-threaded, multi-core platform
AU - Cong, Guojing
AU - Makarychev, Konstantin
PY - 2011
Y1 - 2011
N2 - Erratic memory access patterns make it hard to implement fast large-scale graph analysis. Although algorithms with fine-grain parallelism seem to benefit from multithreading, it is unclear whether the long memory latency of such workloads is fully masked on current systems, and if not, whether improving locality brings any performance benefit, especially when the cache is simple. We optimize several fundamental graph algorithms on a multi-threaded, multi-core platform with simple caches. Although the naive implementation scales, we show that the number of hardware threads is nonetheless insufficient to fully mask the memory latency of typical graph analysis workloads, and the processor is unlikely to be fully utilized. In optimizing for cache performance, we show that known cache-friendly designs that prove effective on traditional architectures do not perform well on this platform. We explore low-cost measures such as software prefetching and manipulating the storage layout of the input to improve performance. Our results show that, compared with the original implementation, our optimizations achieve speedups between 10% and 200% at different numbers of threads.
KW - Multi-threading
KW - Parallel Graph Algorithms
KW - Software Prefetch
KW - Traversal
UR - http://www.scopus.com/inward/record.url?scp=80053249743&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2011.393
DO - 10.1109/IPDPS.2011.393
M3 - Conference contribution
AN - SCOPUS:80053249743
SN - 9780769543857
T3 - Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
SP - 688
EP - 697
BT - Proceedings - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
T2 - 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2011
Y2 - 16 May 2011 through 20 May 2011
ER -