TY - GEN
T1 - Maximizing the performance of irregular applications on multithreaded, NUMA systems
AU - Cong, Guojing
AU - Wen, Huifang
PY - 2013
Y1 - 2013
N2 - In modern shared-memory systems, the communication latency and available resources for a group of logical processors are determined by their relative position in the hierarchy of chips, cores, and hardware threads. Thus the performance of multithreaded applications varies with the mapping of software threads to logical processors. In our study we observe huge variation in application performance under different mappings. Moreover, applications with irregular access patterns perform poorly under the default mapping. We maximize application performance by balancing communication overhead and available resources. Remote access overhead in irregular applications dominates execution time and cannot be reduced by mapping alone on NUMA systems when the logical processors span multiple chips. In addition to new data replication and distribution optimizations, we improve geographical locality by matching access pattern to the data layout. We further propose a locality-centric optimization for simultaneously reducing remote accesses and improving cache performance. Our approach achieves better performance than prior NUMA-specific techniques.
UR - http://www.scopus.com/inward/record.url?scp=84891546902&partnerID=8YFLogxK
U2 - 10.1145/2535753.2535756
DO - 10.1145/2535753.2535756
M3 - Conference contribution
AN - SCOPUS:84891546902
SN - 9781450325035
T3 - Proc. of IA3 2013 - 3rd Workshop on Irregular Appl.: Architectures and Algorithms, Held in Conjunction with SC 2013: The Int. Conf. for High Performance Computing, Networking, Storage and Analysis
BT - Proc. of IA3 2013 - 3rd Workshop on Irregular Appl.
T2 - 3rd Workshop on Irregular Applications: Architectures and Algorithms, IA3 2013 - Held in Conjunction with the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Y2 - 17 November 2013 through 22 November 2013
ER -