TY - GEN
T1 - L2 cache modeling for scientific applications on chip multi-processors
AU - Song, Fengguang
AU - Moore, Shirley
AU - Dongarra, Jack
PY - 2007
Y1 - 2007
N2 - It is critical to provide high performance for scientific applications running on chip multi-processors (CMPs). A CMP architecture often comprises a shared L2 cache and the lower levels of the memory hierarchy. The shared L2 cache can reduce the number of cache misses when data are accessed in common by several threads, but it can also degrade performance due to resource contention. Running threads on all cores can sometimes cause severe contention and greatly increase the number of cache misses. To investigate how the performance of a thread varies when it runs concurrently with other threads on the remaining cores, we develop an analytical model that predicts the number of misses on the shared L2 cache. In particular, we apply the model to thread-parallel numerical programs. We assume that all threads compute homogeneous tasks and share a fully associative L2 cache. We use circular-sequence profiling and stack-processing techniques to analyze the L2 cache trace and predict the numbers of compulsory cache misses, capacity cache misses on shared data, and capacity cache misses on private data, respectively. Our method is able to predict the L2 cache performance for threads that share a global address space; for scientific applications, such threads often have overlapping memory footprints. We use a cycle-accurate simulator to validate the model with three scientific programs: dense matrix multiplication, blocked dense matrix multiplication, and sparse matrix-vector product. The average relative errors for the three experiments are 8.01%, 1.85%, and 2.41%, respectively.
KW - Architecture
KW - Cache performance modeling
KW - Chip multi-processor
KW - Multi-threaded programming
UR - http://www.scopus.com/inward/record.url?scp=47249151449&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2007.52
DO - 10.1109/ICPP.2007.52
M3 - Conference contribution
AN - SCOPUS:47249151449
SN - 076952933X
SN - 9780769529332
T3 - Proceedings of the International Conference on Parallel Processing
SP - 51
BT - 2007 International Conference on Parallel Processing, ICPP
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 36th International Conference on Parallel Processing, ICPP 2007, Xi'an, China
Y2 - 10 September 2007 through 14 September 2007
ER -