TY - GEN
T1 - SELF
T2 - 25th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2017
AU - Guo, Yuhua
AU - Liu, Qing
AU - Xiao, Weijun
AU - Huang, Ping
AU - Podhorszki, Norbert
AU - Klasky, Scott
AU - He, Xubin
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/11/13
Y1 - 2017/11/13
N2 - Die-stacked DRAM (a.k.a. on-chip DRAM) provides much higher bandwidth and lower latency than off-chip DRAM. It is a promising technology for breaking the 'memory wall'. Die-stacked DRAM can be used either as a cache (i.e., DRAM cache) or as a part of memory (PoM). A DRAM cache design suffers from more page faults than a PoM design because the DRAM cache cannot contribute to the capacity of main memory. At the same time, obtaining high performance requires PoM systems to swap requested data into the die-stacked DRAM. Existing PoM designs fall into two categories: line-based and page-based. The former ensures low off-chip bandwidth utilization but suffers from a low hit ratio of on-chip memory due to limited temporal locality. In contrast, page-based designs achieve a high hit ratio of on-chip memory, albeit at the cost of moving large amounts of data between on-chip and off-chip memories, leading to increased off-chip bandwidth utilization and significant system performance degradation. To achieve a hit ratio of on-chip memory similar to that of page-based designs while eliminating the excessive off-chip traffic involved, we propose SELF, a high-performance and bandwidth-efficient approach. The key idea is to SElectively swap Lines in a requested page that are likely to be accessed according to page Footprint, instead of blindly swapping an entire page. In doing so, SELF allows incoming requests to be serviced from the on-chip memory as much as possible, while avoiding swapping unused lines to reduce memory bandwidth consumption. We evaluate a memory system that consists of 4 GB of on-chip DRAM and 12 GB of off-chip DRAM. Compared to a baseline system that has the same total capacity of 16 GB of off-chip DRAM, SELF improves performance in terms of instructions per cycle by 26.9% and reduces the energy consumption per memory access by 47.9% on average. In contrast, state-of-the-art line-based and page-based PoM designs improve performance by only 9.5% and 9.9%, respectively, against the same baseline system.
AB - Die-stacked DRAM (a.k.a. on-chip DRAM) provides much higher bandwidth and lower latency than off-chip DRAM. It is a promising technology for breaking the 'memory wall'. Die-stacked DRAM can be used either as a cache (i.e., DRAM cache) or as a part of memory (PoM). A DRAM cache design suffers from more page faults than a PoM design because the DRAM cache cannot contribute to the capacity of main memory. At the same time, obtaining high performance requires PoM systems to swap requested data into the die-stacked DRAM. Existing PoM designs fall into two categories: line-based and page-based. The former ensures low off-chip bandwidth utilization but suffers from a low hit ratio of on-chip memory due to limited temporal locality. In contrast, page-based designs achieve a high hit ratio of on-chip memory, albeit at the cost of moving large amounts of data between on-chip and off-chip memories, leading to increased off-chip bandwidth utilization and significant system performance degradation. To achieve a hit ratio of on-chip memory similar to that of page-based designs while eliminating the excessive off-chip traffic involved, we propose SELF, a high-performance and bandwidth-efficient approach. The key idea is to SElectively swap Lines in a requested page that are likely to be accessed according to page Footprint, instead of blindly swapping an entire page. In doing so, SELF allows incoming requests to be serviced from the on-chip memory as much as possible, while avoiding swapping unused lines to reduce memory bandwidth consumption. We evaluate a memory system that consists of 4 GB of on-chip DRAM and 12 GB of off-chip DRAM. Compared to a baseline system that has the same total capacity of 16 GB of off-chip DRAM, SELF improves performance in terms of instructions per cycle by 26.9% and reduces the energy consumption per memory access by 47.9% on average. In contrast, state-of-the-art line-based and page-based PoM designs improve performance by only 9.5% and 9.9%, respectively, against the same baseline system.
KW - Bandwidth Efficient
KW - DRAM cache
KW - Die-stacked DRAM
KW - Hardware-managed PoM
KW - Hybrid Memory Systems
KW - Part of Memory
UR - http://www.scopus.com/inward/record.url?scp=85040507095&partnerID=8YFLogxK
U2 - 10.1109/MASCOTS.2017.23
DO - 10.1109/MASCOTS.2017.23
M3 - Conference contribution
AN - SCOPUS:85040507095
T3 - Proceedings - 25th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2017
SP - 187
EP - 197
BT - Proceedings - 25th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 20 September 2017 through 22 September 2017
ER -