Evaluation of hardware data prefetchers on server processors

Article ACM Computing Surveys ; Volume 52, Issue 3 , 2019 ; 03600300 (ISSN) Bakhshalipour, M ; Tabaeiaghdaei, S ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Association for Computing Machinery 2019

Abstract

Data prefetching, i.e., the act of predicting an application's future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely used approach to hide the long latency of memory accesses. The fruitfulness of data prefetching is evident to both industry and academia: nowadays, almost every high-performance processor incorporates a few data prefetchers for capturing various access patterns of applications; besides, there is a myriad of proposals for data prefetching in the research literature, where each proposal enhances the efficiency of prefetching in a specific way. In this survey, we evaluate the effectiveness of data prefetching in the context of...

Bingo spatial data prefetcher

Article 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, 16 February 2019 through 20 February 2019 ; 2019 , Pages 399-411 ; 9781728114446 (ISBN) Bakhshalipour, M ; Shakerinava, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2019

Abstract

Applications extensively use data objects with a regular and fixed layout, which leads to the recurrence of access patterns over memory regions. Spatial data prefetching techniques exploit this phenomenon to prefetch future memory references and hide the long latency of DRAM accesses. While state-of-the-art spatial data prefetchers are effective at reducing the number of data misses, we observe that there is still significant room for improvement. To select an access pattern for prefetching, existing spatial prefetchers associate observed access patterns with either a short event with a high probability of recurrence or a long event with a low probability of recurrence. Consequently, the...

Blenda: dynamically-reconfigurable stacked DRAM

Article Proceedings of the Annual International Symposium on Microarchitecture, MICRO ; 2024 , Pages 1323-1337 ; 10724451 (ISSN); 979-835035057-9 (ISBN) Bakhshalipour, M ; Zare, H ; Samandi, F ; Golshan, F ; Lotfi-Kamran, P ; Sarbazi-Azad, H

IEEE 2024

Abstract

This paper proposes Blenda, a dynamically-partitioned memory-cache blend architecture for giga-scale die-stacked DRAMs. Blenda architects the stacked DRAM partly as memory and partly as cache, and dynamically adjusts each part's size to workloads' demands. The memory part hosts hot data objects and serves requests to them efficiently (i.e., without metadata overheads). The cache part captures transient data and filters requests to bandwidth-limited off-chip DRAM. Blenda provides three key contributions: (i) Blenda partitions stacked DRAM's capacity in a workload-aware manner: different workloads enjoy different memory-cache configurations. (ii) Blenda is reactive: the configuration is...

Cache replacement policy based on expected hit count

Article IEEE Computer Architecture Letters ; 2017 ; 15566056 (ISSN) Vakil Ghahani, A ; Mahdizadeh Shahri, S ; Lotfi Namin, M ; Bakhshalipour, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

2017

Abstract

Memory-intensive workloads operate on massive amounts of data that cannot be captured by last-level caches (LLCs) of modern processors. Consequently, processors encounter frequent off-chip misses, and hence, lose significant performance potential. One of the components of a modern processor that has a prominent influence on the off-chip miss traffic is the LLC's replacement policy. Existing processors employ a variation of the least recently used (LRU) policy to determine the victim for replacement. Unfortunately, there is a large gap between what LRU offers and what Belady's MIN, the optimal replacement policy, achieves. Belady's MIN requires selecting a victim with the longest reuse distance,...
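
The gap between LRU and Belady's MIN mentioned in this abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a fully associative cache, and the names `lru_misses`, `min_misses`, and `trace` are hypothetical. MIN is modeled in its textbook form, which needs the whole future trace and is therefore an offline reference rather than an implementable policy.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses of a fully associative cache with LRU replacement."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[addr] = True
    return misses


def min_misses(trace, capacity):
    """Count misses under Belady's MIN: evict the block reused farthest ahead."""
    # next_use[i] is the index of the next access to trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # block address -> index of its next use
    misses = 0
    for i, addr in enumerate(trace):
        if addr not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # longest reuse distance
                del cache[victim]
        cache[addr] = next_use[i]  # refresh the next-use index on every access
    return misses


trace = [1, 2, 3] * 3
print(lru_misses(trace, 2), min_misses(trace, 2))
```

On this cyclic trace with a two-block cache, LRU misses on all 9 accesses while MIN misses on 6, because MIN keeps the block that will be reused soonest instead of the one touched most recently.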

Dark silicon and the history of computing

Article Advances in Computers ; Volume 110 , 2018 , Pages 1-33 ; 00652458 (ISSN); 9780128153581 (ISBN) Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2018

Abstract

For many years, computer designers benefitted from Moore's law and Dennard scaling to significantly improve the speed of single-core processors. The failure of Dennard scaling pushed the computer industry toward homogeneous multicore processors for the performance improvement to continue without a significant increase in power consumption. Unfortunately, even homogeneous multicore processors cannot offer the level of energy efficiency required to operate all the cores at the same time in today's and especially tomorrow's technologies. As a result of this lack of energy efficiency, not all the cores in a multicore processor can be functional at the same time. This phenomenon is referred to as dark...

Temporal prefetching

Article Advances in Computers ; 2021 ; 00652458 (ISSN) Lotfi-Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2021

Abstract

Many applications, including big-data server applications, frequently encounter data misses. Consequently, they lose significant performance potential. Fortunately, data accesses of many of these applications follow temporal correlations, which means data accesses repeat over time. Temporal correlations occur because applications usually consist of loops, and hence, the sequence of instructions that constitute the body of a loop repeats many times, leading to data access repetition. Temporal data prefetchers take advantage of temporal correlation to predict and prefetch future memory accesses. In this chapter, we introduce the concept of temporal prefetching and present two instances of...
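
As a rough illustration of temporal correlation, the sketch below replays a stream of miss addresses through a table that remembers, for each address, which address followed it last time. It is a simplified pairwise-correlating model written for this listing, not one of the prefetchers presented in the chapter; the function name `covered_misses` and the integer miss addresses are assumptions.

```python
def covered_misses(misses):
    """Count misses that a last-successor table would have predicted."""
    successor = {}    # miss address -> address that followed it last time
    prev = None       # previous miss address
    predicted = None  # address predicted to miss next
    covered = 0
    for addr in misses:
        if predicted == addr:
            covered += 1                 # the prediction matched this miss
        if prev is not None:
            successor[prev] = addr       # learn: prev was followed by addr
        predicted = successor.get(addr)  # candidate to prefetch next
        prev = addr
    return covered


# A loop body that touches the same four blocks on every iteration.
stream = [10, 20, 30, 40] * 3
print(covered_misses(stream))
```

For this repeating stream, the first iteration only trains the table; from the second iteration on, every miss except the first one is predicted, giving 7 covered misses out of 12.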

Spatial prefetching

Article Advances in Computers ; 2021 ; 00652458 (ISSN) Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2021

Abstract

Many applications extensively use data objects with a regular and fixed layout, which leads to the recurrence of access patterns over memory regions. Spatial data prefetching techniques exploit this phenomenon to prefetch future memory references and hide their long latency. Spatial prefetchers are particularly of interest because they usually only need a small storage budget. In this chapter, we introduce the concept of spatial prefetching and present two instances of spatial data prefetchers, SMS and VLDP. © 2021 Elsevier Inc
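
The recurrence of access patterns over memory regions can be sketched as follows. This is a heavily simplified footprint recorder in the spirit of SMS, not the design described in the chapter: the region size `BLOCKS_PER_REGION`, the class `FootprintPrefetcher`, and the explicit `end_generation` call (standing in for the hardware event that closes a recording interval) are all assumptions made for illustration.

```python
BLOCKS_PER_REGION = 8  # assumed region size, in cache blocks


class FootprintPrefetcher:
    """Records which blocks of a region are touched after a trigger access."""

    def __init__(self):
        self.history = {}  # (pc, offset) of the trigger -> set of block offsets
        self.active = {}   # region -> (trigger key, offsets seen so far)

    def access(self, pc, block):
        """Observe one access and return the block addresses to prefetch."""
        region, offset = divmod(block, BLOCKS_PER_REGION)
        if region in self.active:
            self.active[region][1].add(offset)  # extend the current footprint
            return []
        # First (trigger) access to this region: start recording a footprint
        # and replay the footprint last seen for the same trigger, if any.
        key = (pc, offset)
        self.active[region] = (key, {offset})
        pattern = self.history.get(key, set())
        base = region * BLOCKS_PER_REGION
        return sorted(base + o for o in pattern if o != offset)

    def end_generation(self, region):
        """Stop recording a region and store its footprint for later reuse."""
        key, offsets = self.active.pop(region)
        self.history[key] = offsets


p = FootprintPrefetcher()
for pc, block in [(0x400, 0), (0x404, 2), (0x408, 5)]:
    p.access(pc, block)        # training pass over region 0
p.end_generation(0)
print(p.access(0x400, 16))     # same trigger in region 2
```

When the same instruction touches offset 0 of a new region, the recorded offsets 2 and 5 are replayed relative to that region, so the last call returns blocks 18 and 21. The table holds one small set per trigger, which is consistent with the abstract's remark that spatial prefetchers usually need only a small storage budget.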

Preface

Article Advances in Computers ; Volume 125 , 2022 , Pages ix-x ; 00652458 (ISSN); 9780323851190 (ISBN) Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2022

Temporal prefetching

Article Advances in Computers ; Volume 125 , 2022 , Pages 31-41 ; 00652458 (ISSN); 9780323851190 (ISBN) Lotfi-Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2022

Abstract

Many applications, including big-data server applications, frequently encounter data misses. Consequently, they lose significant performance potential. Fortunately, data accesses of many of these applications follow temporal correlations, which means data accesses repeat over time. Temporal correlations occur because applications usually consist of loops, and hence, the sequence of instructions that constitute the body of a loop repeats many times, leading to data access repetition. Temporal data prefetchers take advantage of temporal correlation to predict and prefetch future memory accesses. In this chapter, we introduce the concept of temporal prefetching and present two instances of...

Spatial prefetching

Article Advances in Computers ; Volume 125 , 2022 , Pages 19-29 ; 00652458 (ISSN); 9780323851190 (ISBN) Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2022

Abstract

Many applications extensively use data objects with a regular and fixed layout, which leads to the recurrence of access patterns over memory regions. Spatial data prefetching techniques exploit this phenomenon to prefetch future memory references and hide their long latency. Spatial prefetchers are particularly of interest because they usually only need a small storage budget. In this chapter, we introduce the concept of spatial prefetching and present two instances of spatial data prefetchers, SMS and VLDP. © 2022 Elsevier Inc

MANA: Microarchitecting a temporal instruction prefetcher

Article IEEE Transactions on Computers ; 2022 , Pages 1-1 ; 00189340 (ISSN) Ansari, A ; Golshan, F ; Barati, R ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

IEEE Computer Society 2022

Abstract

L1 instruction (L1-I) cache misses are a source of performance bottleneck. While many instruction prefetchers have been proposed, most of them leave a considerable potential uncovered. In 2011, Proactive Instruction Fetch (PIF) showed that a hardware prefetcher could effectively eliminate all instruction-cache misses. However, its enormous storage cost makes it impractical. Consequently, reducing the storage cost was the main research focus in instruction prefetching in the past decade. Several instruction prefetchers, including RDIP and Shotgun, were proposed to offer PIF-level performance with significantly lower storage overhead. However, our findings show that there is a considerable...

State-of-the-art data prefetchers

Article Advances in Computers ; Volume 125 , 2022 , Pages 55-67 ; 00652458 (ISSN); 9780323851190 (ISBN) Shakerinava, M ; Golshan, F ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2022

Abstract

We introduced several styles of data prefetching in the past three chapters. The introduced data prefetchers were known for a long time, sometimes for decades. In this chapter, we introduce several state-of-the-art data prefetchers, which have been introduced in the past few years. In particular, we introduce DOMINO, BINGO, MLOP, and RUNAHEAD METADATA. © 2022 Elsevier Inc

Evaluation of data prefetchers

Article Advances in Computers ; Volume 125 , 2022 , Pages 69-89 ; 00652458 (ISSN); 9780323851190 (ISBN) Shakerinava, M ; Golshan, F ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2022

Abstract

We introduced several data prefetchers and qualitatively discussed their strengths and weaknesses. Without quantitative evaluation, the true strengths and weaknesses of a data prefetcher are still vague. To shed light on the strengths and weaknesses of the introduced data prefetchers and to enable the readers to better understand these prefetchers, in this chapter, we quantitatively compare and contrast them. © 2022 Elsevier Inc

MANA: Microarchitecting a temporal instruction prefetcher

Article IEEE Transactions on Computers ; Volume 72, Issue 3 , 2023 , Pages 732-743 ; 00189340 (ISSN) Ansari, A ; Golshan, F ; Barati, R ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

IEEE Computer Society 2023

Abstract

L1 instruction (L1-I) cache misses are a source of performance bottleneck. While many instruction prefetchers have been proposed over the years, most of them leave a considerable potential uncovered. In 2011, Proactive Instruction Fetch (PIF) showed that a hardware prefetcher could effectively eliminate all instruction-cache misses. However, its enormous storage cost makes it an impractical solution. Consequently, reducing the storage cost was the main research focus in instruction prefetching in the past decade. Several instruction prefetchers, including RDIP and Shotgun, were proposed to offer PIF-level performance with significantly lower storage overhead. However, our findings show that...

Cache replacement policy based on expected hit count

Article IEEE Computer Architecture Letters ; Volume 17, Issue 1 , 2018 , Pages 64-67 ; 15566056 (ISSN) Vakil Ghahani, A ; Mahdizadeh Shahri, S ; Lotfi Namin, M. R ; Bakhshalipour, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2018

Abstract

Memory-intensive workloads operate on massive amounts of data that cannot be captured by last-level caches (LLCs) of modern processors. Consequently, processors encounter frequent off-chip misses, and hence, lose significant performance potential. One of the components of a modern processor that has a prominent influence on the off-chip miss traffic is the LLC's replacement policy. Existing processors employ a variation of the least recently used (LRU) policy to determine the victim for replacement. Unfortunately, there is a large gap between what LRU offers and what Belady's MIN, the optimal replacement policy, achieves. Belady's MIN requires selecting a victim with the longest reuse distance,...

Harnessing pairwise-correlating data prefetching with runahead metadata

Article IEEE Computer Architecture Letters ; Volume 19, Issue 2 , 2020 , Pages 130-133 ; 15566056 (ISSN) Golshan, F ; Bakhshalipour, M ; Shakerinava, M ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2020

Abstract

Recent research revisits pairwise-correlating data prefetching due to its extremely low overhead. Pairwise-correlating data prefetching, however, cannot accurately detect where data streams end. As a result, pairwise-correlating data prefetchers either exhibit low accuracy or lose timeliness when performing multi-degree prefetching. In this letter, we propose a novel technique to detect where data streams end and hence, control the multi-degree prefetching in the context of pairwise-correlating prefetchers. The key idea is to have a separate metadata table that operates one step ahead of the main metadata table. This way, the runahead metadata table harnesses the degree of...

An efficient hybrid-switched network-on-chip for chip multiprocessors

Article IEEE Transactions on Computers ; Volume 65, Issue 5 , 2016 , Pages 1656-1662 ; 00189340 (ISSN) Lotfi Kamran, P ; Modarressi, M ; Sarbazi Azad, H ; Sharif University of Technology

IEEE Computer Society 2016

Abstract

Chip multiprocessors (CMPs) require a low-latency interconnect fabric network-on-chip (NoC) to minimize processor stall time on instruction and data accesses that are serviced by the last-level cache (LLC). While packet-switched mesh interconnects sacrifice performance of many-core processors due to NoC-induced delays, existing circuit-switched interconnects do not offer lower network delays as they cannot hide the time it takes to set up a circuit. To address this problem, this work introduces CIMA - a hybrid circuit-switched and packet-switched mesh-based interconnection network that affords low LLC access delays at a small area cost. CIMA uses virtual cut-through (VCT) switching for short...

Near-Ideal networks-on-chip for servers

Article 23rd IEEE Symposium on High Performance Computer Architecture, HPCA 2017, 4 February 2017 through 8 February 2017 ; 2017 , Pages 277-288 ; 15300897 (ISSN); 9781509049851 (ISBN) Lotfi Kamran, P ; Modarressi, M ; Sarbazi Azad, H ; Sharif University of Technology

IEEE Computer Society 2017

Abstract

Server workloads benefit from execution on many-core processors due to their massive request-level parallelism. A key characteristic of server workloads is their large instruction footprints. While a shared last-level cache (LLC) captures the footprints, it necessitates a low-latency network-on-chip (NOC) to minimize the core stall time on accesses serviced by the LLC. As strict quality-of-service requirements preclude the use of lean cores in server processors, we observe that even state-of-the-art single-cycle multi-hop NOCs are far from ideal because they impose significant NOC-induced delays on the LLC access latency, and diminish performance. Most of the NOC delay is due to per-hop...

Binary Taylor Diagrams: An efficient implementation of Taylor expansion Diagrams

Article IEEE International Symposium on Circuits and Systems 2005, ISCAS 2005, Kobe, 23 May 2005 through 26 May 2005 ; 2005 , Pages 424-427 ; 02714310 (ISSN) Hooshmand, A ; Shamshiri, S ; Alisafaee, M ; Lotfi Kamran, P ; Naderi, M ; Navabi, Z ; Alizadeh, B ; Sharif University of Technology

2005

Abstract

This paper presents an efficient way of implementing Taylor expansion Diagrams (TED) that is called Binary Taylor Diagrams (BTD). BTD is based on Taylor series like TED, but uses a binary data structure. So BTD functions are simpler than those of TED. © 2005 IEEE

Reducing writebacks through in-cache displacement

Article ACM Transactions on Design Automation of Electronic Systems ; Volume 24, Issue 2 , 2019 ; 10844309 (ISSN) Bakhshalipour, M ; Faraji, A ; Vakil Ghahani, S. A ; Samandi, F ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Association for Computing Machinery 2019

Abstract

Non-Volatile Memory (NVM) technology is a promising solution to fulfill the ever-growing need for higher capacity in the main memory of modern systems. Despite having many great features, however, NVM's poor write performance remains a severe obstacle, preventing it from being used as a DRAM alternative in the main memory. Most of the prior work targeted optimizing writes at the main memory side and neglected the decisive role of upper-level cache management policies on reducing the number of writes. In this article, we propose a novel cache management policy that attempts to maximize write-coalescing in the on-chip SRAM last-level cache (LLC) for the sake of reducing the number of costly...