
Code layout optimization for Near-Ideal instruction cache

Article IEEE Computer Architecture Letters ; Volume 18, Issue 2, 2019, Pages 124-127 ; 15566056 (ISSN) ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2019

Abstract

Instruction cache misses are a significant source of performance degradation in server workloads because of their large instruction footprints and complex control flow. Due to the importance of reducing the number of instruction cache misses, there has been a myriad of proposals for hardware instruction prefetchers in the past two decades. While effectual, state-of-the-art hardware instruction prefetchers either impose considerable storage overhead or require significant changes in the frontend of a processor. Unlike hardware instruction prefetchers, code-layout optimization techniques profile a program and then reorder the code layout of the program to increase spatial locality, and hence,...

Divide and conquer frontend bottleneck

Article 47th ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2020, 30 May 2020 through 3 June 2020 ; Volume 2020-May, 2020, Pages 65-78 ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2020

Abstract

The frontend stalls caused by instruction and BTB misses are a significant source of performance degradation in server processors. Prefetchers are commonly employed to mitigate the frontend bottleneck. However, next-line prefetchers, which are available in server processors, are incapable of eliminating a considerable number of L1 instruction misses. Temporal instruction prefetchers, on the other hand, effectively remove most of the instruction and BTB misses but impose significant area overhead. Recently, the old idea of BTB-directed instruction prefetching has been revived to address the limitations of temporal instruction prefetchers. While this approach leads to prefetchers with low area...

Evaluation of hardware data prefetchers on server processors

Article ACM Computing Surveys ; Volume 52, Issue 3, 2019 ; 03600300 (ISSN) ; Bakhshalipour, M ; Tabaeiaghdaei, S ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Association for Computing Machinery 2019

Abstract

Data prefetching, i.e., the act of predicting an application's future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely used approach to hide the long latency of memory accesses. The fruitfulness of data prefetching is evident to both industry and academia: nowadays, almost every high-performance processor incorporates a few data prefetchers for capturing various access patterns of applications; besides, there is a myriad of proposals for data prefetching in the research literature, where each proposal enhances the efficiency of prefetching in a specific way. In this survey, we evaluate the effectiveness of data prefetching in the context of...

Bingo spatial data prefetcher

Article 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, 16 February 2019 through 20 February 2019 ; 2019, Pages 399-411 ; 9781728114446 (ISBN) ; Bakhshalipour, M ; Shakerinava, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2019

Abstract

Applications extensively use data objects with a regular and fixed layout, which leads to the recurrence of access patterns over memory regions. Spatial data prefetching techniques exploit this phenomenon to prefetch future memory references and hide the long latency of DRAM accesses. While state-of-the-art spatial data prefetchers are effective at reducing the number of data misses, we observe that there is still significant room for improvement. To select an access pattern for prefetching, existing spatial prefetchers associate observed access patterns to either a short event with a high probability of recurrence or a long event with a low probability of recurrence. Consequently, the...

MANA: Microarchitecting a temporal instruction prefetcher

Article IEEE Transactions on Computers ; 2022, Pages 1-1 ; 00189340 (ISSN) ; Ansari, A ; Golshan, F ; Barati, R ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

IEEE Computer Society 2022

Abstract

L1 instruction (L1-I) cache misses are a source of performance bottleneck. While many instruction prefetchers have been proposed, most of them leave a considerable potential uncovered. In 2011, Proactive Instruction Fetch (PIF) showed that a hardware prefetcher could effectively eliminate all instruction-cache misses. However, its enormous storage cost makes it impractical. Consequently, reducing the storage cost was the main research focus in instruction prefetching in the past decade. Several instruction prefetchers, including RDIP and Shotgun, were proposed to offer PIF-level performance with significantly lower storage overhead. However, our findings show that there is a considerable...

State-of-the-art data prefetchers

Article Advances in Computers ; Volume 125, 2022, Pages 55-67 ; 00652458 (ISSN); 9780323851190 (ISBN) ; Shakerinava, M ; Golshan, F ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2022

Abstract

We introduced several styles of data prefetching in the past three chapters. The introduced data prefetchers were known for a long time, sometimes for decades. In this chapter, we introduce several state-of-the-art data prefetchers, which have been introduced in the past few years. In particular, we introduce DOMINO, BINGO, MLOP, and RUNAHEAD METADATA. © 2022 Elsevier Inc

Evaluation of data prefetchers

Article Advances in Computers ; Volume 125, 2022, Pages 69-89 ; 00652458 (ISSN); 9780323851190 (ISBN) ; Shakerinava, M ; Golshan, F ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Academic Press Inc 2022

Abstract

We introduced several data prefetchers and qualitatively discussed their strengths and weaknesses. Without quantitative evaluation, the true strengths and weaknesses of a data prefetcher are still vague. To shed light on the strengths and weaknesses of the introduced data prefetchers and to enable the readers to better understand these prefetchers, in this chapter, we quantitatively compare and contrast them. © 2022 Elsevier Inc

Cache replacement policy based on expected hit count

Article IEEE Computer Architecture Letters ; 2017 ; 15566056 (ISSN) ; Vakil Ghahani, A ; Mahdizadeh Shahri, S ; Lotfi Namin, M ; Bakhshalipour, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Abstract

Memory-intensive workloads operate on massive amounts of data that cannot be captured by last-level caches (LLCs) of modern processors. Consequently, processors encounter frequent off-chip misses, and hence, lose significant performance potential. One of the components of a modern processor that has a prominent influence on the off-chip miss traffic is LLC's replacement policy. Existing processors employ a variation of least recently used (LRU) policy to determine the victim for replacement. Unfortunately, there is a large gap between what LRU offers and that of Belady's MIN, which is the optimal replacement policy. Belady's MIN requires selecting a victim with the longest reuse distance,...

Fast data delivery for many-core processors

Article IEEE Transactions on Computers ; Volume 67, Issue 10, 2018, Pages 1416-1429 ; 00189340 (ISSN) ; Bakhshalipour, M ; Lotfi Kamran, P ; Mazloumi, A ; Samandi, F ; Naderan Tahan, M ; Modarressi, M ; Sarbazi Azad, H ; Sharif University of Technology

Abstract

Server workloads operate on large volumes of data. As a result, processors executing these workloads encounter frequent L1-D misses. In a many-core processor, an L1-D miss causes a request packet to be sent to an LLC slice and a response packet to be sent back to the L1-D, which results in high overhead. While prior work targeted response packets, this work focuses on accelerating the request packets. Unlike aggressive OoO cores, simpler cores used in many-core processors cannot hide the latency of L1-D request packets. We observe that LLC slices that serve L1-D misses are strongly temporally correlated. Taking advantage of this observation, we design a simple and accurate predictor. Upon...

Domino temporal data prefetcher

Article Proceedings - International Symposium on High-Performance Computer Architecture ; Volume 2018-February, 2018, Pages 131-142 ; 15300897 (ISSN); 9781538636596 (ISBN) ; Bakhshalipour, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Bitmain; DeePhi; et al.; Huawei; IBM; Intel ; Sharif University of Technology

IEEE Computer Society 2018

Abstract

Big-data server applications frequently encounter data misses, and hence, lose significant performance potential. One way to reduce the number of data misses or their effect is data prefetching. As data accesses have high temporal correlations, temporal prefetching techniques are promising for them. While state-of-the-art temporal prefetching techniques are effective at reducing the number of data misses, we observe that there is a significant gap between what they offer and the opportunity. This work aims to improve the effectiveness of temporal prefetching techniques. We identify the lookup mechanism of existing temporal prefetchers responsible for the large gap between what they offer and...

Cache replacement policy based on expected hit count

Article IEEE Computer Architecture Letters ; Volume 17, Issue 1, 2018, Pages 64-67 ; 15566056 (ISSN) ; Vakil Ghahani, A ; Mahdizadeh Shahri, S ; Lotfi Namin, M. R ; Bakhshalipour, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2018

Abstract

Memory-intensive workloads operate on massive amounts of data that cannot be captured by last-level caches (LLCs) of modern processors. Consequently, processors encounter frequent off-chip misses, and hence, lose significant performance potential. One of the components of a modern processor that has a prominent influence on the off-chip miss traffic is LLC's replacement policy. Existing processors employ a variation of least recently used (LRU) policy to determine the victim for replacement. Unfortunately, there is a large gap between what LRU offers and that of Belady's MIN, which is the optimal replacement policy. Belady's MIN requires selecting a victim with the longest reuse distance,...
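
The gap between LRU and Belady's MIN mentioned in this abstract can be illustrated with a small simulation. The following is only a minimal sketch, assuming a fully associative cache and a trace that is known in advance; the function name `simulate` and its parameters are hypothetical and are not taken from the paper, and the sketch does not reproduce the expected-hit-count policy the article proposes.

```python
def simulate(trace, capacity, policy):
    """Count misses for a fully associative cache holding `capacity` blocks.

    policy is "lru" (evict the least recently used block) or "min"
    (Belady's MIN: evict the block whose next use is farthest away).
    """
    cache = []   # ordered from least to most recently used
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            # Hit: move the block to the most-recently-used position.
            cache.remove(block)
            cache.append(block)
            continue
        misses += 1
        if len(cache) == capacity:
            if policy == "lru":
                victim = cache[0]
            else:
                def next_use(b):
                    # Position of the next reference to b, or infinity
                    # if b is never referenced again.
                    try:
                        return trace.index(b, i + 1)
                    except ValueError:
                        return float("inf")
                victim = max(cache, key=next_use)
            cache.remove(victim)
        cache.append(block)
    return misses


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "c"]
    print("LRU misses:", simulate(trace, 2, "lru"))
    print("MIN misses:", simulate(trace, 2, "min"))
```

On the cyclic trace a, b, c, a, b, c with room for two blocks, LRU misses on every access (6 misses) while MIN misses 4 times. MIN needs the future of the trace, which real hardware does not have.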

Reducing writebacks through in-cache displacement

Article ACM Transactions on Design Automation of Electronic Systems ; Volume 24, Issue 2, 2019 ; 10844309 (ISSN) ; Bakhshalipour, M ; Faraji, A ; Vakil Ghahani, S. A ; Samandi, F ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Association for Computing Machinery 2019

Abstract

Non-Volatile Memory (NVM) technology is a promising solution to fulfill the ever-growing need for higher capacity in the main memory of modern systems. Despite having many great features, however, NVM's poor write performance remains a severe obstacle, preventing it from being used as a DRAM alternative in the main memory. Most of the prior work targeted optimizing writes at the main memory side and neglected the decisive role of upper-level cache management policies on reducing the number of writes. In this article, we propose a novel cache management policy that attempts to maximize write-coalescing in the on-chip SRAM last-level cache (LLC) for the sake of reducing the number of costly...

Harnessing pairwise-correlating data prefetching with runahead metadata

Article IEEE Computer Architecture Letters ; Volume 19, Issue 2, 2020, Pages 130-133 ; 15566056 (ISSN) ; Golshan, F ; Bakhshalipour, M ; Shakerinava, M ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2020

Abstract

Recent research revisits pairwise-correlating data prefetching due to its extremely low overhead. Pairwise-correlating data prefetching, however, cannot accurately detect where data streams end. As a result, pairwise-correlating data prefetchers either exhibit low accuracy or lose timeliness when they perform multi-degree prefetching. In this letter, we propose a novel technique to detect where data streams end and, hence, control multi-degree prefetching in the context of pairwise-correlating prefetchers. The key idea is to have a separate metadata table that operates one step ahead of the main metadata table. This way, the runahead metadata table harnesses the degree of...

Data-Aware compression of neural networks

Article IEEE Computer Architecture Letters ; Volume 20, Issue 2, 2021, Pages 94-97 ; 15566056 (ISSN) ; Falahati, H ; Peyro, M ; Amini, H ; Taghian, M ; Sadrosadati, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2021

Abstract

Deep neural networks (DNNs) are getting deeper and larger, which intensifies their data movement and compute demands. Prior work focuses on reducing data movement and computation by exploiting sparsity and similarity. However, none of it exploits input similarity; it focuses only on sparsity and weight similarity. Synergistically analysing the similarity and sparsity of inputs and weights, we show that memory accesses and computations can be reduced by 5.7× and 4.1×, more than what can be decreased by exploiting only sparsity, and 3.9× and 2.1×, more than what can be decreased by exploiting only weight similarity. We propose a new data-aware compression approach, called DANA, to effectively...

Binary Taylor Diagrams: An efficient implementation of Taylor expansion Diagrams

Article IEEE International Symposium on Circuits and Systems 2005, ISCAS 2005, Kobe, 23 May 2005 through 26 May 2005 ; 2005, Pages 424-427 ; 02714310 (ISSN) ; Hooshmand, A ; Shamshiri, S ; Alisafaee, M ; Lotfi Kamran, P ; Naderi, M ; Navabi, Z ; Alizadeh, B ; Sharif University of Technology

2005

Abstract

This paper presents an efficient way of implementing Taylor expansion Diagrams (TED) that is called Binary Taylor Diagrams (BTD). BTD is based on Taylor series like TED, but uses a binary data structure. So BTD functions are simpler than those of TED. © 2005 IEEE