Search for: lotfi-kamran--pejman

    Improving the Efficiency of On-chip 3D Stacked DRAM in Server Processors

    , M.Sc. Thesis Sharif University of Technology Samandi, Farid (Author) ; Sarbazi-Azad, Hamid (Supervisor) ; Lotfi Kamran, Pejman (Co-Advisor)
    Abstract
    Big-data server workloads have vast datasets, and hence, frequently access off-chip memory for data. Consequently, server workloads lose significant performance potential due to off-chip latency and bandwidth walls. Recent research advocates using 3D stacked DRAM to break the walls. As 3D stacked DRAM cannot accommodate the whole datasets of server workloads, most proposals use 3D DRAM as a large cache. Unfortunately, a large DRAM cache imposes latency overhead due to (1) the need for tag lookup and (2) inefficient utilization of on-chip and off-chip bandwidth, and as a result, lowers the benefits of 3D stacked DRAM. Moreover, storing the tags of a multi-gigabyte DRAM cache requires changes... 
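    The tag-lookup latency that the abstract highlights is easy to see in a toy model. The sketch below is a hypothetical, direct-mapped DRAM cache in which the tags themselves reside in the stacked DRAM, so every access pays a stacked-DRAM tag read before either the data read (hit) or the off-chip access (miss). All names and latency numbers are illustrative assumptions, not the thesis's design.

```python
# Hypothetical latency model of a direct-mapped DRAM cache whose tags live
# in the stacked DRAM itself (illustrative numbers, not from the thesis).
BLOCK = 64                    # cache block size in bytes
DRAM_CACHE_BLOCKS = 1 << 24   # 16M blocks of 64 B = 1 GiB of cache
TAG_READ_CYCLES = 40          # one stacked-DRAM access to fetch the tag
DATA_READ_CYCLES = 40         # one stacked-DRAM access to fetch the data
OFF_CHIP_CYCLES = 200         # off-chip DRAM access on a miss


class DRAMCache:
    def __init__(self):
        self.tags = [None] * DRAM_CACHE_BLOCKS  # tag array (conceptually in DRAM)

    def access(self, addr):
        """Return (hit, latency_in_cycles) for a physical address."""
        block = addr // BLOCK
        index = block % DRAM_CACHE_BLOCKS
        tag = block // DRAM_CACHE_BLOCKS

        latency = TAG_READ_CYCLES          # every access pays the tag lookup
        if self.tags[index] == tag:
            return True, latency + DATA_READ_CYCLES
        # Miss: go off chip, then fill the block.
        self.tags[index] = tag
        return False, latency + OFF_CHIP_CYCLES


cache = DRAMCache()
print(cache.access(0x1234_5678))  # (False, 240): the tag lookup is wasted on a miss
print(cache.access(0x1234_5678))  # (True, 80)
```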

    Designing Instruction Prefetcher with Low Area Overhead for Server Workloads

    , M.Sc. Thesis Sharif University of Technology Faghih, Faezeh (Author) ; Sarbazi Azad, Hamid (Supervisor) ; Lotfi Kamran, Pejman (Co-Supervisor)
    Abstract
    L1 instruction cache misses create a crucial performance bottleneck for server applications. Server applications extensively use operating system services, and as such, have large instruction footprints that dwarf the instruction cache size. Meanwhile, fast access requirements preclude enlarging the instruction cache to hold the whole instruction footprint of current server workloads. Prior works proposed hardware prefetching schemes to eliminate or reduce the effect of instruction cache misses. They exploit the fact that server application instruction sequences are repetitive, so by recording and prefetching based on such sequences, L1 instruction misses can be reduced. While they... 
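    As a rough illustration of the record-and-replay idea described above, the sketch below logs the L1-I miss stream and, when a miss address recurs, prefetches the few blocks that followed it last time. The class name, the unbounded history, and the prefetch degree are assumptions for illustration; the thesis's contribution is precisely to bound such storage, which this sketch does not attempt.

```python
class SequencePrefetcher:
    """Record the L1-I miss stream and replay it on recurrence (illustrative sketch)."""

    def __init__(self, degree=4):
        self.history = []     # ordered log of miss block addresses
        self.index = {}       # miss block address -> its last position in the log
        self.degree = degree  # how many subsequent blocks to prefetch

    def on_miss(self, block_addr):
        prefetches = []
        pos = self.index.get(block_addr)
        if pos is not None:
            # Replay the blocks that followed this miss the last time it occurred.
            end = min(pos + 1 + self.degree, len(self.history))
            prefetches = self.history[pos + 1:end]
        self.index[block_addr] = len(self.history)
        self.history.append(block_addr)
        return prefetches


pf = SequencePrefetcher()
for miss in [0xA, 0xB, 0xC, 0xD] * 2:          # the same miss sequence repeats
    print(hex(miss), [hex(p) for p in pf.on_miss(miss)])
# On the second pass, the miss to 0xA triggers prefetches for 0xB, 0xC, 0xD.
```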

    Speculative Path Setup for Fast Data Delivery in Server Processors

    , M.Sc. Thesis Sharif University of Technology Bakhshalipour, Mohammad (Author) ; Sarbazi-Azad, Hamid (Supervisor) ; Lotfi-Kamran, Pejman (Co-Advisor)
    Abstract
    Server workloads operate on large volumes of data. As a result, processors executing these workloads encounter frequent L1-D misses. An L1-D miss causes a request packet to be sent to an LLC slice and a response packet to be sent back to the L1-D, which results in high overhead. While prior work targeted response packets, this work focuses on accelerating the request packets through a simple-yet-effective predictor. Upon the occurrence of an L1-D miss, the predictor identifies the LLC slice that will serve the next L1-D miss and a circuit will be set up for the upcoming miss request to accelerate its transmission. When the upcoming miss occurs, the resulting request can use the already... 
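    A minimal sketch of the kind of slice prediction the abstract describes is shown below: on every L1-D miss, it records which LLC slice followed which, and predicts the slice of the next miss so that a circuit toward it could be set up in advance. The address-interleaved slice mapping, the table organization, and the class name are assumptions for illustration, not the thesis's predictor.

```python
class NextSlicePredictor:
    """Predict which LLC slice will serve the next L1-D miss (sketch)."""

    def __init__(self, num_slices=16):
        self.num_slices = num_slices
        self.table = {}          # current slice -> slice of the miss that followed it
        self.prev_slice = None   # slice of the most recent miss

    def home_slice(self, block_addr):
        # Assumed address-interleaved LLC: low-order block-address bits pick the slice.
        return block_addr % self.num_slices

    def on_miss(self, block_addr):
        """Record the observed transition and return the predicted next slice (or None)."""
        slice_id = self.home_slice(block_addr)
        if self.prev_slice is not None:
            self.table[self.prev_slice] = slice_id   # learn: previous miss -> this slice
        self.prev_slice = slice_id
        return self.table.get(slice_id)              # where did the next miss go last time?
```

    A returned slice identifier would be used to set up the circuit ahead of the upcoming request; a None result would simply fall back to ordinary packet-switched delivery.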

    Dark silicon and the history of computing

    , Article Advances in Computers ; Volume 110 , 2018 , Pages 1-33 ; 00652458 (ISSN); 9780128153581 (ISBN) Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Academic Press Inc  2018
    Abstract
    For many years, computer designers benefitted from Moore's law and Dennard scaling to significantly improve the speed of single-core processors. The failure of Dennard scaling pushed the computer industry toward homogeneous multicore processors for the performance improvement to continue without a significant increase in power consumption. Unfortunately, even homogeneous multicore processors cannot offer the level of energy efficiency required to operate all the cores at the same time in today's and especially tomorrow's technologies. As a result of this lack of energy efficiency, not all the cores in a multicore processor can be functional at the same time. This phenomenon is referred to as dark... 

    Temporal prefetching

    , Article Advances in Computers ; 2021 ; 00652458 (ISSN) Lotfi-Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Academic Press Inc  2021
    Abstract
    Many applications, including big-data server applications, frequently encounter data misses. Consequently, they lose significant performance potential. Fortunately, data accesses of many of these applications follow temporal correlations, which means data accesses repeat over time. Temporal correlations occur because applications usually consist of loops, and hence, the sequence of instructions that constitute the body of a loop repeats many times, leading to data access repetition. Temporal data prefetchers take advantage of temporal correlation to predict and prefetch future memory accesses. In this chapter, we introduce the concept of temporal prefetching and present two instances of... 
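    The simplest way to exploit the temporal correlation described above is a pairwise correlation table: remember which miss followed each miss, and prefetch that successor when the miss recurs. The sketch below illustrates only this basic idea; the chapter's actual prefetchers maintain longer sequences and more elaborate metadata.

```python
class PairCorrelationPrefetcher:
    """Remember which data miss followed each miss and prefetch it on recurrence (sketch)."""

    def __init__(self):
        self.successor = {}    # miss address -> address of the miss that followed it
        self.last_miss = None

    def on_miss(self, addr):
        if self.last_miss is not None:
            self.successor[self.last_miss] = addr   # learn the observed pair
        self.last_miss = addr
        return self.successor.get(addr)             # predict the successor seen last time


pf = PairCorrelationPrefetcher()
for addr in [0x10, 0x80, 0x33, 0x10]:              # the pair 0x10 -> 0x80 repeats
    pred = pf.on_miss(addr)
    print(hex(addr), hex(pred) if pred is not None else None)
# The second miss to 0x10 predicts 0x80.
```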

    Spatial prefetching

    , Article Advances in Computers ; 2021 ; 00652458 (ISSN) Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Academic Press Inc  2021
    Abstract
    Many applications extensively use data objects with a regular and fixed layout, which leads to the recurrence of access patterns over memory regions. Spatial data prefetching techniques exploit this phenomenon to prefetch future memory references and hide their long latency. Spatial prefetchers are particularly of interest because they usually only need a small storage budget. In this chapter, we introduce the concept of spatial prefetching and present two instances of spatial data prefetchers, SMS and VLDP. © 2021 Elsevier Inc  
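    The sketch below illustrates the region-footprint idea in the spirit of SMS: the first access to a region becomes its trigger, the blocks touched afterwards are recorded as a bit vector keyed by that trigger, and the learned footprint is replayed when the same trigger recurs in a new region. The region size, the trigger key, and the continuous (rather than generation-based) footprint update are simplifying assumptions, not the chapter's exact designs.

```python
REGION_BLOCKS = 32                     # cache blocks per spatial region (illustrative)
BLOCK = 64                             # bytes per cache block
REGION_BYTES = REGION_BLOCKS * BLOCK


class FootprintPrefetcher:
    """SMS-flavored sketch: learn per-trigger footprints and replay them in new regions."""

    def __init__(self):
        self.active = {}    # region base address -> (trigger key, footprint bit vector)
        self.learned = {}   # trigger key -> last observed footprint for that trigger

    def access(self, pc, addr):
        region = addr - addr % REGION_BYTES
        offset = (addr - region) // BLOCK
        prefetches = []
        if region not in self.active:
            # First access to the region: it becomes the trigger for this region.
            key = (pc, offset)
            self.active[region] = (key, 1 << offset)
            footprint = self.learned.get(key, 0)
            # Replay the footprint learned the last time this trigger was seen.
            prefetches = [region + b * BLOCK
                          for b in range(REGION_BLOCKS)
                          if (footprint >> b) & 1 and b != offset]
        else:
            key, fp = self.active[region]
            fp |= 1 << offset
            self.active[region] = (key, fp)
            self.learned[key] = fp          # keep the trigger's footprint up to date
        return prefetches
```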

    Preface

    , Article Advances in Computers ; Volume 125 , 2022 , Pages ix-x ; 00652458 (ISSN); 9780323851190 (ISBN) Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Academic Press Inc  2022

    Temporal prefetching

    , Article Advances in Computers ; Volume 125 , 2022 , Pages 31-41 ; 00652458 (ISSN); 9780323851190 (ISBN) Lotfi-Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Academic Press Inc  2022
    Abstract
    Many applications, including big-data server applications, frequently encounter data misses. Consequently, they lose significant performance potential. Fortunately, data accesses of many of these applications follow temporal correlations, which means data accesses repeat over time. Temporal correlations occur because applications usually consist of loops, and hence, the sequence of instructions that constitute the body of a loop repeats many times, leading to data access repetition. Temporal data prefetchers take advantage of temporal correlation to predict and prefetch future memory accesses. In this chapter, we introduce the concept of temporal prefetching and present two instances of... 

    Spatial prefetching

    , Article Advances in Computers ; Volume 125 , 2022 , Pages 19-29 ; 00652458 (ISSN); 9780323851190 (ISBN) Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Academic Press Inc  2022
    Abstract
    Many applications extensively use data objects with a regular and fixed layout, which leads to the recurrence of access patterns over memory regions. Spatial data prefetching techniques exploit this phenomenon to prefetch future memory references and hide their long latency. Spatial prefetchers are particularly of interest because they usually only need a small storage budget. In this chapter, we introduce the concept of spatial prefetching and present two instances of spatial data prefetchers, SMS and VLDP. © 2022 Elsevier Inc  

    An efficient hybrid-switched network-on-chip for chip multiprocessors

    , Article IEEE Transactions on Computers ; Volume 65, Issue 5 , 2016 , Pages 1656-1662 ; 00189340 (ISSN) Lotfi Kamran, P ; Modarressi, M ; Sarbazi Azad, H ; Sharif University of Technology
    IEEE Computer Society  2016
    Abstract
    Chip multiprocessors (CMPs) require a low-latency interconnect fabric, a network-on-chip (NoC), to minimize processor stall time on instruction and data accesses that are serviced by the last-level cache (LLC). While packet-switched mesh interconnects sacrifice the performance of many-core processors due to NoC-induced delays, existing circuit-switched interconnects do not offer lower network delays as they cannot hide the time it takes to set up a circuit. To address this problem, this work introduces CIMA - a hybrid circuit-switched and packet-switched mesh-based interconnection network that affords low LLC access delays at a small area cost. CIMA uses virtual cut-through (VCT) switching for short... 
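    The setup-time argument in the abstract can be made concrete with a back-of-the-envelope latency model, sketched below: packet switching pays a router pipeline delay at every hop, while a circuit pays only link traversal but needs a one-time setup that must be hidden to be worthwhile. All latency numbers are illustrative assumptions and do not describe CIMA's actual pipeline.

```python
# Illustrative per-hop latency model for hybrid switching. All numbers are
# assumptions for this sketch, not CIMA's parameters.
ROUTER_CYCLES = 3      # per-hop router pipeline delay (packet switching)
LINK_CYCLES = 1        # per-hop link traversal delay
SETUP_CYCLES = 8       # one-time cost of reserving a circuit


def packet_switched_latency(hops):
    return hops * (ROUTER_CYCLES + LINK_CYCLES)


def circuit_switched_latency(hops, setup_hidden):
    # A hybrid design pays off only if the setup can be overlapped with other
    # work; otherwise the setup lands on the critical path of the access.
    return (0 if setup_hidden else SETUP_CYCLES) + hops * LINK_CYCLES


for hops in (2, 5, 8):
    print(f"{hops} hops: packet={packet_switched_latency(hops)} "
          f"circuit(hidden setup)={circuit_switched_latency(hops, True)} "
          f"circuit(exposed setup)={circuit_switched_latency(hops, False)}")
```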

    Near-Ideal networks-on-chip for servers

    , Article 23rd IEEE Symposium on High Performance Computer Architecture, HPCA 2017, 4 February 2017 through 8 February 2017 ; 2017 , Pages 277-288 ; 15300897 (ISSN); 9781509049851 (ISBN) Lotfi Kamran, P ; Modarressi, M ; Sarbazi Azad, H ; Sharif University of Technology
    IEEE Computer Society  2017
    Abstract
    Server workloads benefit from execution on many-core processors due to their massive request-level parallelism. A key characteristic of server workloads is their large instruction footprints. While a shared last-level cache (LLC) captures the footprints, it necessitates a low-latency network-on-chip (NOC) to minimize the core stall time on accesses serviced by the LLC. As strict quality-of-service requirements preclude the use of lean cores in server processors, we observe that even state-of-the-art single-cycle multi-hop NOCs are far from ideal because they impose significant NOC-induced delays on the LLC access latency, and diminish performance. Most of the NOC delay is due to per-hop... 

    Evaluation of hardware data prefetchers on server processors

    , Article ACM Computing Surveys ; Volume 52, Issue 3 , 2019 ; 03600300 (ISSN) Bakhshalipour, M ; Tabaeiaghdaei, S ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Association for Computing Machinery  2019
    Abstract
    Data prefetching, i.e., the act of predicting an application's future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely used approach to hide the long latency of memory accesses. The fruitfulness of data prefetching is evident to both industry and academia: Nowadays, almost every high-performance processor incorporates a few data prefetchers for capturing various access patterns of applications; besides, there is a myriad of proposals for data prefetching in the research literature, where each proposal enhances the efficiency of prefetching in a specific way. In this survey, we evaluate the effectiveness of data prefetching in the context of... 

    Bingo spatial data prefetcher

    , Article 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, 16 February 2019 through 20 February 2019 ; 2019 , Pages 399-411 ; 9781728114446 (ISBN) Bakhshalipour, M ; Shakerinava, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2019
    Abstract
    Applications extensively use data objects with a regular and fixed layout, which leads to the recurrence of access patterns over memory regions. Spatial data prefetching techniques exploit this phenomenon to prefetch future memory references and hide the long latency of DRAM accesses. While state-of-the-art spatial data prefetchers are effective at reducing the number of data misses, we observe that there is still significant room for improvement. To select an access pattern for prefetching, existing spatial prefetchers associate observed access patterns to either a short event with a high probability of recurrence or a long event with a low probability of recurrence. Consequently, the... 
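    Bingo's published idea, heavily simplified here, is to associate each observed footprint with both a long event (such as PC+address), which recurs rarely but predicts accurately, and a short event (such as PC+offset), which recurs often but predicts less accurately, and to prefer the long-event match at lookup time. The sketch below illustrates only that lookup policy; it uses two separate tables for clarity, whereas the paper's microarchitecture is organized differently.

```python
class BingoLikeHistory:
    """Sketch of Bingo's lookup idea: one footprint per pattern, looked up with the
    long event (PC + trigger address) first, then the short event (PC + offset)."""

    def __init__(self):
        self.long_table = {}    # (pc, trigger block address) -> footprint bit vector
        self.short_table = {}   # (pc, offset within region)  -> footprint bit vector

    def record(self, pc, trigger_addr, offset, footprint):
        self.long_table[(pc, trigger_addr)] = footprint
        self.short_table[(pc, offset)] = footprint

    def lookup(self, pc, trigger_addr, offset):
        # Prefer the long event: it recurs rarely but predicts very accurately.
        fp = self.long_table.get((pc, trigger_addr))
        if fp is not None:
            return fp, "long"
        # Fall back to the short event: less accurate but recurs frequently.
        fp = self.short_table.get((pc, offset))
        return (fp, "short") if fp is not None else (None, None)
```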

    Cache replacement policy based on expected hit count

    , Article IEEE Computer Architecture Letters ; 2017 ; 15566056 (ISSN) Vakil Ghahani, A ; Mahdizadeh Shahri, S ; Lotfi Namin, M ; Bakhshalipour, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    2017
    Abstract
    Memory-intensive workloads operate on massive amounts of data that cannot be captured by last-level caches (LLCs) of modern processors. Consequently, processors encounter frequent off-chip misses, and hence, lose significant performance potential. One of the components of a modern processor that has a prominent influence on the off-chip miss traffic is the LLC's replacement policy. Existing processors employ a variation of the least recently used (LRU) policy to determine the victim for replacement. Unfortunately, there is a large gap between what LRU offers and what Belady's MIN, the optimal replacement policy, achieves. Belady's MIN requires selecting a victim with the longest reuse distance,... 
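    The title's idea, victim selection by expected hit count rather than by recency, can be sketched with a deliberately simplistic estimator: track how many hits each block receives per residency and evict the block whose past average is lowest. The estimator, the per-address bookkeeping, and the class below are illustrative assumptions; the paper's actual EHC mechanism and its storage organization differ.

```python
class EHCSet:
    """One cache set with victim selection by lowest expected hit count (sketch).

    The estimator here is deliberately simplistic: a block's expectation is the
    running average of the hits its address received during previous residencies.
    """

    def __init__(self, ways=16):
        self.ways = ways
        self.blocks = {}        # resident address -> hits during the current residency
        self.avg_hits = {}      # address -> running average of hits per residency

    def access(self, addr):
        if addr in self.blocks:
            self.blocks[addr] += 1          # hit
            return True
        if len(self.blocks) >= self.ways:   # miss in a full set: pick a victim
            victim = min(self.blocks, key=lambda a: self.avg_hits.get(a, 0.0))
            hits = self.blocks.pop(victim)
            prev = self.avg_hits.get(victim, hits)
            self.avg_hits[victim] = (prev + hits) / 2   # update the estimate
        self.blocks[addr] = 0               # fill the new block
        return False
```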

    Cache replacement policy based on expected hit count

    , Article IEEE Computer Architecture Letters ; Volume 17, Issue 1 , 2018 , Pages 64-67 ; 15566056 (ISSN) Vakil Ghahani, A ; Mahdizadeh Shahri, S ; Lotfi Namin, M. R ; Bakhshalipour, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2018
    Abstract
    Memory-intensive workloads operate on massive amounts of data that cannot be captured by last-level caches (LLCs) of modern processors. Consequently, processors encounter frequent off-chip misses, and hence, lose significant performance potential. One of the components of a modern processor that has a prominent influence on the off-chip miss traffic is the LLC's replacement policy. Existing processors employ a variation of the least recently used (LRU) policy to determine the victim for replacement. Unfortunately, there is a large gap between what LRU offers and what Belady's MIN, the optimal replacement policy, achieves. Belady's MIN requires selecting a victim with the longest reuse distance,... 

    MANA: Microarchitecting a temporal instruction prefetcher

    , Article IEEE Transactions on Computers ; 2022 , Pages 1-1 ; 00189340 (ISSN) Ansari, A ; Golshan, F ; Barati, R ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    IEEE Computer Society  2022
    Abstract
    L1 instruction (L1-I) cache misses are a source of performance bottleneck. While many instruction prefetchers have been proposed, most of them leave considerable potential uncovered. In 2011, Proactive Instruction Fetch (PIF) showed that a hardware prefetcher could effectively eliminate all instruction-cache misses. However, its enormous storage cost makes it impractical. Consequently, reducing the storage cost was the main research focus in instruction prefetching in the past decade. Several instruction prefetchers, including RDIP and Shotgun, were proposed to offer PIF-level performance with significantly lower storage overhead. However, our findings show that there is a considerable... 

    State-of-the-art data prefetchers

    , Article Advances in Computers ; Volume 125 , 2022 , Pages 55-67 ; 00652458 (ISSN); 9780323851190 (ISBN) Shakerinava, M ; Golshan, F ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Academic Press Inc  2022
    Abstract
    We introduced several styles of data prefetching in the past three chapters. The introduced data prefetchers were known for a long time, sometimes for decades. In this chapter, we introduce several state-of-the-art data prefetchers, which have been introduced in the past few years. In particular, we introduce DOMINO, BINGO, MLOP, and RUNAHEAD METADATA. © 2022 Elsevier Inc  

    Evaluation of data prefetchers

    , Article Advances in Computers ; Volume 125 , 2022 , Pages 69-89 ; 00652458 (ISSN); 9780323851190 (ISBN) Shakerinava, M ; Golshan, F ; Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Academic Press Inc  2022
    Abstract
    We introduced several data prefetchers and qualitatively discussed their strengths and weaknesses. Without quantitative evaluation, the true strengths and weaknesses of a data prefetcher are still vague. To shed light on the strengths and weaknesses of the introduced data prefetchers and to enable the readers to better understand these prefetchers, in this chapter, we quantitatively compare and contrast them. © 2022 Elsevier Inc  

    MANA: Microarchitecting a temporal instruction prefetcher

    , Article IEEE Transactions on Computers ; Volume 72, Issue 3 , 2023 , Pages 732-743 ; 00189340 (ISSN) Ansari, A ; Golshan, F ; Barati, R ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    IEEE Computer Society  2023
    Abstract
    L1 instruction (L1-I) cache misses are a source of performance bottleneck. While many instruction prefetchers have been proposed over the years, most of them leave considerable potential uncovered. In 2011, Proactive Instruction Fetch (PIF) showed that a hardware prefetcher could effectively eliminate all instruction-cache misses. However, its enormous storage cost makes it an impractical solution. Consequently, reducing the storage cost was the main research focus in instruction prefetching in the past decade. Several instruction prefetchers, including RDIP and Shotgun, were proposed to offer PIF-level performance with significantly lower storage overhead. However, our findings show that... 

    Enhanced TED: a new data structure for RTL verification

    , Article 21st International Conference on VLSI Design, VLSI DESIGN 2008, Hyderabad, 4 January 2008 through 8 January 2008 ; 2008 , Pages 481-486 ; 0769530834 (ISBN); 9780769530833 (ISBN) Lotfi Kamran, P ; Massoumi, M ; Mirzaei, M ; Navabi, Z ; VLSI Society of India ; Sharif University of Technology
    2008
    Abstract
    This work provides a canonical representation for manipulation of RTL designs. Work has already been done on a canonical and graph-based representation called Taylor Expansion Diagram (TED). Although TED can effectively be used to represent arithmetic expressions at the word-level, it is not memory efficient in representing bit-level logic expressions. In addition, TED cannot represent Boolean expressions at the word-level (vector-level). In this paper, we present modifications to TED that will improve its ability for bit-level logic representation while enhancing its robustness to represent word-level Boolean expressions. It will be shown that for bit-level logic expressions, the Enhanced...
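    A TED decomposes a word-level polynomial expression with respect to an ordered list of variables using the Taylor expansion f = f(x=0) + x * f'(x=0) + (x^2/2) * f''(x=0) + ..., which terminates because the expression is polynomial in each variable; canonicity then comes from fixing the variable order and sharing identical subgraphs. The sketch below performs one level of that decomposition with sympy and illustrates only the underlying expansion, not the Enhanced TED's bit-level or word-level Boolean extensions.

```python
import sympy as sp


def taylor_children(expr, var):
    """Decompose a polynomial expression into its TED children with respect to var:
    expr = c0 + var*c1 + var**2*c2 + ...  (finite because expr is polynomial in var)."""
    children = []
    deriv, k = expr, 0
    while deriv != 0:
        # k-th child: the k-th derivative evaluated at var = 0, divided by k!
        children.append(sp.expand(deriv.subs(var, 0) / sp.factorial(k)))
        deriv = sp.diff(deriv, var)
        k += 1
    return children


x, y = sp.symbols('x y')
f = (x + y) * (x + 2)
print(taylor_children(f, x))   # [2*y, y + 2, 1], i.e. f = 2*y + x*(y + 2) + x**2
```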