Search for: memory-access

    ISP: Using idle SMs in hardware-based prefetching

    , Article Proceedings - 17th CSI International Symposium on Computer Architecture and Digital Systems, CADS 2013 ; October , 2013 , Pages 3-8 ; 9781479905621 (ISBN) Falahati, H ; Abdi, M ; Baniasadi, A ; Hessabi, S ; Computer Society of Iran; IPM ; Sharif University of Technology
    IEEE Computer Society  2013
    Abstract
    The Graphics Processing Unit (GPU) is the most promising candidate platform for a faster rate of improvement in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general-purpose computing. However, supporting non-graphics computing on graphics processors requires addressing several architectural challenges. In this paper, we focus on improving performance by better hiding the long wait time for transferring data from the slow global memory. We therefore study an effective, low-overhead prefetching mechanism that utilizes idle processing elements. Our results show that we can potentially improve... 
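The abstract does not say how ISP forms its predictions. As a hedged illustration of the kind of address-prediction work a prefetcher could hand to an otherwise idle SM, the Python sketch below implements a generic per-PC stride prefetcher. The class name, the `degree` parameter and the 0x80-byte stride in the example are assumptions; nothing here models the GPU hardware or the paper's actual mechanism.

```python
class StridePrefetcher:
    """Per-instruction stride detector (a generic technique, not ISP itself).

    Once the same non-zero stride is seen twice in a row for a load PC,
    the next `degree` addresses along that stride are predicted.
    """

    def __init__(self, degree=2):
        self.degree = degree
        self.last_addr = {}    # pc -> most recent address
        self.last_stride = {}  # pc -> most recent stride

    def access(self, pc, addr):
        """Record a load from `pc` at `addr`; return addresses to prefetch."""
        prefetches = []
        if pc in self.last_addr:
            stride = addr - self.last_addr[pc]
            if stride != 0 and self.last_stride.get(pc) == stride:
                prefetches = [addr + stride * k for k in range(1, self.degree + 1)]
            self.last_stride[pc] = stride
        self.last_addr[pc] = addr
        return prefetches


pf = StridePrefetcher(degree=2)
for a in (0x1000, 0x1080, 0x1100, 0x1180):
    print(hex(a), [hex(p) for p in pf.access(pc=0x40, addr=a)])
```

The first two accesses only train the detector; from the third access on, each load triggers prefetches for the next two addresses along the detected stride.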

    Power-efficient prefetching on GPGPUs

    , Article Journal of Supercomputing ; Volume 71, Issue 8 , August , 2015 , pp. 2808-2829 ; ISSN: 09208542 Falahati, H ; Hessabi, S ; Abdi, M ; Baniasadi, A ; Sharif University of Technology
    Abstract
    The graphics processing unit (GPU) is the most promising candidate platform for achieving faster improvements in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general-purpose computing. However, supporting non-graphics computing on graphics processors requires addressing several architectural challenges. In this paper, we focus on improving performance by better hiding the long wait time for transferring data from the slow global memory. Furthermore, we show that the proposed method can reduce power and energy. Reducing the access time to off-chip data plays a noticeable role in reducing... 

    Reducing access latency of MLC PCMs through line striping

    , Article Proceedings - International Symposium on Computer Architecture ; Article number 6853228 , 14-18 June , 2014 , p. 277-288 ; ISSN: 10636897 ; ISBN: 9781479943968 Hoseinzadeh, M ; Arjomand, M ; Sarbazi-Azad, H ; Sharif University of Technology
    Abstract
    Although phase change memory with multi-bit storage capability (known as MLC PCM) offers a good combination of high bit density and non-volatility, its performance is severely impacted by the increased read/write latency. For read operations, access latency increases almost linearly with cell density (the number of bits stored in a cell). Since reads are latency-critical, they can seriously impact system performance. This paper alleviates the problem of slow reads in MLC PCM by exploiting a fundamental property of MLC devices: the Most-Significant Bit (MSB) of MLC cells can be read as fast as SLC cells, while reading the Least-Significant Bits (LSBs) is slower. We propose... 
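One common way to see why the MSB is fast is to model a 2-bit cell read as a binary search over four resistance levels: the first comparison against the middle reference already decides the MSB, and only the LSB needs a second step. The sketch below assumes that model, plus a hypothetical layout that places one line in the MSBs and another in the LSBs of the same cells; the abstract is truncated before the paper's own striping scheme, so this is not the authors' design.

```python
def read_cell(level, need_lsb):
    """Sense a 2-bit cell at resistance level 0..3; return (msb, lsb, steps).

    Hypothetical sensing model: lsb is None when only the MSB was requested.
    """
    msb = 1 if level >= 2 else 0          # step 1: middle reference decides the MSB
    if not need_lsb:
        return msb, None, 1               # done after one step, as in an SLC read
    lsb = 1 if level >= (3 if msb else 1) else 0  # step 2: upper or lower reference
    return msb, lsb, 2


def store(msb_line, lsb_line):
    """Hypothetical layout: pack one line into MSBs and another into LSBs."""
    return [2 * m + l for m, l in zip(msb_line, lsb_line)]


def read_line(cells, msb_only):
    """Sense all cells in parallel; latency is the slowest cell's step count."""
    results = [read_cell(level, need_lsb=not msb_only) for level in cells]
    msbs = [m for m, _, _ in results]
    lsbs = None if msb_only else [l for _, l, _ in results]
    latency = max(s for _, _, s in results)
    return msbs, lsbs, latency


cells = store([1, 0, 1, 1], [0, 0, 1, 0])
print(cells)                             # [2, 0, 3, 2]
print(read_line(cells, msb_only=True))   # ([1, 0, 1, 1], None, 1)
print(read_line(cells, msb_only=False))  # ([1, 0, 1, 1], [0, 0, 1, 0], 2)
```

Under this model, whichever data lives in the MSBs is returned after a single sensing step, which is the property the abstract says the paper exploits.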

    Data-Aware compression of neural networks

    , Article IEEE Computer Architecture Letters ; Volume 20, Issue 2 , 2021 , Pages 94-97 ; 15566056 (ISSN) Falahati, H ; Peyro, M ; Amini, H ; Taghian, M ; Sadrosadati, M ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2021
    Abstract
    Deep neural networks (DNNs) are getting deeper and larger, which intensifies their data movement and compute demands. Prior work focuses on reducing data movement and computation by exploiting sparsity and similarity. However, none of it exploits input similarity; it focuses only on sparsity and weight similarity. By synergistically analysing the similarity and sparsity of both inputs and weights, we show that memory accesses and computations can be reduced 5.7× and 4.1× more, respectively, than by exploiting only sparsity, and 3.9× and 2.1× more than by exploiting only weight similarity. We propose a new data-aware compression approach, called DANA, to effectively... 
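To make "exploiting input similarity and sparsity" concrete for a single neuron, the sketch below skips zero inputs and groups equal input values, summing their weights so that each distinct value is multiplied only once. It assumes quantized activations, where exact repeats are common, and counts only multiplications; it is a generic illustration, not the DANA scheme, and the example vectors are made up.

```python
from collections import defaultdict


def dot_naive(inputs, weights):
    """Plain dot product; returns (result, number of multiplications)."""
    return sum(x * w for x, w in zip(inputs, weights)), len(inputs)


def dot_similarity_aware(inputs, weights):
    """Skip zero inputs (sparsity) and group equal inputs (input similarity).

    Weights that share an input value are summed first, so each distinct
    non-zero input is multiplied once. Returns (result, multiplications).
    """
    groups = defaultdict(int)  # input value -> sum of its weights (starts at 0)
    for x, w in zip(inputs, weights):
        if x != 0:
            groups[x] += w
    result = sum(x * wsum for x, wsum in groups.items())
    return result, len(groups)


inputs = [0, 3, 3, 0, 5, 3, 5, 0]    # hypothetical quantized activations
weights = [2, -1, 4, 7, 1, 2, -3, 5]
print(dot_naive(inputs, weights))             # (5, 8)
print(dot_similarity_aware(inputs, weights))  # (5, 2)
```

Both versions compute the same result; the grouped version trades six multiplications for additions, which is the kind of saving the abstract attributes to combining sparsity with input similarity.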

    A case for PIM support in general-purpose compilers

    , Article IEEE Design and Test ; 2021 ; 21682356 (ISSN) Sadeghi, P ; Ejlali, A ; Sharif University of Technology
    IEEE Computer Society  2021
    Abstract
    Newly developed 3D die-stacking technologies afford us the possibility of revisiting the idea of Processing-in-Memory (PIM), as implementation hurdles are overcome. We now have the opportunity to offload the data-intensive parts of our programs to the PIM in the form of kernels to take advantage of the high internal bandwidth of the memory modules. Memory access latency and bandwidth are two major bottlenecks in today’s high-performance computers, and new use cases are moving towards this mode of computing faster than ever before. With new graph processing and neural network applications being developed every day, having a performance model of such systems helps in predicting the behavior... 
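As a hedged illustration of what a simple performance model for PIM offloading might look like, the sketch below uses a roofline-style estimate: a kernel's time on a device is bounded either by its memory traffic divided by the bandwidth that device sees, or by its operation count divided by the device's compute rate. The `Device` fields, the `should_offload` helper and all numbers are hypothetical; this is not the model proposed by the authors.

```python
from dataclasses import dataclass


@dataclass
class Device:
    bandwidth: float  # bytes per second this device can stream (hypothetical)
    compute: float    # operations per second (hypothetical)


def kernel_time(dev, bytes_moved, ops):
    """Roofline-style estimate: the kernel is either memory- or compute-bound."""
    return max(bytes_moved / dev.bandwidth, ops / dev.compute)


def should_offload(host, pim, bytes_moved, ops):
    """Offload when the estimated PIM time beats the estimated host time."""
    return kernel_time(pim, bytes_moved, ops) < kernel_time(host, bytes_moved, ops)


host = Device(bandwidth=50e9, compute=1e12)  # off-chip bandwidth, fast cores
pim = Device(bandwidth=400e9, compute=50e9)  # high internal bandwidth, weak logic

print(should_offload(host, pim, bytes_moved=8e9, ops=1e9))   # streaming kernel: True
print(should_offload(host, pim, bytes_moved=1e8, ops=1e12))  # compute-heavy kernel: False
```

Even this crude model captures the trade-off described in the abstract: data-intensive kernels benefit from the memory's internal bandwidth, while compute-heavy kernels stay on the host.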

    Data Sharing Aware Scheduling for Reducing Memory Accesses in GPGPUs

    , M.Sc. Thesis Sharif University of Technology Saber Latibari, Banafsheh (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    Access to global memory is one of the performance and energy bottlenecks in GPUs. Graphics processors use multithreading in streaming multiprocessors to hide memory access latency. However, due to the high number of concurrent memory requests, the bandwidth of the lower-level memories and the interconnection network quickly saturates. Recent research suggests that adjacent thread blocks share a significant amount of data blocks. If adjacent thread blocks are assigned to the same streaming multiprocessor, the shared data blocks can be reused by these thread blocks. However, the thread block scheduler assigns adjacent thread blocks to different streaming multiprocessors, which increases... 
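The effect described above can be sketched by counting global-memory fetches under two thread-block placements. The sketch assumes, hypothetically, that block b reads data blocks b and b+1 and that each SM's cache keeps everything it has loaded, so a fetch is one distinct (SM, data block) pair. It contrasts round-robin placement with a simple clustered placement; neither is claimed to be the thesis's scheduler.

```python
def round_robin(num_blocks, num_sms):
    """Baseline placement: consecutive thread blocks go to different SMs."""
    return {b: b % num_sms for b in range(num_blocks)}


def clustered(num_blocks, num_sms, cluster=2):
    """Sharing-aware placement: each run of `cluster` adjacent blocks shares an SM."""
    return {b: (b // cluster) % num_sms for b in range(num_blocks)}


def touched(block):
    """Hypothetical access pattern: block b reads data blocks b and b + 1."""
    return (block, block + 1)


def global_fetches(assignment):
    """Distinct (SM, data block) pairs, assuming an SM never evicts a block."""
    return len({(sm, d) for b, sm in assignment.items() for d in touched(b)})


print(global_fetches(round_robin(8, 4)))  # 16
print(global_fetches(clustered(8, 4)))    # 12
```

Placing neighbours on the same SM lets the data block they share be fetched once instead of twice, which is the reuse opportunity the abstract describes.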

    A reliable 3D MLC PCM architecture with resistance drift predictor

    , Article Proceedings of the International Conference on Dependable Systems and Networks ; 23-26 June , 2014 , pp. 204-215 ; ISBN: 9781479922338 Jalili, M ; Arjomand, M ; Azad, H. S ; Sharif University of Technology
    Abstract
    In this paper, we study the problem of resistance drift in MLC Phase Change Memory (PCM) and propose a solution to circumvent its thermally accelerated rate in 3D CMPs. Our scheme is based on the observation that, instead of alleviating the problem of resistance drift by using large margins or error correction codes, the PCM read circuit can be reconfigured dynamically to tolerate most resistance drift errors. Through detailed characterization of memory access patterns for 22 applications, we propose an efficient mechanism to facilitate such a reliable read scheme via tolerating (a) early-cycle resistance drifts by using narrow margins so that considerably saving... 
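Resistance drift in PCM is commonly modeled by the power law R(t) = R0 (t/t0)^ν, where ν is a drift exponent. The sketch below uses hypothetical per-level resistances and exponents to show why fixed read references misread a drifted cell, and how shifting the references with the time elapsed since the write (one simple way a read circuit could be "reconfigured") recovers the stored level. It is not the paper's read scheme, which the abstract describes only in part.

```python
import math

NOMINAL_LOG_R = [3.0, 4.0, 5.0, 6.0]  # hypothetical log10(ohms) right after a write
DRIFT_NU = [0.001, 0.02, 0.08, 0.10]  # hypothetical drift exponent for each level


def drifted_log_r(level, t, t0=1.0):
    """Power-law drift R(t) = R0 * (t/t0)**nu, expressed in log10 form."""
    return NOMINAL_LOG_R[level] + DRIFT_NU[level] * math.log10(t / t0)


def thresholds(t=None, t0=1.0):
    """Read references at midpoints between adjacent expected levels.

    With t=None the references are fixed at the freshly written values;
    otherwise they follow the expected drift after t seconds.
    """
    if t is None:
        centers = NOMINAL_LOG_R
    else:
        centers = [drifted_log_r(lv, t, t0) for lv in range(len(NOMINAL_LOG_R))]
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


def sense(log_r, refs):
    """Decoded level = number of references the cell's resistance exceeds."""
    return sum(log_r > ref for ref in refs)


t = 1e7                            # seconds since the write (hypothetical)
cell = drifted_log_r(2, t)         # a cell programmed to level 2
print(sense(cell, thresholds()))   # fixed references: 3 (misread)
print(sense(cell, thresholds(t)))  # drift-aware references: 2
```

Because higher-resistance levels drift faster, a fixed reference eventually sits below a drifted level-2 cell; moving the references with elapsed time keeps the decision boundaries between the drifted levels.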

    A survey of medical image registration on multicore and the GPU

    , Article IEEE Signal Processing Magazine ; Volume 27, Issue 2 , 2010 , Pages 50-60 ; 10535888 (ISSN) Shams, R ; Sadeghi, P ; Kennedy, R ; Hartley, R ; Sharif University of Technology
    2010
    Abstract
    In this article, we look at early, recent, and state-of-the-art methods for registration of medical images using a range of high-performance computing (HPC) architectures including symmetric multiprocessing (SMP), massively multiprocessing (MMP), and architectures with distributed memory (DM), and nonuniform memory access (NUMA). The article is designed to be self-sufficient. We will take the time to define and describe concepts of interest, albeit briefly, in the context of image registration and HPC. We provide an overview of the registration problem and its main components in the section "Registration." Our main focus will be HPC-related aspects, and we will highlight relevant issues as... 

    A table-based application-specific prefetch engine for object-oriented embedded systems

    , Article 2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, IC-SAMOS 2006, Samos, 17 July 2006 through 20 July 2006 ; 2006 , Pages 7-13 ; 1424401550 (ISBN); 9781424401550 (ISBN) Hessabi, S ; Modarressi, M ; Goudarzi, M ; Javanhemmat, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2006
    Abstract
    A table-based application-specific data prefetching mechanism is presented in this paper. This mechanism is proposed to improve the performance of the application-specific instruction-set processors (ASIPs) that we develop, customized to an object-oriented application. In this approach, we divide the data accesses of a class method into two parts: conditional and unconditional. We supply the prefetch engine with static information about each part to prefetch all data fields of an object required by a class method when the class method is invoked. Effective management of memory access patterns by dividing them based on the method to which they belong and storing the access information of nested... 
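The table organization described above can be sketched as a map from class method to the field offsets it reads, split into an unconditional group that is prefetched on invocation and conditional groups that are prefetched once a branch outcome is known. When the conditional groups are issued, the method name `Account.deposit`, the branch label and the offsets are all hypothetical; the abstract does not give the engine's actual table format or timing.

```python
from dataclasses import dataclass, field


@dataclass
class MethodEntry:
    unconditional: list[int]  # field offsets the method always reads
    conditional: dict[str, list[int]] = field(default_factory=dict)  # branch -> offsets


class PrefetchTable:
    """Hypothetical table: class method -> static field-access information."""

    def __init__(self):
        self.entries: dict[str, MethodEntry] = {}

    def on_invoke(self, method, obj_base):
        """Addresses of the unconditional fields, issued when the method is called."""
        entry = self.entries.get(method)
        return [] if entry is None else [obj_base + off for off in entry.unconditional]

    def on_branch(self, method, branch, obj_base):
        """Addresses of one conditional group, issued once its branch is known."""
        entry = self.entries.get(method)
        if entry is None:
            return []
        return [obj_base + off for off in entry.conditional.get(branch, [])]


table = PrefetchTable()
table.entries["Account.deposit"] = MethodEntry(
    unconditional=[0, 8], conditional={"overdraft": [24, 32]}
)
print(table.on_invoke("Account.deposit", 0x2000))               # [8192, 8200]
print(table.on_branch("Account.deposit", "overdraft", 0x2000))  # [8216, 8224]
```

Keying the table by method keeps each entry small, since it only needs the offsets that one method touches relative to the object's base address.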

    A case for PIM support in general-purpose compilers

    , Article IEEE Design and Test ; Volume 39, Issue 2 , 2022 , Pages 84-89 ; 21682356 (ISSN) Sadeghi, P ; Ejlali, A ; Sharif University of Technology
    IEEE Computer Society  2022
    Abstract
    This work presents a case for general support for processing-in-memory (PIM) in compilers and puts forth an approach to address it, along with a simple model. The ultimate goal of the work is to implement these features in a general-purpose compiler that can compile for any homogeneous ISA system, so that the benefits of PIM are not limited to niche use cases. © 2013 IEEE