Search for: gpgpu

    Viterbi Decoder Implementation on GPGPU

    , M.Sc. Thesis Sharif University of Technology Mohammadidoost, Alireza (Author) ; Hashemi, Matin (Supervisor)
    Abstract
    In this project, a method is employed to implement a Viterbi decoder on a GPGPU. The method is based on combining all steps of the algorithm. This combination poses challenges stemming from the differences between the algorithm's steps, so in this project solutions are devised to handle these challenges, and a high-throughput Viterbi decoder is achieved  
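For orientation, the Viterbi algorithm the abstract refers to has three per-stage steps: branch-metric computation, add-compare-select (ACS), and traceback. The following is a minimal, CPU-side sketch of those steps (a hypothetical illustration, not the thesis implementation; on a GPGPU the ACS loop over states would typically map to parallel threads):

```python
def viterbi_decode(observations, n_states, trans_cost, emit_cost):
    """Return the minimum-cost state path through the trellis.
    trans_cost[i][j]: cost of moving from state i to state j.
    emit_cost[j][o]:  cost of state j emitting observation o.
    """
    T = len(observations)
    path_cost = [emit_cost[s][observations[0]] for s in range(n_states)]
    backptr = [[0] * n_states for _ in range(T)]
    for t in range(1, T):
        new_cost = [0.0] * n_states
        for j in range(n_states):
            # Add-compare-select: pick the cheapest predecessor state.
            # On a GPU, one thread per trellis state would run this in parallel.
            best_prev = min(range(n_states),
                            key=lambda i: path_cost[i] + trans_cost[i][j])
            backptr[t][j] = best_prev
            new_cost[j] = (path_cost[best_prev] + trans_cost[best_prev][j]
                           + emit_cost[j][observations[t]])
        path_cost = new_cost
    # Traceback from the cheapest final state.
    state = min(range(n_states), key=lambda s: path_cost[s])
    path = [state]
    for t in range(T - 1, 0, -1):
        state = backptr[t][state]
        path.append(state)
    return path[::-1]
```

Combining these steps, as the thesis does, avoids launching separate kernels (and round-tripping intermediate data) for the metric, ACS, and traceback phases.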

    ISP: Using idle SMs in hardware-based prefetching

    , Article Proceedings - 17th CSI International Symposium on Computer Architecture and Digital Systems, CADS 2013 ; October , 2013 , Pages 3-8 ; 9781479905621 (ISBN) Falahati, H ; Abdi, M ; Baniasadi, A ; Hessabi, S ; Computer Society of Iran; IPM ; Sharif University of Technology
    IEEE Computer Society  2013
    Abstract
    The Graphics Processing Unit (GPU) is the most promising candidate platform for a faster rate of improvement in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general purpose computing. However, supporting non-graphics computing on graphics processors requires addressing several architecture challenges. In this paper, we focus on improving performance by better hiding the long waiting time to transfer data from the slow global memory. We therefore study an effective, light-overhead prefetching mechanism that utilizes idle processing elements. Our results show that we can potentially improve... 

    ASHA: An adaptive shared-memory sharing architecture for multi-programmed GPUs

    , Article Microprocessors and Microsystems ; Volume 46 , 2016 , Pages 264-273 ; 01419331 (ISSN) Abbasitabar, H ; Samavatian, M. H ; Sarbazi Azad, H ; Sharif University of Technology
    Elsevier B.V  2016
    Abstract
    Spatial multi-programming is one of the most efficient multi-programming methods on Graphics Processing Units (GPUs). This multi-programming scheme leads to variation in the resource requirements of streaming multiprocessors (SMs) and creates opportunities for sharing unused portions of each SM's resources with other SMs. Although this approach drastically improves GPU performance, in some cases it leads to performance degradation due to the shortage of resources allocated to each program. Considering shared memory as one of the main bottlenecks of thread-level parallelism (TLP), in this paper we propose an adaptive shared-memory sharing architecture, called ASHA. ASHA enhances spatial... 

    Power Reduction in GPUs through Intra-Warp Instruction Execution Reordering

    , M.Sc. Thesis Sharif University of Technology Aghilinasab, Homa (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    As technology shrinks, static power consumption is getting worse. Moreover, considering the high usage of General-Purpose Graphics Processing Units (GPGPUs), reducing their static power is becoming an important issue. Execution units are among the most power-hungry components of a GPGPU and play an essential role in its total power consumption. Power gating, on the other hand, is a promising method to reduce static power consumption. In this project, we propose a novel method to apply power gating to execution units with negligible performance and power overheads. We utilize out-of-order execution within a warp to keep the power-gated resources in the off state more than... 
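The idea behind reordering for power gating can be shown with a toy model (my own illustration, not the thesis method): a unit can only be gated during idle windows longer than a breakeven time, so clustering a unit's instructions together lengthens its idle windows and increases the cycles eligible for gating.

```python
def idle_windows(trace, unit):
    """Lengths of consecutive cycles in which `unit` is idle.
    `trace` is a per-cycle list of which execution unit is busy."""
    windows, run = [], 0
    for u in trace:
        if u == unit:
            if run:
                windows.append(run)
            run = 0
        else:
            run += 1
    if run:
        windows.append(run)
    return windows

def gated_cycles(trace, unit, breakeven=3):
    # Only idle windows at least as long as the breakeven time can be
    # power-gated profitably; shorter windows cost more than they save.
    return sum(w for w in idle_windows(trace, unit) if w >= breakeven)
```

With an interleaved trace like `['SFU','ALU','SFU','ALU','SFU','ALU']`, the SFU's idle windows are all one cycle long and nothing can be gated; reordering to `['SFU','SFU','SFU','ALU','ALU','ALU']` produces one three-cycle window that meets the breakeven threshold.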

    Improving Performance of GPGPU Considering Reliability Requirements

    , M.Sc. Thesis Sharif University of Technology Motallebi, Maryam (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    In recent years, GPUs have become ideal candidates for processing a variety of high-performance applications. By relying on thousands of concurrent threads and the computational power of large numbers of computing units, GPGPUs provide high efficiency and throughput. To achieve the potential computational power of GPGPUs on broader types of applications, we need to apply some modifications. By understanding the features and properties of applications, we can execute them in a more suitable way on GPUs. Therefore, considering applications' behavior, we define five categories for them. Every category has its own characteristics, and we change the configuration of the GPU... 

    Design and Hardware Implementation of Optical Character Recognition

    , M.Sc. Thesis Sharif University of Technology Dezfuli, Sina (Author) ; Hashemi, Matin (Supervisor)
    Abstract
    The objective of OCR systems is to retrieve machine-encoded text from a raster image. Despite the abundance of powerful OCR algorithms for English, there are not many for Farsi. Our proposed algorithm comprises pre-processing, line detection, sub-word detection and segmentation, feature extraction, and classification. Furthermore, hardware implementation and acceleration of this system on a GPGPU is presented. The algorithm was tested on 5 fonts, including Titr, Lotus, Yekan, Koodak and Nazanin, and an average accuracy above 90% was achieved  

    A Reconfigurable and Adaptive Shared-memory Architecture for GPUs

    , M.Sc. Thesis Sharif University of Technology Abbasitabar, Hamed (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    The importance of shared memory (scratchpad memory) in GPGPU programming, the limited memory size of GPGPUs, and the influence of shared memory on overall GPGPU performance have motivated its optimization. Moreover, the trend in new GPGPU designs shows that the ratio of shared memory to processing elements is shrinking. As a result, the limited capacity of shared memory becomes a bottleneck that prevents a GPU from hosting a high number of thread blocks, limiting the otherwise available thread-level parallelism (TLP). In this thesis we introduce a reconfigurable and adaptive shared memory architecture for GPGPUs based on resource sharing, which can be exploited for throughput improvement... 

    Energy Reduction in GPGPUs

    , Ph.D. Dissertation Sharif University of Technology Falahati, Hajar (Author) ; Hessabi, Shahin (Supervisor) ; Baniasadi, Amirali (Co-Advisor)
    Abstract
    The number of transistors on a single chip is growing exponentially, which results in a huge increase in consumed power and temperature. Parallel processing is a solution which concentrates on increasing the number of cores instead of improving single-thread performance. Graphics Processing Units (GPUs) are parallel accelerators which are categorized as many-core systems. However, recent research shows that their consumed power and energy are increasing. In this research, we aim to propose methods to make GPGPUs energy efficient. In this regard, we evaluated the detailed power consumption of GPGPUs. Our results show that the memory sub-system is a critical bottleneck in terms of performance and... 

    Data Sharing Aware Scheduling for Reducing Memory Accesses in GPGPUs

    , M.Sc. Thesis Sharif University of Technology Saber Latibari, Banafsheh (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    Access to global memory is one of the bottlenecks in performance and energy in GPUs. Graphics processors use multithreading in streaming multiprocessors to reduce memory access latency. However, due to the high number of concurrent memory requests, the memory bandwidth of lower-level memories and the interconnection network is quickly saturated. Recent research suggests that adjacent thread blocks share a significant amount of data blocks. If adjacent thread blocks are assigned to a specific streaming multiprocessor, shared data blocks can be reused by these thread blocks. However, the thread block scheduler assigns adjacent thread blocks to different streaming multiprocessors, which increases... 
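The effect of sharing-aware scheduling can be illustrated with a small model (a hypothetical sketch of the idea, not the thesis scheduler): when adjacent thread blocks that share a data tile land on the same SM, the second block can reuse the resident tile instead of fetching it from global memory again.

```python
def round_robin_schedule(n_blocks, n_sms):
    # Baseline: adjacent thread blocks land on different SMs.
    return [b % n_sms for b in range(n_blocks)]

def affinity_schedule(n_blocks, n_sms, group=2):
    # Sharing-aware: groups of adjacent blocks go to the same SM.
    return [(b // group) % n_sms for b in range(n_blocks)]

def global_fetches(schedule, shared_with_next):
    """Count global-memory fetches, assuming each block needs one data tile
    and adjacent blocks that share a tile can reuse it only on the same SM."""
    fetches = 0
    for b, sm in enumerate(schedule):
        if b > 0 and shared_with_next[b - 1] and schedule[b - 1] == sm:
            continue  # tile already resident on this SM; no global fetch
        fetches += 1
    return fetches
```

In this toy model, eight blocks that all share data with their neighbor cost eight fetches under round-robin scheduling but only four when adjacent pairs are co-scheduled.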

    Unifying L1 Data Cache and Shared Memory in GPUs

    , M.Sc. Thesis Sharif University of Technology Yousefzadeh-Asl-Miandoab, Ehsan (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    Graphics Processing Units (GPUs) employ a scratchpad memory (a.k.a. shared memory) in each streaming multiprocessor to accelerate data sharing among the threads in a thread block and to provide a software-managed cache for programmers. However, we observe that about 60% of GPU workloads from several well-known benchmark suites do not use shared memory. Moreover, among those workloads that do use shared memory, about 42% of the shared memory is not utilized, on average. On the other hand, we observe that many general-purpose GPU applications suffer from the low hit rate and limited bandwidth of the L1 data cache. We aim to use the shared memory space and its corresponding bandwidth for improving the L1 data cache,... 

    Mitigating Memory Access Overhead in GPUs Through Reproduction of Intermediate Results

    , M.Sc. Thesis Sharif University of Technology Barati, Rahil (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    GPUs employ large register files to reduce the performance and energy overhead of memory accesses by improving thread-level parallelism and reducing the number of data movements from off-chip memory. Recently, the latency-tolerant register file (LTRF) was proposed to enable high-capacity register files with low power and area cost. LTRF is a two-level register file in which the first level is a small, fast register cache and the second level is a large, slow main register file. LTRF uses a near-perfect register prefetching mechanism in which warp registers are prefetched from the main register file to the register file cache before the warp is scheduled, hiding the register prefetching... 

    Efficient Acceleration of Large-scale Graph Algorithms

    , M.Sc. Thesis Sharif University of Technology Gholami Shahrouz, Soheil (Author) ; Saleh Kaleybar, Saber (Supervisor) ; Hashemi, Matin (Supervisor)
    Abstract
    Given a social network modeled as a weighted graph G, the influence maximization problem seeks k vertices to become initially influenced, to maximize the expected number of influenced nodes under a particular diffusion model. The influence maximization problem has been proven to be NP-hard, and most proposed solutions to the problem are approximate greedy algorithms, which can guarantee a tunable approximation ratio for their results with respect to the optimal solution. The state-of-the-art algorithms are based on Reverse Influence Sampling (RIS) technique, which can offer both computational efficiency and non-trivial (1-1/e-ϵ)-approximation ratio guarantee for any ϵ>0. RIS-based... 
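The RIS technique described above can be sketched compactly (an illustrative toy, not the thesis implementation; the graph, `p`, and sample count are made-up parameters): sample reverse-reachable (RR) sets under the independent-cascade model, then greedily pick the k nodes covering the most RR sets.

```python
import random

def random_rr_set(nodes, in_neighbors, p):
    """Sample one reverse-reachable set under the independent-cascade model:
    start from a random node and traverse incoming edges, each kept
    independently with probability p."""
    root = random.choice(nodes)
    rr, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for u in in_neighbors.get(v, []):
            if u not in rr and random.random() < p:
                rr.add(u)
                frontier.append(u)
    return rr

def ris_influence_max(nodes, in_neighbors, k, p=0.1, n_samples=2000, seed=0):
    random.seed(seed)
    rr_sets = [random_rr_set(nodes, in_neighbors, p) for _ in range(n_samples)]
    seeds, covered = set(), [False] * n_samples
    for _ in range(k):
        # Greedy step: pick the node appearing in the most uncovered RR sets.
        counts = {}
        for i, rr in enumerate(rr_sets):
            if not covered[i]:
                for u in rr:
                    if u not in seeds:
                        counts[u] = counts.get(u, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)
        seeds.add(best)
        for i, rr in enumerate(rr_sets):
            if best in rr:
                covered[i] = True
    return seeds
```

The expensive part, sampling many RR sets, is embarrassingly parallel, which is what makes GPU acceleration of RIS-based algorithms attractive.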

    Power-efficient prefetching on GPGPUs

    , Article Journal of Supercomputing ; Volume 71, Issue 8 , August , 2015 , pp. 2808-2829 ; ISSN: 09208542 Falahati, H ; Hessabi, S ; Abdi, M ; Baniasadi, A ; Sharif University of Technology
    Abstract
    The graphics processing unit (GPU) is the most promising candidate platform for achieving faster improvements in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general purpose computing. However, supporting non-graphics computing on graphics processors requires addressing several architectural challenges. In this paper, we focus on improving performance by better hiding long waiting time for transferring data from the slow global memory. Furthermore, we show that the proposed method can reduce power and energy. Reduction in access time to off-chip data has a noticeable role in reducing... 

    Architecting the last-level cache for GPUs using STT-RAM technology

    , Article Transactions on Design Automation of Electronic Systems ; Volume 20, Issue 4 , 2015 ; 10844309 (ISSN) Samavatian, M. H ; Arjomand, M ; Bashizade, R ; Sarbazi Azad, H ; Sharif University of Technology
    Abstract
    Future GPUs should have larger L2 caches, based on current trends in VLSI technology and GPU architectures toward increasing processing core counts. Larger L2 caches inevitably have proportionally larger power consumption. In this article, having investigated the behavior of GPGPU applications, we present an efficient L2 cache architecture for GPUs based on STT-RAM technology. Due to its high-density and low-power characteristics, STT-RAM technology can be utilized in GPUs, where numerous cores leave a limited area for on-chip memory banks. STT-RAMs have, however, two important issues, the high energy and latency of write operations, that have to be addressed. Low-retention-time STT-RAMs can... 

    An efficient STT-Ram last level cache architecture for GPUs

    , Article Proceedings - Design Automation Conference ; 2-5 June , 2014 , pp. 1-6 ; ISSN: 0738100X ; ISBN: 9781479930173 Samavatian, M. H ; Abbasitabar, H ; Arjomand, M ; Sarbazi-Azad, H ; Sharif University of Technology
    Abstract
    In this paper, having investigated the behavior of GPGPU applications, we present an efficient L2 cache architecture for GPUs based on STT-RAM technology. As processing core counts increase, larger on-chip memories are required. Due to its high density and low power characteristics, STT-RAM technology can be utilized in GPUs, where numerous cores leave a limited area for on-chip memory banks. STT-RAMs have, however, two important issues, the high energy and latency of write operations, that have to be addressed. Low-retention-time STT-RAMs can reduce the energy and delay of write operations. However, employing STT-RAMs with low retention time in GPUs requires a thorough investigation on... 

    A Novel STT-RAM Architecture for Last Level Shared Caches in GPUs

    , M.Sc. Thesis Sharif University of Technology Samavatian, Mohammad Hossein (Author) ; Sarbazi-Azad, Hamid (Supervisor)
    Abstract
    Due to the high processing capacity of GPGPUs and their need for a large, high-speed memory shared among thread-processor clusters, exploiting Spin-Transfer Torque (STT) RAM as a replacement for SRAM can yield a significant reduction in power consumption and a linear increase in memory capacity in GPGPUs. In a GPGPU (as a many-core processor capable of parallel thread execution), the advantages of STT-RAM technology, such as low read latency and high density, can be very effective. However, STT-RAM only guarantees reduced application run time and increased thread throughput when write operations are managed and scheduled so as to impose the least overhead on read operations. The... 

    Efficient nearest-neighbor data sharing in GPUs

    , Article ACM Transactions on Architecture and Code Optimization ; Volume 18, Issue 1 , 2021 ; 15443566 (ISSN) Nematollahi, N ; Sadrosadati, M ; Falahati, H ; Barkhordar, M ; Drumond, M. P ; Sarbazi Azad, H ; Falsafi, B ; Sharif University of Technology
    Association for Computing Machinery  2021
    Abstract
    Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its own value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data...
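A minimal stencil example makes the sharing pattern concrete (an illustrative sketch with made-up weights, not the article's code): each output point reads its own value and its two nearest neighbors, so adjacent iterations re-read the same elements.

```python
def stencil_3point(a, w_left=0.25, w_center=0.5, w_right=0.25):
    """1-D 3-point stencil: each interior output is a weighted sum of a
    point and its two nearest neighbors; boundary points pass through."""
    n = len(a)
    out = list(a)
    for i in range(1, n - 1):
        # Iterations i and i+1 both read a[i] and a[i+1]: on a GPU, these
        # redundant loads by neighboring threads are exactly what shared
        # memory, shuffle instructions, and on-chip caches try to absorb.
        out[i] = w_left * a[i - 1] + w_center * a[i] + w_right * a[i + 1]
    return out
```

For an n-point grid, naive per-thread loads fetch each interior element three times; neighbor-sharing mechanisms aim to bring that close to one fetch per element.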