Search for: graphics-processing-units
Total 34 records

    Viterbi Decoder Implementation on GPGPU

    , M.Sc. Thesis Sharif University of Technology Mohammadidoost, Alireza (Author) ; Hashemi, Matin (Supervisor)
    Abstract
    In this project, a method is employed to implement a Viterbi decoder on a GPGPU. The method is based on combining all steps of the algorithm. This combination poses challenges that stem from the differences between the algorithm's steps, so the project develops solutions to handle these challenges and achieves a high-throughput Viterbi decoder.
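
    The thesis text here gives no code; purely as an illustrative sketch (all names and the data layout are assumptions, not the author's implementation), the following CUDA kernel performs one add-compare-select (ACS) trellis step of Viterbi decoding with one thread per state, the kind of per-step work such a decoder parallelizes on a GPGPU:

        // Hypothetical sketch: one ACS step for a rate-1/n convolutional code with
        // numStates = 2^(K-1) states. Assumed layout (not from the thesis):
        // bm[2*s] is the branch metric of predecessor0(s) -> s, bm[2*s+1] of predecessor1(s) -> s.
        __global__ void viterbi_acs_step(const float* __restrict__ oldMetric,
                                         const float* __restrict__ bm,
                                         float* __restrict__ newMetric,
                                         int*   __restrict__ survivor,   // chosen predecessor per state
                                         int numStates)
        {
            int s = blockIdx.x * blockDim.x + threadIdx.x;
            if (s >= numStates) return;

            // In a standard shift-register trellis, state s has two predecessors.
            int p0 = s >> 1;
            int p1 = (s >> 1) | (numStates >> 1);

            float m0 = oldMetric[p0] + bm[2 * s];
            float m1 = oldMetric[p1] + bm[2 * s + 1];

            // Keep the better (smaller) path metric and remember which branch survived.
            bool takeFirst = (m0 <= m1);
            newMetric[s] = takeFirst ? m0 : m1;
            survivor[s]  = takeFirst ? p0 : p1;
        }

    The survivor array written per step would later be traversed backwards (traceback) to recover the decoded bit sequence.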

    ASHA: An adaptive shared-memory sharing architecture for multi-programmed GPUs

    , Article Microprocessors and Microsystems ; Volume 46 , 2016 , Pages 264-273 ; 01419331 (ISSN) Abbasitabar, H ; Samavatian, M. H ; Sarbazi Azad, H ; Sharif University of Technology
    Elsevier B.V  2016
    Abstract
    Spatial multi-programming is one of the most efficient multi-programming methods on Graphics Processing Units (GPUs). This multi-programming scheme generates variety in the resource requirements of streaming multiprocessors (SMs) and creates opportunities for sharing the unused portion of each SM's resources with other SMs. Although this approach drastically improves GPU performance, in some cases it leads to performance degradation due to the shortage of resources allocated to each program. Considering shared memory as one of the main bottlenecks of thread-level parallelism (TLP), in this paper we propose an adaptive shared-memory sharing architecture, called ASHA. ASHA enhances spatial... 

    3-point RANSAC for fast vision based rotation estimation using GPU technology

    , Article IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, 9 February 2017 ; 2017 , Pages 212-217 ; 9781467397087 (ISBN) Kamran, D ; Manzuri, M. T ; Marjovi, A ; Karimian, M ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2017
    Abstract
    In many sensor fusion algorithms, the vision-based RANdom Sample Consensus (RANSAC) method is used for estimating motion parameters for autonomous robots. Usually, such algorithms estimate translation and rotation parameters together, which makes them inefficient when only rotation estimation is needed. This paper presents a novel 3-point RANSAC algorithm for estimating only the rotation parameters between two camera frames, which can be utilized as a high-rate source of information for a camera-IMU sensor fusion system. The main advantage of our proposed approach is that it performs fewer computations and requires fewer iterations to achieve the best result. Despite many... 

    LTRF: enabling high-capacity register files for GPUs via hardware/software cooperative register prefetching

    , Article 23rd International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, 24 March 2018 through 28 March 2018 ; 2018 , Pages 489-502 ; 9781450349116 (ISBN) Sadrosadati, M ; Mirhosseini, A ; Ehsani, S. B ; Sarbazi Azad, H ; Drumond, M ; Falsafi, B ; Ausavarungnirun, R ; Mutlu, O ; Sharif University of Technology
    Association for Computing Machinery  2018
    Abstract
    Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file that reduces register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical... 

    Highly concurrent latency-tolerant register files for GPUs

    , Article ACM Transactions on Computer Systems ; Volume 37, Issue 1-4 , 2021 ; 07342071 (ISSN) Sadrosadati, M ; Mirhosseini, A ; Hajiabadi, A ; Ehsani, S. B ; Falahati, H ; Sarbazi Azad, H ; Drumond, M ; Falsafi, B ; Ausavarungnirun, R ; Mutlu, O ; Sharif University of Technology
    Association for Computing Machinery  2021
    Abstract
    Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file that reduces register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical... 

    Efficient nearest-neighbor data sharing in GPUs

    , Article ACM Transactions on Architecture and Code Optimization ; Volume 18, Issue 1 , 2021 ; 15443566 (ISSN) Nematollahi, N ; Sadrosadati, M ; Falahati, H ; Barkhordar, M ; Drumond, M. P ; Sarbazi Azad, H ; Falsafi, B ; Sharif University of Technology
    Association for Computing Machinery  2021
    Abstract
    Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its own value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches, and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data... 
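
    As a minimal sketch of the data sharing the paper describes (not the proposed design; all names are hypothetical, and the grid dimensions are assumed to be multiples of the tile size), a 5-point 2D stencil kernel that stages a halo tile in shared memory so neighboring threads reuse each other's loads could look like this in CUDA:

        // Hypothetical sketch: 5-point stencil with shared-memory tiling.
        // Launch with blockDim = (TILE, TILE); width and height assumed multiples of TILE.
        #define TILE 16

        __global__ void stencil5(const float* __restrict__ in, float* __restrict__ out,
                                 int width, int height)
        {
            __shared__ float tile[TILE + 2][TILE + 2];

            int gx = blockIdx.x * TILE + threadIdx.x;   // global coordinates
            int gy = blockIdx.y * TILE + threadIdx.y;
            int lx = threadIdx.x + 1;                   // local coordinates inside the halo tile
            int ly = threadIdx.y + 1;

            // Center element plus halo loads on the tile border (clamped at grid edges).
            tile[ly][lx] = in[gy * width + gx];
            if (threadIdx.x == 0)        tile[ly][0]        = in[gy * width + max(gx - 1, 0)];
            if (threadIdx.x == TILE - 1) tile[ly][TILE + 1] = in[gy * width + min(gx + 1, width - 1)];
            if (threadIdx.y == 0)        tile[0][lx]        = in[max(gy - 1, 0) * width + gx];
            if (threadIdx.y == TILE - 1) tile[TILE + 1][lx] = in[min(gy + 1, height - 1) * width + gx];
            __syncthreads();

            // Each output point is a function of its own value and its four nearest neighbors.
            out[gy * width + gx] = 0.2f * (tile[ly][lx] + tile[ly][lx - 1] + tile[ly][lx + 1] +
                                           tile[ly - 1][lx] + tile[ly + 1][lx]);
        }

    Without the shared-memory tile, each interior value would be read from global memory by up to five different threads; that redundancy is exactly the overhead the sharing mechanisms discussed above try to remove.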

    Power Reduction in GPUs through Intra-Warp Instruction Execution Reordering

    , M.Sc. Thesis Sharif University of Technology Aghilinasab, Homa (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    As technology shrinks, static power consumption is getting worse. Moreover, considering the widespread use of General-Purpose Graphics Processing Units (GPGPUs), reducing the static power of GPGPUs is becoming an important issue. Execution units are among the most power-hungry units in GPGPUs and play an essential role in their total power consumption. On the other hand, power gating is a promising method for reducing static power consumption. In this project, we propose a novel method to apply power gating to execution units with negligible performance and power overheads. We utilize out-of-order execution within a warp to keep the power-gated resources in the off state for more than... 

    Improving Performance of GPGPU Considering Reliability Requirements

    , M.Sc. Thesis Sharif University of Technology Motallebi, Maryam (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    In recent years, GPUs have become ideal candidates for processing a variety of high-performance applications. By relying on thousands of concurrent threads and the computational power of a large number of computing units, GPGPUs provide high efficiency and throughput. To realize the potential computational power of GPGPUs for a broader range of applications, we need to apply some modifications. By understanding the features and properties of applications, we can execute them more effectively on GPUs. Therefore, considering application behavior, we define five categories of applications. Each category has its own characteristics, and we change the configuration of the GPU... 

    Design and Hardware Implementation of Optical Character Recognition

    , M.Sc. Thesis Sharif University of Technology Dezfuli, Sina (Author) ; Hashemi, Matin (Supervisor)
    Abstract
    The objective of OCR systems is to retrieve machine-encoded text from a raster image. Despite the abundance of powerful OCR algorithms for English, there are not many for Farsi. Our proposed algorithm comprises pre-processing, line detection, sub-word detection and segmentation, feature extraction, and classification. Furthermore, the hardware implementation and acceleration of this system on a GPGPU are presented. The algorithm was tested on five fonts, including Titr, Lotus, Yekan, Koodak, and Nazanin, and an average accuracy above 90% was achieved.
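
    The thesis text here includes no code; purely as a hypothetical illustration of the kind of GPGPU kernel the pre-processing stage of such a pipeline might use (not the thesis implementation, and the fixed threshold is an assumed simplification), a binarization kernel could be:

        // Hypothetical sketch: fixed-threshold binarization of a grayscale raster
        // image, one thread per pixel. Real OCR pre-processing typically uses an
        // adaptive threshold; a constant one keeps the example short.
        __global__ void binarize(const unsigned char* __restrict__ gray,
                                 unsigned char* __restrict__ bin,
                                 int nPixels, unsigned char threshold)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= nPixels) return;
            bin[i] = (gray[i] < threshold) ? 0 : 255;   // text pixels become 0, background 255
        }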

    A Reconfigurable and Adaptive Shared-memory Architecture for GPUs

    , M.Sc. Thesis Sharif University of Technology Abbasitabar, Hamed (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    The importance of shared memory (scratchpad memory) in GPGPU programming, the memory size limits of GPGPUs, and the influence of shared memory on the overall performance of the GPGPU have motivated its performance optimization. Moreover, the trend in new GPGPU designs shows that the ratio of shared memory to processing elements is shrinking. As a result, the limited capacity of shared memory becomes a bottleneck for a GPU to host a high number of thread blocks, limiting the otherwise available thread-level parallelism (TLP). In this thesis, we introduce a reconfigurable and adaptive shared-memory architecture for GPGPUs based on resource sharing, which can be exploited for throughput improvement... 

    Energy Reduction in GPGPUs

    , Ph.D. Dissertation Sharif University of Technology Falahati, Hajar (Author) ; Hessabi, Shahin (Supervisor) ; Baniasadi, Amirali (Co-Advisor)
    Abstract
    The number of transistors on a single chip is growing exponentially, which results in a huge increase in power consumption and temperature. Parallel processing is a solution that concentrates on increasing the number of cores instead of improving single-thread performance. Graphics Processing Units (GPUs) are parallel accelerators categorized as many-core systems. However, recent research shows that their power and energy consumption are increasing. In this research, we aim to propose methods to make GPGPUs energy-efficient. In this regard, we evaluated the detailed power consumption of GPGPUs. Our results show that the memory sub-system is a critical bottleneck in terms of performance and... 

    Data Sharing Aware Scheduling for Reducing Memory Accesses in GPGPUs

    , M.Sc. Thesis Sharif University of Technology Saber Latibari, Banafsheh (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    Access to global memory is one of the bottlenecks in performance and energy in GPUs. Graphics processors use multithreading in streaming multiprocessors to hide memory access latency. However, due to the high number of concurrent memory requests, the memory bandwidth of lower-level memories and the interconnection network is quickly saturated. Recent research suggests that adjacent thread blocks share a significant number of data blocks. If adjacent thread blocks are assigned to the same streaming multiprocessor, shared data blocks can be reused by these thread blocks. However, the thread block scheduler assigns adjacent thread blocks to different streaming multiprocessors, which increases... 

    Unifying L1 Data Cache and Shared Memory in GPUs

    , M.Sc. Thesis Sharif University of Technology Yousefzadeh-Asl-Miandoab, Ehsan (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    Graphics Processing Units (GPUs) employ a scratchpad memory (a.k.a. shared memory) in each streaming multiprocessor to accelerate data sharing among the threads in a thread block and to provide a software-managed cache for programmers. However, we observe that about 60% of the GPU workloads in several well-known benchmark suites do not use shared memory. Moreover, among the workloads that do use shared memory, about 42% of the shared memory is not utilized, on average. On the other hand, we observe that many general-purpose GPU applications suffer from the low hit rate and limited bandwidth of the L1 data cache. We aim to use the shared memory space and its corresponding bandwidth to improve the L1 data cache,... 

    Mitigating Memory Access Overhead in GPUs Through Reproduction of Intermediate Results

    , M.Sc. Thesis Sharif University of Technology Barati, Rahil (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    GPUs employ large register files to reduce the performance and energy overhead of memory accesses by improving thread-level parallelism and reducing the number of data movements from off-chip memory. Recently, the latency-tolerant register file (LTRF) was proposed to enable high-capacity register files with low power and area cost. LTRF is a two-level register file in which the first level is a small, fast register cache and the second level is a large, slow main register file. LTRF uses a near-perfect register prefetching mechanism in which warp registers are prefetched from the main register file to the register file cache before the warp is scheduled, hiding the register prefetching... 

    Efficient Acceleration of Large-scale Graph Algorithms

    , M.Sc. Thesis Sharif University of Technology Gholami Shahrouz, Soheil (Author) ; Saleh Kaleybar, Saber (Supervisor) ; Hashemi, Matin (Supervisor)
    Abstract
    Given a social network modeled as a weighted graph G, the influence maximization problem seeks k vertices to become initially influenced, so as to maximize the expected number of influenced nodes under a particular diffusion model. The influence maximization problem has been proven to be NP-hard, and most proposed solutions are approximate greedy algorithms, which can guarantee a tunable approximation ratio for their results with respect to the optimal solution. The state-of-the-art algorithms are based on the Reverse Influence Sampling (RIS) technique, which can offer both computational efficiency and a non-trivial (1-1/e-ϵ)-approximation guarantee for any ϵ>0. RIS-based... 

    Modeling the effect of process variations on the delay and power of the digital circuit using fast simulators

    , Article 2013 21st Iranian Conference on Electrical Engineering, ICEE 2013 ; 2013 , 14-16 May ; 9781467356343 (ISBN) Amirsoleimani, A ; Soleimani, H ; Ahmadi, A ; Bavandpour, M ; Zwolinski, M ; Sharif University of Technology
    2013
    Abstract
    Process variation has an increasingly dramatic effect on delay and power as process geometries shrink. Even if the amount of variation remains the same as in previous generations, it accounts for a greater percentage of process geometries as they get smaller. An accurate prediction of path delay and power variability for real digital circuits in current technologies is therefore very important; however, its main drawback is the high runtime cost. In this paper, we present a new fast EDA tool which accelerates Monte Carlo-based statistical static timing analysis (SSTA) for complex digital circuits. Parallel platforms like the Message Passing Interface and POSIX® Threads and also the GPU-based CUDA... 
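
    As a hedged sketch of the kind of Monte Carlo kernel such a tool might run on the GPU (not the paper's actual tool; the names, the one-sample-per-thread scheme, and the independent Gaussian per-gate delay model are assumptions), each thread could draw one sample of a path delay:

        #include <curand_kernel.h>

        // Hypothetical sketch: Monte Carlo sampling of one path delay under Gaussian
        // per-gate variation, one sample per thread. nominal[] and sigma[] hold the
        // mean and standard deviation of each of the nGates gate delays along the path.
        __global__ void mc_path_delay(const float* __restrict__ nominal,
                                      const float* __restrict__ sigma,
                                      float* __restrict__ samples,
                                      int nGates, int nSamples, unsigned long long seed)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= nSamples) return;

            curandState state;
            curand_init(seed, i, 0, &state);   // independent random stream per sample

            float delay = 0.0f;
            for (int g = 0; g < nGates; ++g)
                delay += nominal[g] + sigma[g] * curand_normal(&state);   // sample each gate delay

            samples[i] = delay;   // delay distribution statistics are computed on the host
        }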

    A GPU based simulation platform for adaptive frequency Hopf oscillators

    , Article ICEE 2012 - 20th Iranian Conference on Electrical Engineering ; 2012 , Pages 884-888 ; 9781467311489 (ISBN) Soleimani, H ; Maleki, M. A ; Ahmadi, A ; Bavandpour, M ; Maharatna, K ; Zwolinski, M ; Sharif University of Technology
    2012
    Abstract
    In this paper we demonstrate a dynamical system simulator that runs on a single GPU. The model (running on an NVIDIA GT325M with 1 GB of memory) is up to 50 times faster than a CPU version when more than 10 million adaptive Hopf oscillators are simulated. The simulation shows that the oscillators tune to the correct frequencies for both discrete and continuous spectra. Due to its dynamic nature, the system is also capable of tracking non-stationary spectra. With the help of this model, the frequency spectrum of an ECG signal (a non-stationary signal) was obtained, and it was shown that the frequency-domain representation of the signal matches the one generated by MATLAB's FFT.
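
    As an illustrative sketch only (not the paper's simulator; it assumes the commonly used Righetti/Ijspeert adaptive-frequency Hopf formulation and hypothetical parameter names), one explicit-Euler update step with one oscillator per CUDA thread could look like:

        // Hypothetical sketch: one Euler step of n adaptive-frequency Hopf oscillators
        // driven by the shared input sample F. gamma, mu, and eps are assumed
        // convergence, amplitude, and coupling/learning constants.
        __global__ void hopf_step(float* x, float* y, float* omega,
                                  float F, float gamma, float mu, float eps,
                                  float dt, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;

            float xi = x[i], yi = y[i], wi = omega[i];
            float r2 = xi * xi + yi * yi;
            float r  = sqrtf(r2) + 1e-9f;    // avoid division by zero

            float dx = gamma * (mu - r2) * xi - wi * yi + eps * F;   // perturbed Hopf dynamics
            float dy = gamma * (mu - r2) * yi + wi * xi;
            float dw = -eps * F * yi / r;                            // frequency adaptation rule

            x[i]     = xi + dt * dx;
            y[i]     = yi + dt * dy;
            omega[i] = wi + dt * dw;
        }

    With many oscillators initialized at different omega values, each one drifts toward a frequency component present in F, which is how such a pool can recover a signal's spectrum.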

    Adaptive sparse matrix representation for efficient matrix–vector multiplication

    , Article Journal of Supercomputing ; November , 2015 , Pages 1-21 ; 09208542 (ISSN) Zardoshti, P ; Khunjush, F ; Sarbazi Azad, H ; Sharif University of Technology
    Springer New York LLC  2015
    Abstract
    A wide range of applications in engineering and scientific computing are based on sparse matrix computation. There exists a variety of data representations for keeping the non-zero elements of sparse matrices, and each representation favors some matrices while not working well for others. Existing studies tend to process all types of applications, e.g., the most popular one, matrix–vector multiplication, on different sparse matrix structures using a fixed representation. While Graphics Processing Units (GPUs) have evolved into a very attractive platform for general-purpose computations, most of the existing work on sparse matrix–vector multiplication (SpMV, for... 
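
    As a minimal sketch of one of the representations such an adaptive scheme would choose among (plain CSR SpMV with one row per thread, not the paper's adaptive method; all names are hypothetical):

        // Hypothetical sketch: y = A*x with A in Compressed Sparse Row (CSR) format.
        // rowPtr has nRows+1 entries; colIdx/vals hold the column index and value of
        // each non-zero element.
        __global__ void spmv_csr(const int* __restrict__ rowPtr,
                                 const int* __restrict__ colIdx,
                                 const float* __restrict__ vals,
                                 const float* __restrict__ x,
                                 float* __restrict__ y, int nRows)
        {
            int row = blockIdx.x * blockDim.x + threadIdx.x;
            if (row >= nRows) return;

            float sum = 0.0f;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += vals[j] * x[colIdx[j]];    // accumulate this row's non-zeros
            y[row] = sum;
        }

    Row-per-thread CSR works well when non-zeros are spread evenly across rows but load-imbalances badly on skewed matrices, which is why selecting the representation (CSR, ELL, COO, etc.) per matrix can pay off.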

    Adaptive sparse matrix representation for efficient matrix–vector multiplication

    , Article Journal of Supercomputing ; Volume 72, Issue 9 , 2016 , Pages 3366-3386 ; 09208542 (ISSN) Zardoshti, P ; Khunjush, F ; Sarbazi Azad, H ; Sharif University of Technology
    Springer New York LLC 
    Abstract
    A wide range of applications in engineering and scientific computing are based on sparse matrix computation. There exists a variety of data representations for keeping the non-zero elements of sparse matrices, and each representation favors some matrices while not working well for others. Existing studies tend to process all types of applications, e.g., the most popular one, matrix–vector multiplication, on different sparse matrix structures using a fixed representation. While Graphics Processing Units (GPUs) have evolved into a very attractive platform for general-purpose computations, most of the existing work on sparse matrix–vector multiplication (SpMV, for... 

    ITAP: Idle-time-aware power management for GPU execution units

    , Article ACM Transactions on Architecture and Code Optimization ; Volume 16, Issue 1 , 2019 ; 15443566 (ISSN) Sadrosadati, M ; Ehsani, S. B ; Falahati, H ; Ausavarungnirun, R ; Tavakkol, A ; Abaee, M ; Orosa, L ; Wang, Y ; Sarbazi Azad, H ; Mutlu, O ; Sharif University of Technology
    Association for Computing Machinery  2019
    Abstract
    Graphics Processing Units (GPUs) are widely used as the accelerator of choice for applications with massively data-parallel tasks. However, recent studies show that GPUs suffer heavily from resource underutilization, which, combined with their large static power consumption, imposes a significant power overhead. One of the most power-hungry components of a GPU, the execution units, frequently experiences idleness when (1) an underutilized warp is issued to the execution units, leading to partial lane idleness, and (2) there is no active warp to be issued for execution due to warp stalls (e.g., waiting for memory accesses and synchronization). Although large in total, the idle time of...