Search for: graphics-processing-unit
Total 34 records

    Unifying L1 Data Cache and Shared Memory in GPUs

    , M.Sc. Thesis Sharif University of Technology Yousefzadeh-Asl-Miandoab, Ehsan (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    Graphics Processing Units (GPUs) employ a scratch-pad memory (a.k.a. shared memory) in each streaming multiprocessor to accelerate data sharing among the threads of a thread block and to provide a software-managed cache for programmers. However, we observe that about 60% of the GPU workloads in several well-known benchmark suites do not use shared memory. Moreover, among the workloads that do use shared memory, about 42% of its capacity goes unused, on average. On the other hand, we observe that many general-purpose GPU applications suffer from the low hit rate and limited bandwidth of the L1 data cache. We aim to use the shared memory space and its corresponding bandwidth to improve the L1 data cache,... 
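    To make the scratch-pad usage concrete, below is a minimal CUDA sketch (not taken from the thesis) in which each thread block stages its input tile, plus a one-element halo, in software-managed __shared__ memory before computing a 3-point average; the kernel name blur3 and all sizes are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread block stages a tile of the input in the software-managed
// shared memory (scratch-pad); every thread then reads its neighbours
// from the tile instead of issuing extra global-memory loads.
__global__ void blur3(const float* in, float* out, int n) {
    extern __shared__ float tile[];              // dynamic shared memory
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                   // +1 leaves room for halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)                        // left halo element
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)           // right halo element
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                             // tile is now fully populated

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20, block = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;
    size_t shmem = (block + 2) * sizeof(float);  // tile + two halo cells
    blur3<<<(n + block - 1) / block, block, shmem>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[1] = %f\n", out[1]);             // expect 1.0
    cudaFree(in); cudaFree(out);
}
```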

    Improving Performance of GPGPU Considering Reliability Requirements

    , M.Sc. Thesis Sharif University of Technology Motallebi, Maryam (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    In recent years, GPUs have become ideal candidates for processing a variety of high-performance applications. By relying on thousands of concurrent threads and on the computational power of large numbers of computing units, GPGPUs provide high efficiency and throughput. To realize the potential computational power of GPGPUs for broader types of applications, some modifications are needed. By understanding the features and properties of applications, we can execute them more effectively on GPUs. Therefore, considering applications' behavior, we define five different categories for them. Every category has its own characteristics, and we change the configuration of the GPU... 

    Viterbi Decoder Implementation on GPGPU

    , M.Sc. Thesis Sharif University of Technology Mohammadidoost, Alireza (Author) ; Hashemi, Matin (Supervisor)
    Abstract
    In this project, a method is employed to implement a Viterbi decoder on a GPGPU. The method is based on combining all steps of the algorithm. This combination poses challenges arising from the differences between the algorithm's steps, so this project develops solutions to these challenges and achieves a high-throughput Viterbi decoder
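    As a rough illustration of how Viterbi decoding maps onto GPU threads, the toy CUDA kernel below (not the thesis implementation) assigns one thread per trellis state of a 4-state, rate-1/2 convolutional code, runs the add-compare-select recursion step by step, and lets one thread perform the traceback; the generators (7, 5), the kernel name viterbi, and the tiny message are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NSTATES 4        // constraint length 3 -> 2 memory bits
#define T       8        // number of decoded information bits

// One thread per trellis state: each step performs the add-compare-select
// (ACS) recursion for its state, then a single thread runs the traceback.
// Rate-1/2 code with generators g0 = 111b (7) and g1 = 101b (5).
__global__ void viterbi(const unsigned char* rx /* 2*T hard bits */,
                        unsigned char* decoded /* T bits */) {
    __shared__ int metric[2][NSTATES];            // old / new path metrics
    __shared__ unsigned char survivor[T][NSTATES];
    int s = threadIdx.x;                          // this thread's state

    metric[0][s] = (s == 0) ? 0 : 1 << 20;        // start in the all-zero state
    __syncthreads();

    for (int t = 0; t < T; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        int u  = s >> 1;                          // input bit leading to state s
        int b1 = s & 1;
        int best = 1 << 20; unsigned char bestPrev = 0;
        for (int b2 = 0; b2 < 2; ++b2) {          // two possible predecessors
            int prev = (b1 << 1) | b2;
            int c0 = u ^ b1 ^ b2, c1 = u ^ b2;    // expected encoder outputs
            int bm = (c0 != rx[2 * t]) + (c1 != rx[2 * t + 1]);  // Hamming dist
            int m  = metric[cur][prev] + bm;
            if (m < best) { best = m; bestPrev = (unsigned char)prev; }
        }
        metric[nxt][s] = best;
        survivor[t][s] = bestPrev;
        __syncthreads();
    }

    if (s == 0) {                                 // traceback from best end state
        int cur = T & 1, state = 0;
        for (int k = 1; k < NSTATES; ++k)
            if (metric[cur][k] < metric[cur][state]) state = k;
        for (int t = T - 1; t >= 0; --t) {
            decoded[t] = (unsigned char)(state >> 1);  // input bit of that step
            state = survivor[t][state];
        }
    }
}

int main() {
    // Encode 1 0 1 1 0 0 1 0 on the host, then decode it on the GPU.
    unsigned char bits[T] = {1, 0, 1, 1, 0, 0, 1, 0}, *rx, *out;
    cudaMallocManaged(&rx, 2 * T); cudaMallocManaged(&out, T);
    int b1 = 0, b2 = 0;
    for (int t = 0; t < T; ++t) {
        int u = bits[t];
        rx[2 * t] = u ^ b1 ^ b2; rx[2 * t + 1] = u ^ b2;
        b2 = b1; b1 = u;
    }
    viterbi<<<1, NSTATES>>>(rx, out);
    cudaDeviceSynchronize();
    for (int t = 0; t < T; ++t) printf("%d", out[t]);
    printf("\n");                                 // expect 10110010
    cudaFree(rx); cudaFree(out);
}
```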

    Energy Reduction in GPGPUs

    , Ph.D. Dissertation Sharif University of Technology Falahati, Hajar (Author) ; Hessabi, Shahin (Supervisor) ; Baniasadi, Amirali (Co-Advisor)
    Abstract
    The number of transistors on a single chip is growing exponentially, which results in a huge increase in power consumption and temperature. Parallel processing is a solution that concentrates on increasing the number of cores instead of improving single-thread performance. Graphics Processing Units (GPUs) are parallel accelerators that are categorized as many-core systems. However, recent research shows that their power and energy consumption are increasing. In this research, we aim to propose methods to make GPGPUs energy efficient. To this end, we evaluated the detailed power consumption of GPGPUs. Our results show that the memory sub-system is a critical bottleneck in terms of performance and... 

    Efficient Acceleration of Large-scale Graph Algorithms

    , M.Sc. Thesis Sharif University of Technology Gholami Shahrouz, Soheil (Author) ; Saleh Kaleybar, Saber (Supervisor) ; Hashemi, Matin (Supervisor)
    Abstract
    Given a social network modeled as a weighted graph G, the influence maximization problem seeks k vertices to become initially influenced, to maximize the expected number of influenced nodes under a particular diffusion model. The influence maximization problem has been proven to be NP-hard, and most proposed solutions to the problem are approximate greedy algorithms, which can guarantee a tunable approximation ratio for their results with respect to the optimal solution. The state-of-the-art algorithms are based on the Reverse Influence Sampling (RIS) technique, which can offer both computational efficiency and a non-trivial (1-1/e-ϵ)-approximation ratio guarantee for any ϵ>0. RIS-based... 

    Power Reduction in GPUs through Intra-Warp Instruction Execution Reordering

    , M.Sc. Thesis Sharif University of Technology Aghilinasab, Homa (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    As technology shrinks, static power consumption worsens. Moreover, considering the widespread use of General-Purpose Graphics Processing Units (GPGPUs), reducing their static power is becoming an important issue. Execution units are among the most power-hungry components of a GPGPU and play an essential role in its total power consumption. On the other hand, power gating is a promising method to reduce static power consumption. In this project, we propose a novel method to implement power gating for execution units with negligible performance and power overheads. We utilize out-of-order execution within a warp to keep the power-gated resources in the off state for more than... 

    A Reconfigurable and Adaptive Shared-memory Architecture for GPUs

    , M.Sc. Thesis Sharif University of Technology Abbasitabar, Hamed (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    The importance of shared memory (scratchpad memory) in GPGPU programming, its limited size, and its influence on the overall performance of the GPGPU have motivated efforts to optimize its use. Moreover, the design trend of new GPGPUs shows that the ratio of shared memory to processing elements is shrinking. As a result, the limited capacity of shared memory becomes a bottleneck for a GPU to host a high number of thread blocks, limiting the otherwise available thread-level parallelism (TLP). In this thesis, we introduce a reconfigurable and adaptive shared memory architecture for GPGPUs based on resource sharing, which can be exploited for throughput improvement... 

    Data Sharing Aware Scheduling for Reducing Memory Accesses in GPGPUs

    , M.Sc. Thesis Sharif University of Technology Saber Latibari, Banafsheh (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    Access to global memory is one of the bottlenecks for performance and energy in GPUs. Graphics processors use multithreading within streaming multiprocessors to hide memory access latency. However, due to the high number of concurrent memory requests, the bandwidth of lower-level memories and the interconnection network is quickly saturated. Recent research suggests that adjacent thread blocks share a significant amount of data blocks. If adjacent thread blocks are assigned to a specific streaming multiprocessor, shared data blocks can be reused by these thread blocks. However, the thread block scheduler assigns adjacent thread blocks to different streaming multiprocessors, which increases... 

    A Novel STT-RAM Architecture for Last Level Shared Caches in GPUs

    , M.Sc. Thesis Sharif University of Technology Samavatian, Mohammad Hossein (Author) ; Sarbazi-Azad, Hamid (Supervisor)
    Abstract
    Due to the high processing capacity of GPGPUs and their need for a large, high-speed memory shared among thread-processor clusters, replacing SRAM with Spin-Transfer Torque (STT) RAM can significantly reduce power consumption and substantially increase memory capacity in GPGPUs. In a many-core GPGPU that executes threads in parallel, the advantages of STT-RAM technology, such as low read latency and high density, can be highly effective. However, STT-RAM will only guarantee reduced application run time and increased thread throughput if write operations are managed and scheduled so as to impose the least possible overhead on read operations. The... 

    Design and Hardware Implementation of Optical Character Recognition

    , M.Sc. Thesis Sharif University of Technology Dezfuli, Sina (Author) ; Hashemi, Matin (Supervisor)
    Abstract
    The objective of OCR systems is to retrieve machine-encoded text from a raster image. Despite the abundance of powerful OCR algorithms for English, there are not many for Farsi. Our proposed algorithm comprises pre-processing, line detection, sub-word detection and segmentation, feature extraction, and classification. Furthermore, hardware implementation and acceleration of this system on a GPGPU is presented. The algorithm was tested on five fonts, including Titr, Lotus, Yekan, Koodak, and Nazanin, and an average accuracy above 90% was achieved

    Mitigating Memory Access Overhead in GPUs Through Reproduction of Intermediate Results

    , M.Sc. Thesis Sharif University of Technology Barati, Rahil (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    GPUs employ large register files to reduce the performance and energy overheads of memory accesses by improving thread-level parallelism and reducing the number of data movements from off-chip memory. Recently, the latency-tolerant register file (LTRF) was proposed to enable high-capacity register files with low power and area cost. LTRF is a two-level register file in which the first level is a small, fast register cache and the second level is a large, slow main register file. LTRF uses a near-perfect register prefetching mechanism in which a warp's registers are prefetched from the main register file into the register file cache before the warp is scheduled, hiding the register prefetching... 

    CuPC: CUDA-Based parallel PC algorithm for causal structure learning on GPU

    , Article IEEE Transactions on Parallel and Distributed Systems ; Volume 31, Issue 3 , 2020 , Pages 530-542 Zarebavani, B ; Jafarinejad, F ; Hashemi, M ; Salehkaleybar, S ; Sharif University of Technology
    IEEE Computer Society  2020
    Abstract
    The main goal in many fields of the empirical sciences is to discover causal relationships among a set of variables from observational data. The PC algorithm is one of the promising solutions for learning the underlying causal structure by performing a number of conditional independence tests. In this paper, we propose a novel GPU-based parallel algorithm, called cuPC, to execute an order-independent version of PC. The proposed solution has two variants, cuPC-E and cuPC-S, which parallelize PC in two different ways for the multivariate normal distribution. Experimental results show the scalability of the proposed algorithms with respect to the number of variables, the number of samples, and different graph... 
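    The toy CUDA kernel below (not the published cuPC-E/cuPC-S code) sketches only the level-0 step of PC: one thread per variable pair applies Fisher's z-transform to the sample correlation and removes the edge when independence cannot be rejected. The function name ci_level0, the 1.96 critical value, and the tiny correlation matrix are illustrative assumptions; higher-order tests over conditioning sets are omitted.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Level-0 step of the PC algorithm: every thread tests one pair (i, j) for
// marginal independence via Fisher's z-transform of the sample correlation,
// deleting the edge when the test statistic is below the critical value.
__global__ void ci_level0(const float* corr, int* adj, int n,
                          int samples, float z_crit) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n || i >= j) return;           // upper triangle only

    float r = corr[i * n + j];
    float z = 0.5f * logf((1.0f + r) / (1.0f - r));   // Fisher's z-transform
    float stat = sqrtf((float)(samples - 3)) * fabsf(z);
    if (stat < z_crit) {                              // cannot reject independence
        adj[i * n + j] = 0;
        adj[j * n + i] = 0;
    }
}

int main() {
    const int n = 3, samples = 1000;
    // Variables 0 and 1 are strongly correlated; variable 2 is nearly independent.
    float corr_h[n * n] = { 1.00f, 0.80f, 0.02f,
                            0.80f, 1.00f, 0.01f,
                            0.02f, 0.01f, 1.00f };
    float* corr; int* adj;
    cudaMallocManaged(&corr, sizeof(corr_h));
    cudaMallocManaged(&adj,  n * n * sizeof(int));
    for (int k = 0; k < n * n; ++k) {
        corr[k] = corr_h[k];
        adj[k] = (k / n != k % n);                    // complete graph, no self-loops
    }

    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    ci_level0<<<grid, block>>>(corr, adj, n, samples, 1.96f /* ~alpha = 0.05 */);
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i, printf("\n"))
        for (int j = 0; j < n; ++j) printf("%d ", adj[i * n + j]);
    cudaFree(corr); cudaFree(adj);
}
```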

    Adaptive sparse matrix representation for efficient matrix–vector multiplication

    , Article Journal of Supercomputing ; November , 2015 , Pages 1-21 ; 09208542 (ISSN) Zardoshti, P ; Khunjush, F ; Sarbazi Azad, H ; Sharif University of Technology
    Springer New York LLC  2015
    Abstract
    A wide range of applications in engineering and scientific computing are based on the sparse matrix computation. There exist a variety of data representations to keep the non-zero elements in sparse matrices, and each representation favors some matrices while not working well for some others. The existing studies tend to process all types of applications, e.g., the most popular application which is matrix–vector multiplication, with different sparse matrix structures using a fixed representation. While Graphics Processing Units (GPUs) have evolved into a very attractive platform for general purpose computations, most of the existing works on sparse matrix–vector multiplication (SpMV, for... 
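    For reference, a baseline "CSR-scalar" sparse matrix-vector multiplication kernel (one thread per row) is sketched below in CUDA; adaptive schemes such as the one described above select among several such kernels and storage formats depending on the sparsity structure. This is an illustrative sketch, not the paper's implementation.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Baseline "CSR-scalar" SpMV: one thread computes one row of y = A * x.
__global__ void spmv_csr(int rows, const int* rowPtr, const int* colIdx,
                         const float* vals, const float* x, float* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float sum = 0.0f;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)   // non-zeros of this row
        sum += vals[k] * x[colIdx[k]];
    y[row] = sum;
}

int main() {
    // 3x3 matrix [[4 0 1], [0 2 0], [3 0 5]] in CSR form.
    int   rowPtr_h[] = {0, 2, 3, 5};
    int   colIdx_h[] = {0, 2, 1, 0, 2};
    float vals_h[]   = {4, 1, 2, 3, 5};
    float x_h[]      = {1, 1, 1};

    int *rowPtr, *colIdx; float *vals, *x, *y;
    cudaMallocManaged(&rowPtr, sizeof(rowPtr_h));
    cudaMallocManaged(&colIdx, sizeof(colIdx_h));
    cudaMallocManaged(&vals,   sizeof(vals_h));
    cudaMallocManaged(&x,      sizeof(x_h));
    cudaMallocManaged(&y,      3 * sizeof(float));
    memcpy(rowPtr, rowPtr_h, sizeof(rowPtr_h));
    memcpy(colIdx, colIdx_h, sizeof(colIdx_h));
    memcpy(vals,   vals_h,   sizeof(vals_h));
    memcpy(x,      x_h,      sizeof(x_h));

    spmv_csr<<<1, 32>>>(3, rowPtr, colIdx, vals, x, y);
    cudaDeviceSynchronize();
    printf("y = %.0f %.0f %.0f\n", y[0], y[1], y[2]);   // expect 5 2 8
    cudaFree(rowPtr); cudaFree(colIdx); cudaFree(vals); cudaFree(x); cudaFree(y);
}
```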

    Adaptive sparse matrix representation for efficient matrix–vector multiplication

    , Article Journal of Supercomputing ; Volume 72, Issue 9 , 2016 , Pages 3366-3386 ; 09208542 (ISSN) Zardoshti, P ; Khunjush, F ; Sarbazi Azad, H ; Sharif University of Technology
    Springer New York LLC 
    Abstract
    A wide range of applications in engineering and scientific computing are based on the sparse matrix computation. There exist a variety of data representations to keep the non-zero elements in sparse matrices, and each representation favors some matrices while not working well for some others. The existing studies tend to process all types of applications, e.g., the most popular application which is matrix–vector multiplication, with different sparse matrix structures using a fixed representation. While Graphics Processing Units (GPUs) have evolved into a very attractive platform for general purpose computations, most of the existing works on sparse matrix–vector multiplication (SpMV, for... 

    A GPU based simulation platform for adaptive frequency Hopf oscillators

    , Article ICEE 2012 - 20th Iranian Conference on Electrical Engineering ; 2012 , Pages 884-888 ; 9781467311489 (ISBN) Soleimani, H ; Maleki, M. A ; Ahmadi, A ; Bavandpour, M ; Maharatna, K ; Zwolinski, M ; Sharif University of Technology
    2012
    Abstract
    In this paper, we demonstrate a dynamical-system simulator that runs on a single GPU. The model (running on an NVIDIA GT325M with 1 GB of memory) is up to 50 times faster than a CPU version when more than 10 million adaptive Hopf oscillators are simulated. The simulation shows that the oscillators tune to the correct frequencies for both discrete and continuous spectra. Due to its dynamic nature, the system is also capable of tracking non-stationary spectra. With the help of this model, the frequency spectrum of an ECG signal (as a non-stationary signal) was obtained, and it was shown that the frequency-domain representation of the signal (i.e., its FFT) matches the one MATLAB generates
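    A minimal CUDA sketch of the idea, assuming Righetti-style adaptive-frequency Hopf dynamics integrated with forward Euler and one thread per oscillator, is shown below; the parameters (mu, eps, dt), the teaching signal, and the kernel name hopf_step are illustrative assumptions rather than the paper's implementation.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Forward-Euler step of the adaptive-frequency Hopf oscillator: one thread
// integrates one oscillator; all oscillators are driven by the same teaching
// signal F(t), and each oscillator's frequency omega adapts toward a
// component of that signal.
__global__ void hopf_step(float* x, float* y, float* omega, int n,
                          float F, float mu, float eps, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i], yi = y[i], wi = omega[i];
    float r2 = xi * xi + yi * yi;
    float r  = sqrtf(r2) + 1e-9f;                     // avoid division by zero
    float dx = (mu - r2) * xi - wi * yi + eps * F;
    float dy = (mu - r2) * yi + wi * xi;
    float dw = -eps * F * yi / r;                     // frequency adaptation
    x[i] = xi + dt * dx;  y[i] = yi + dt * dy;  omega[i] = wi + dt * dw;
}

int main() {
    const int n = 1024; const float f0 = 2.0f, dt = 1e-3f;
    float *x, *y, *w;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&w, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 0.0f; w[i] = 5.0f + 0.01f * i; }

    for (int step = 0; step < 200000; ++step) {       // ~200 s of simulated time
        float F = sinf(2.0f * 3.14159265f * f0 * step * dt);  // teaching signal
        hopf_step<<<(n + 255) / 256, 256>>>(x, y, w, n, F, 1.0f, 0.9f, dt);
    }
    cudaDeviceSynchronize();
    printf("omega[0] after adaptation: %.2f (target %.2f)\n",
           w[0], 2.0f * 3.14159265f * f0);            // should drift toward 2*pi*f0
    cudaFree(x); cudaFree(y); cudaFree(w);
}
```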

    GIM: GPU accelerated RIS-Based influence maximization algorithm

    , Article IEEE Transactions on Parallel and Distributed Systems ; Volume 32, Issue 10 , 2021 , Pages 2386-2399 ; 10459219 (ISSN) Shahrouz, S ; Salehkaleybar, S ; Hashemi, M ; Sharif University of Technology
    IEEE Computer Society  2021
    Abstract
    Given a social network modeled as a weighted graph G, the influence maximization problem seeks k vertices to become initially influenced, to maximize the expected number of influenced nodes under a particular diffusion model. The influence maximization problem has been proven to be NP-hard, and most proposed solutions to the problem are approximate greedy algorithms, which can guarantee a tunable approximation ratio for their results with respect to the optimal solution. The state-of-the-art algorithms are based on the Reverse Influence Sampling (RIS) technique, which can offer both computational efficiency and a non-trivial (1-1/e-ϵ)-approximation ratio guarantee for any ϵ>0.... 
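    To illustrate one building block of RIS-based greedy selection on a GPU, the sketch below (not GIM's implementation) launches one thread per reverse-reachable (RR) set to count, per vertex, how many uncovered RR sets contain it; the host would then pick the top vertex, mark its sets as covered, and repeat k times. The kernel name, the flattened set layout, and the tiny example are illustrative assumptions.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Coverage-counting step of RIS-based greedy selection: the sampled
// reverse-reachable (RR) sets are stored in a flattened, CSR-like layout,
// and each thread scans one uncovered set, atomically incrementing the
// counter of every vertex it contains.
__global__ void count_coverage(const int* setPtr, const int* setItems,
                               const int* covered, int numSets, int* counts) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= numSets || covered[s]) return;           // skip already-covered sets
    for (int k = setPtr[s]; k < setPtr[s + 1]; ++k)
        atomicAdd(&counts[setItems[k]], 1);
}

int main() {
    // 4 RR sets over 5 vertices: {0,1} {1,2} {1,3} {4}
    int setPtr_h[]   = {0, 2, 4, 6, 7};
    int setItems_h[] = {0, 1, 1, 2, 1, 3, 4};
    const int numSets = 4, numVerts = 5;

    int *setPtr, *setItems, *covered, *counts;
    cudaMallocManaged(&setPtr,   sizeof(setPtr_h));
    cudaMallocManaged(&setItems, sizeof(setItems_h));
    cudaMallocManaged(&covered,  numSets  * sizeof(int));
    cudaMallocManaged(&counts,   numVerts * sizeof(int));
    memcpy(setPtr,   setPtr_h,   sizeof(setPtr_h));
    memcpy(setItems, setItems_h, sizeof(setItems_h));
    for (int i = 0; i < numSets; ++i)  covered[i] = 0;
    for (int v = 0; v < numVerts; ++v) counts[v]  = 0;

    count_coverage<<<1, 32>>>(setPtr, setItems, covered, numSets, counts);
    cudaDeviceSynchronize();
    for (int v = 0; v < numVerts; ++v)
        printf("vertex %d covers %d RR sets\n", v, counts[v]);  // vertex 1 -> 3
    cudaFree(setPtr); cudaFree(setItems); cudaFree(covered); cudaFree(counts);
}
```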

    GPU implementation of split-field finite difference time-domain method for Drude-Lorentz dispersive media

    , Article Progress in Electromagnetics Research ; Volume 125 , 2012 , Pages 55-77 ; 10704698 (ISSN) Shahmansouri, A ; Rashidian, B ; Sharif University of Technology
    2012
    Abstract
    The split-field finite-difference time-domain (SF-FDTD) method can overcome the limitation of ordinary FDTD in analyzing periodic structures under oblique incidence. On the other hand, the huge run times of 3D SF-FDTD are a major practical burden in its use for the analysis and design of nanostructures, particularly in the presence of dispersive media. Here, details of a parallel implementation of the 3D SF-FDTD method for dispersive media, combined with the total-field/scattered-field (TF/SF) method for injecting an oblique plane wave, are discussed. A graphics processing unit (GPU) has been used for this purpose, and very large speed-up factors have been achieved. Also, a previously reported formulation of SF-FDTD based... 
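    As a simplified illustration of how FDTD updates parallelize on a GPU, the sketch below implements a plain 1D Yee scheme (no split fields, no Drude-Lorentz dispersion, no TF/SF source) with one thread per grid cell; all names and constants are illustrative assumptions, and the full SF-FDTD formulation adds more field components and auxiliary current terms per cell while mapping onto threads in the same way.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Plain 1D Yee FDTD: E and H are updated in alternating kernels,
// one thread per grid cell.
__global__ void update_H(float* Hy, const float* Ez, int n, float cH) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1) Hy[i] += cH * (Ez[i + 1] - Ez[i]);
}
__global__ void update_E(float* Ez, const float* Hy, int n, float cE) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) Ez[i] += cE * (Hy[i] - Hy[i - 1]);
}

int main() {
    const int n = 4096, steps = 2000, src = n / 2;
    float *Ez, *Hy;
    cudaMallocManaged(&Ez, n * sizeof(float));
    cudaMallocManaged(&Hy, n * sizeof(float));
    cudaMemset(Ez, 0, n * sizeof(float));
    cudaMemset(Hy, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    for (int t = 0; t < steps; ++t) {
        update_H<<<grid, block>>>(Hy, Ez, n, 0.5f);           // Courant number 0.5
        update_E<<<grid, block>>>(Ez, Hy, n, 0.5f);
        cudaDeviceSynchronize();
        Ez[src] += expf(-(t - 60.0f) * (t - 60.0f) / 200.0f); // soft Gaussian source
    }
    printf("Ez at probe cell %d: %e\n", src + 500, Ez[src + 500]);
    cudaFree(Ez); cudaFree(Hy);
}
```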

    LTRF: enabling high-capacity register files for GPUs via hardware/software cooperative register prefetching

    , Article 23rd International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, 24 March 2018 through 28 March 2018 ; 2018 , Pages 489-502 ; 9781450349116 (ISBN) Sadrosadati, M ; Mirhosseini, A ; Ehsani, S. B ; Sarbazi Azad, H ; Drumond, M ; Falsafi, B ; Ausavarungnirun, R ; Mutlu, O ; Sharif University of Technology
    Association for Computing Machinery  2018
    Abstract
    Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical... 

    ITAP: Idle-time-aware power management for GPU execution units

    , Article ACM Transactions on Architecture and Code Optimization ; Volume 16, Issue 1 , 2019 ; 15443566 (ISSN) Sadrosadati, M ; Ehsani, S. B ; Falahati, H ; Ausavarungnirun, R ; Tavakkol, A ; Abaee, M ; Orosa, L ; Wang, Y ; Sarbazi Azad, H ; Mutlu, O ; Sharif University of Technology
    Association for Computing Machinery  2019
    Abstract
    Graphics Processing Units (GPUs) are widely used as the accelerator of choice for applications with massively data-parallel tasks. However, recent studies show that GPUs suffer heavily from resource underutilization, which, combined with their large static power consumption, imposes a significant power overhead. One of the most power-hungry components of a GPU, the execution units, frequently experiences idleness when (1) an underutilized warp is issued to the execution units, leading to partial lane idleness, and (2) there is no active warp to be issued for execution due to warp stalls (e.g., waiting for memory access and synchronization). Although large in total, the idle time of... 

    Highly concurrent latency-tolerant register files for GPUs

    , Article ACM Transactions on Computer Systems ; Volume 37, Issue 1-4 , 2021 ; 07342071 (ISSN) Sadrosadati, M ; Mirhosseini, A ; Hajiabadi, A ; Ehsani, S. B ; Falahati, H ; Sarbazi Azad, H ; Drumond, M ; Falsafi, B ; Ausavarungnirun, R ; Mutlu, O ; Sharif University of Technology
    Association for Computing Machinery  2021
    Abstract
    Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical...