
Cluster-based approach for improving graphics processing unit performance by inter streaming multiprocessors locality

Article, IET Computers and Digital Techniques; Volume 9, Issue 5, August 2015, Pages 275-282; 17518601 (ISSN); Keshtegar, M. M; Falahati, H; Hessabi, S; Sharif University of Technology

Institution of Engineering and Technology 2015

Abstract

As a new platform for high-performance and general-purpose computing, the graphics processing unit (GPU) is one of the most promising candidates for faster improvement in peak processing speed, low latency and high performance. As GPUs employ multithreading to hide latency, there is a small private data cache in each single instruction multiple thread (SIMT) core. Hence, in many applications these cores communicate through the global memory. Access to this public memory takes a long time and consumes a large amount of power. Moreover, the memory bandwidth is limited, which is quite challenging in parallel processing. The missed memory requests in last level cache that are followed by accesses...

FiRot: An efficient crosstalk mitigation method for Network-on-Chips

Article, Proceedings - 16th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2010, 13 December 2010 through 15 December 2010; December 2010, Pages 55-61; 9780769542898 (ISBN); Patooghy, A; Shafaei, M; Miremadi, S. G; Falahati, H; Taheri, S; Sharif University of Technology

2010

Abstract

This paper proposes an efficient crosstalk mitigation method for Network-on-Chips (NoCs). The proposed method investigates flits in each packet to minimize the number of harmful transition patterns appearing on the communication channels of the NoC. To do this, the content of every flit is rotated with respect to the flit previously sent through the channel. Rotation is done to find a rotated version of the flit which minimizes the number of harmful transition patterns. A tag field is added to the rotated flit to enable the receiving side to recover the original flit. The maximum number of rotations is bounded by a fixed value to minimize the timing and power overheads of the proposed method....
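
The rotate-and-tag idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define a "harmful transition pattern", so the sketch assumes one means two adjacent wires switching in opposite directions, and the names `rotate_left`, `harmful_transitions`, `encode_flit` and `decode_flit` are hypothetical. The sketch also ignores the transitions caused by the tag bits themselves.

```python
def rotate_left(value, shift, width):
    """Rotate a width-bit word left by shift positions."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask


def harmful_transitions(prev, curr, width):
    """Count adjacent wire pairs that switch in opposite directions.

    ASSUMPTION: this is only one plausible definition of a harmful
    transition pattern; the abstract does not specify one.
    """
    count = 0
    for i in range(width - 1):
        # Per-wire change between consecutive flits: -1, 0, or +1.
        d_low = ((curr >> i) & 1) - ((prev >> i) & 1)
        d_high = ((curr >> (i + 1)) & 1) - ((prev >> (i + 1)) & 1)
        if d_low * d_high == -1:
            count += 1
    return count


def encode_flit(prev_sent, flit, width, max_rotations):
    """Pick the rotation (0..max_rotations) with the fewest harmful patterns.

    Returns (tag, word): the tag records the rotation amount so that the
    receiver can undo it.
    """
    best_tag, best_word = 0, flit
    best_cost = harmful_transitions(prev_sent, flit, width)
    for shift in range(1, max_rotations + 1):
        candidate = rotate_left(flit, shift, width)
        cost = harmful_transitions(prev_sent, candidate, width)
        if cost < best_cost:
            best_tag, best_word, best_cost = shift, candidate, cost
    return best_tag, best_word


def decode_flit(tag, word, width):
    """Recover the original flit by rotating right by the tagged amount."""
    return rotate_left(word, width - tag, width)


if __name__ == "__main__":
    previous = 0b01010101
    flit = 0b10101010
    tag, sent = encode_flit(previous, flit, width=8, max_rotations=3)
    print(tag, format(sent, "08b"), format(decode_flit(tag, sent, 8), "08b"))
```

Bounding the search with `max_rotations` mirrors the fixed limit mentioned in the abstract: fewer candidate rotations mean a smaller tag field and less encoding work, at the cost of possibly missing a better rotation.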

Neda: supporting direct inter-core neighbor data exchange in GPUs

Article, IEEE Computer Architecture Letters; Volume 17, Issue 2, 2018, Pages 225-229; 15566056 (ISSN); Nematollahi, N; Sadrosadati, M; Falahati, H; Barkhordar, M; Sarbazi Azad, H; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2018

Abstract

Image processing applications employ various filters for several purposes, such as enhancing the images and extracting the features. Recent studies show that filters in image processing applications take a substantial amount of the execution time, and it is crucial to boost their performance to improve the overall performance of the image processing applications. Image processing filters require a significant amount of data sharing among threads which are in charge of filtering neighbor pixels. Graphics Processing Units (GPUs) attempt to satisfy the demand of data sharing by providing the scratch-pad memory, shuffle instructions, and on-chip caches. However, we observe that these mechanisms...

Efficient nearest-neighbor data sharing in GPUs

Article, ACM Transactions on Architecture and Code Optimization; Volume 18, Issue 1, 2021; 15443566 (ISSN); Nematollahi, N; Sadrosadati, M; Falahati, H; Barkhordar, M; Drumond, M. P; Sarbazi Azad, H; Falsafi, B; Sharif University of Technology

Association for Computing Machinery 2021

Abstract

Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data...
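
The nearest-neighbor dependence described in this abstract can be made concrete with a small sequential sketch in Python. It is only an illustration of what a stencil computes, not the mechanism proposed in the article: the five-point averaging rule, the unchanged boundary cells, and the function name `five_point_stencil` are all assumptions chosen for brevity.

```python
def five_point_stencil(grid):
    """Apply one step of a five-point averaging stencil to a 2D grid.

    Each interior point becomes the mean of itself and its four
    nearest neighbors (up, down, left, right).
    ASSUMPTION: boundary points are copied through unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy so all reads see the old values
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (
                grid[r][c]
                + grid[r - 1][c]
                + grid[r + 1][c]
                + grid[r][c - 1]
                + grid[r][c + 1]
            ) / 5.0
    return out


if __name__ == "__main__":
    grid = [
        [0.0, 0.0, 0.0],
        [0.0, 5.0, 0.0],
        [0.0, 0.0, 0.0],
    ]
    for row in five_point_stencil(grid):
        print(row)
```

In a typical GPU version, each output point would be computed by its own thread, so every input value is read by up to five neighboring threads. That overlap is the data sharing between nearest-neighbor threads that the abstract refers to.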

ITAP: Idle-time-aware power management for GPU execution units

Article, ACM Transactions on Architecture and Code Optimization; Volume 16, Issue 1, 2019; 15443566 (ISSN); Sadrosadati, M; Ehsani, S. B; Falahati, H; Ausavarungnirun, R; Tavakkol, A; Abaee, M; Orosa, L; Wang, Y; Sarbazi Azad, H; Mutlu, O; Sharif University of Technology

Association for Computing Machinery 2019

Abstract

Graphics Processing Units (GPUs) are widely used as the accelerator of choice for applications with massively data-parallel tasks. However, recent studies show that GPUs suffer heavily from resource underutilization, which, combined with their large static power consumption, imposes a significant power overhead. One of the most power-hungry components of a GPU, the execution units, frequently experiences idleness when (1) an underutilized warp is issued to the execution units, leading to partial lane idleness, and (2) there is no active warp to be issued for execution due to warp stalls (e.g., waiting for memory access and synchronization). Although large in total, the idle time of...

Highly concurrent latency-tolerant register files for GPUs

Article, ACM Transactions on Computer Systems; Volume 37, Issue 1-4, 2021; 07342071 (ISSN); Sadrosadati, M; Mirhosseini, A; Hajiabadi, A; Ehsani, S. B; Falahati, H; Sarbazi Azad, H; Drumond, M; Falsafi, B; Ausavarungnirun, R; Mutlu, O; Sharif University of Technology

Association for Computing Machinery 2021

Abstract

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical...