
Efficient nearest-neighbor data sharing in GPUs

Nematollahi, N ; Sharif University of Technology | 2021

  1. Type of Document: Article
  2. DOI: 10.1145/3429981
  3. Publisher: Association for Computing Machinery, 2021
  4. Abstract:
  5. Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) that can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism to eliminate nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. The cycle-accurate simulation indicates an average performance improvement of 21.8% and power reduction of up to 18.3% for stencil codes in General-Purpose Graphics Processing Unit (GPGPU) standard benchmark suites. We show that NeDa's performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange. © 2020 ACM
  6. Keywords:
  7. Benchmarking ; Cache memory ; Codes (symbols) ; Computer graphics ; Computer graphics equipment ; Electronic data interchange ; Graphics processing unit ; Image processing ; Program processors ; Benchmark suites ; Cycle-accurate simulation ; Exchange mechanism ; General purpose graphics processing unit (GPGPU) ; Nearest neighbors ; Power reductions ; Scientific applications ; Sharing mechanism ; Data Sharing
  8. Source: ACM Transactions on Architecture and Code Optimization ; Volume 18, Issue 1, 2021 ; 15443566 (ISSN)
  9. URL: https://dl.acm.org/doi/10.1145/3429981
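
The abstract's point about redundant memory accesses in stencil codes can be illustrated with a small sketch. This is not the NeDa hardware mechanism or any code from the article; it is a plain-Python model of a 3-point 1D stencil, and the function names, the averaging kernel, and the load counters are all assumptions made for illustration. The first version has every point fetch its own value and both neighbors from memory; the second has every point fetch only its own value into a per-point "register" that its neighbors read directly, which is roughly the idea behind neighbor-accessible registers.

```python
def stencil_naive(grid):
    """Hypothetical baseline: each interior point loads itself and both neighbors."""
    loads = 0
    out = list(grid)
    for i in range(1, len(grid) - 1):
        left, center, right = grid[i - 1], grid[i], grid[i + 1]
        loads += 3  # three memory loads per output point
        out[i] = (left + center + right) / 3.0
    return out, loads


def stencil_neighbor_exchange(grid):
    """Hypothetical model: each point loads once, neighbors read its register."""
    regs = list(grid)   # one memory load per point, kept in a "register"
    loads = len(grid)
    out = list(grid)
    for i in range(1, len(grid) - 1):
        # Neighbor values come from adjacent registers, not from memory.
        out[i] = (regs[i - 1] + regs[i] + regs[i + 1]) / 3.0
    return out, loads


grid = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
naive_out, naive_loads = stencil_naive(grid)
shared_out, shared_loads = stencil_neighbor_exchange(grid)
print(naive_out == shared_out, naive_loads, shared_loads)
```

Both versions compute the same result; only the number of modeled memory loads differs. The sketch says nothing about the performance, power, or area figures reported in the abstract, which come from the authors' own place-and-route and cycle-accurate simulation.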