
Facilitating Data Exchange Between Streaming Processors in GPUs

Nematollahizadeh Mahani, Negin Sadat | 2021

  1. Type of Document: Ph.D. Dissertation
  2. Language: Farsi
  3. Document No: 54424 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Sarbazi-Azad, Hamid
  7. Abstract: GPUs are used today to accelerate a wide range of general-purpose applications, from regular stencil applications to irregular graph processing applications. With the significant growth of Thread Level Parallelism (TLP) in General-Purpose GPU (GPGPU) applications, the need for data sharing between threads has become more pronounced. Several mechanisms exist for reusing data on the GPU (e.g., on-chip caches and shuffle instructions), but each has its own drawbacks and scope limitations: shared memory is restricted to a thread block, and shuffle instructions to a warp. For standard general-purpose applications, the L1 data cache is the main GPU mechanism for reusing data among processing threads. However, a high rate of data reuse among threads leads to redundant accesses, L1 data cache contention, reduced effective bandwidth, and performance loss. We therefore look for mechanisms that supply the data required by the processing threads at a higher rate. To this end, we divide inter-thread data reuse into two general categories: 1) data sharing with neighboring threads, and 2) data sharing with non-neighboring threads. In the first category, data is shared between streaming processors that are adjacent to each other; in the second, the processors that share data are not adjacent. Data sharing between neighboring streaming processors is most common in stencil applications, and the stencil computation pattern appears in many real applications (e.g., image processing, machine learning, and scientific applications); illustrative CUDA sketches of this reuse pattern are given after the keyword list. To reuse neighbor data, we propose NeDa, a direct neighbor data sharing mechanism that embeds two registers in each streaming processor (SP) that can be accessed by the neighboring SP cores. The registers are compiler-allocated and serve as a data exchange path that eliminates the corresponding shared accesses. Cycle-accurate simulation shows an average performance improvement of 21.8% and a power reduction of up to 18.3% for stencil codes from standard GPGPU benchmark suites, with an area overhead of 1.3%. Irregular applications, which do not follow a specific data reuse pattern, cannot take advantage of the NeDa mechanism. To provide faster data reuse for this category of computations, we propose LoTUS, a minimally sized, fully associative L0 cache that captures the primary working set of latency-sensitive data-parallel applications while dramatically reducing load-to-use latency by using conventional high-performance, low-density SRAM cells. Cycle-accurate simulation results show an average performance improvement of 21.6% and an average energy reduction of 26.4% for the irregular workloads, with an area overhead of 1%.
  8. Keywords: Data Sharing ; Cache Memory ; Graphics Processing ; Core-to-Core Communication ; Thread Level Parallelism ; Streaming Algorithm
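The sketches below are not code from the dissertation; they are hypothetical CUDA kernels added to illustrate the reuse pattern the abstract describes. In the first, a 1-D three-point stencil, thread i loads in[i-1], in[i], and in[i+1], so every input element is fetched by three different threads through the L1 data cache; this redundant neighbor traffic is the kind of cross-thread reuse that NeDa targets by letting adjacent SP cores exchange values directly.

```cuda
// Hypothetical 1-D three-point stencil kernel (illustration only, not from the
// dissertation). Each thread reads its own element plus both neighbors, so
// every input element is loaded by three different threads via the L1 cache.
__global__ void stencil1d(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        // in[i-1] and in[i+1] are also loaded by threads i-1 and i+1:
        // this redundant cross-thread reuse is what NeDa aims to remove.
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
    }
}
```

The abstract also notes the scope limits of the existing sharing mechanisms: shared memory stages data only within a thread block, and shuffle instructions exchange values only within a warp. A warp-shuffle variant of the same stencil (again an assumed illustration) shows that limit: lanes at the warp boundary still fall back to ordinary global loads.

```cuda
// Warp-shuffle variant of the same stencil (illustrative sketch). Each thread
// loads one element and receives its neighbors' values through
// __shfl_up_sync / __shfl_down_sync, but the exchange works only inside a
// warp, so boundary lanes must still issue redundant global loads.
__global__ void stencil1d_shfl(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float center = (i < n) ? in[i] : 0.0f;
    const unsigned full = 0xffffffffu;
    float left  = __shfl_up_sync(full, center, 1);   // value held by lane-1
    float right = __shfl_down_sync(full, center, 1); // value held by lane+1
    int lane = threadIdx.x & 31;
    if (lane == 0  && i > 0)     left  = in[i - 1];  // no lane to the left
    if (lane == 31 && i < n - 1) right = in[i + 1];  // no lane to the right
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * left + 0.5f * center + 0.25f * right;
}
```

The kernel names, coefficients, and data layout above are assumptions made for illustration; the dissertation's NeDa registers and LoTUS L0 cache address the same reuse without the per-block or per-warp scope restriction of shared memory and shuffles.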
