
Energy Reduction in GPGPUs

Falahati, Hajar | 2016

  1. Type of Document: Ph.D. Dissertation
  2. Language: Farsi
  3. Document No: 49132 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Hessabi, Shahin; Baniasadi, Amirali
  7. Abstract:
  8. The number of transistors on a single chip is growing exponentially, which leads to a large increase in power consumption and temperature. Parallel processing addresses this by increasing the number of cores instead of improving single-thread performance. Graphics Processing Units (GPUs) are parallel accelerators categorized as many-core systems. However, recent research shows that their power and energy consumption are increasing. In this research, we aim to propose methods to make GPGPUs energy efficient. To this end, we evaluated the detailed power consumption of GPGPUs. Our results show that the memory subsystem is a critical bottleneck in terms of both performance and energy. We also found that CTAs tend to share the same input data. However, the CTA schedulers of state-of-the-art GPUs distribute CTAs to different SMs in a round-robin fashion for better parallelism. Thus, data sharing across CTAs translates into redundant copies of the same data being brought into the L1 caches of multiple SMs. On each L1 cache miss, an SM simply accesses the lower-level memories to fetch the data, even when the data is available in a few adjacent SMs. Moreover, the on-chip L1 cache is not large enough to capture all of the intra-SM data sharing either. These redundant lower-level memory accesses waste resources and bandwidth across the entire memory hierarchy and exacerbate the energy overhead.
    To tackle the memory-subsystem challenge, we explored methods to capture both intra-SM and inter-SM data sharing. To capture intra-SM sharing, we proposed Idle SM Prefetching (ISP), which executes memory instructions in advance and stores the prefetched data in the texture cache (a sketch of this control flow follows the keyword list). ISP improves performance by 17%, on average, and decreases power and energy consumption by 14% and 29%, on average, respectively. We then proposed an inter-SM sharing approach that captures shared data among adjacent SMs, in which the L1 cache misses of one SM are opportunistically serviced by neighboring SMs. First, we propose a serial approach, called Inter-SM Cluster Sharing (ISMC), which looks up miss requests in the neighboring SMs one by one. This serial approach imposes delay and power overhead. As noted earlier, not all data is shared, so a sharing predictor is critical to reduce unnecessary tag lookups in the neighboring SMs. As the next step, we proposed a new way of sharing data across SMs, namely Energy-Efficient Data Sharing (E2DS). To mitigate the complexity of sharing, we limit sharing to a cluster of a fixed number of neighboring SMs. E2DS uses a simple two-bit predictor that predicts whether the data exists in the cluster; only when the data is predicted to be shared is the cache-miss request sent to the SMs in the cluster (a sketch of such a predictor also follows the keyword list). Our evaluation shows that the accuracy of our predictor is nearly 87%, and that 31.1% of L1 cache misses, on average, are satisfied by SM-level data exchange. These requests are handled without the burden of accessing the L2 cache and off-chip memory. Through detailed software simulations and gate-level design exploration, we show that our approach improves performance by 13%, on average, while decreasing power and energy by 10% and 20.4%, on average, respectively. We extended E2DS to support data sharing for write requests by modifying the cache policy, namely E2DS+W. We also show that E2DS performs much better than throttling approaches, which try to limit the number of running threads. TE2DS is an extension that provides the sharing mechanism while managing the number of running threads.
  9. Keywords:
  10. Power Consumption ; Energy Consumption ; General Purpose Graphic Processing Units (GPGPU) ; Energy Reduction ; Data Sharing ; Memory Bottleneck
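
A minimal, hypothetical C++ sketch of the ISP control flow described in the abstract, assuming a simulator-style SM model: whenever the SM has no ready warp, it executes one future memory instruction early and installs the fetched line in a texture-cache model, so the later demand access hits. All class and member names, and the assumption that load addresses are pre-resolved, are illustrative, not the thesis implementation.

    #include <cstdint>
    #include <deque>
    #include <unordered_set>

    // Toy texture-cache model: tracks which line addresses are resident.
    struct TextureCacheModel {
        std::unordered_set<uint64_t> lines;
        void install(uint64_t lineAddr) { lines.insert(lineAddr); }
        bool contains(uint64_t lineAddr) const { return lines.count(lineAddr) != 0; }
    };

    // A load whose address is already resolved and can run ahead of demand.
    struct PendingLoad { uint64_t lineAddr; };

    class IdleSmPrefetcher {
        std::deque<PendingLoad> upcoming;   // memory instructions of future warps
        TextureCacheModel &tex;
    public:
        explicit IdleSmPrefetcher(TextureCacheModel &t) : tex(t) {}
        void enqueue(uint64_t lineAddr) { upcoming.push_back({lineAddr}); }

        // Called on every cycle in which the SM has no ready warp:
        // execute one future memory instruction in advance and stash the
        // data in the texture cache, turning a later miss into a hit.
        void onIdleCycle() {
            if (upcoming.empty()) return;
            PendingLoad ld = upcoming.front();
            upcoming.pop_front();
            if (!tex.contains(ld.lineAddr))
                tex.install(ld.lineAddr);
        }
    };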
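The cluster-sharing decision in E2DS can likewise be sketched in a few lines of C++. The saturating two-bit counters, the table size (256 entries), the 128-byte line granularity, and the handleL1Miss/probeCluster names below are all assumptions made for illustration; the abstract specifies only that a simple two-bit predictor gates miss requests to the cluster.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    class SharingPredictor {
        static constexpr std::size_t kEntries = 256;   // assumed table size
        std::array<uint8_t, kEntries> counters{};      // 2-bit counters, start at 0 (not shared)

        static std::size_t index(uint64_t blockAddr) {
            return (blockAddr >> 7) % kEntries;        // assumes 128-byte cache lines
        }
    public:
        // Predict "resident somewhere in the cluster" in the upper two states.
        bool predictShared(uint64_t blockAddr) const {
            return counters[index(blockAddr)] >= 2;
        }
        // Train with the probe outcome: a neighbor hit strengthens the
        // prediction, a cluster-wide miss weakens it (saturating at 0 and 3).
        void update(uint64_t blockAddr, bool foundInCluster) {
            uint8_t &c = counters[index(blockAddr)];
            if (foundInCluster) { if (c < 3) ++c; }
            else                { if (c > 0) --c; }
        }
    };

    // On an L1 miss, probe the cluster only when the predictor says the
    // block is likely shared; otherwise fall through to L2, avoiding the
    // unnecessary tag lookups that make the serial ISMC scheme costly.
    bool handleL1Miss(SharingPredictor &pred, uint64_t blockAddr,
                      bool (*probeCluster)(uint64_t)) {
        if (pred.predictShared(blockAddr)) {
            bool hit = probeCluster(blockAddr);
            pred.update(blockAddr, hit);
            return hit;                                // serviced SM-to-SM
        }
        return false;                                  // send request to L2
    }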
