Divide and conquer frontend bottleneck

Ansari, A ; Lotfi Kamran, P ; Sarbazi Azad, H ; Sharif University of Technology

Ansari, A ; Sharif University of Technology | 2020

Type of Document: Article
DOI: 10.1109/ISCA45697.2020.00017
Publisher: Institute of Electrical and Electronics Engineers Inc , 2020
Abstract:
Frontend stalls caused by instruction and BTB misses are a significant source of performance degradation in server processors. Prefetchers are commonly employed to mitigate the frontend bottleneck. However, next-line prefetchers, which are available in server processors, are incapable of eliminating a considerable number of L1 instruction misses. Temporal instruction prefetchers, on the other hand, effectively remove most of the instruction and BTB misses but impose significant area overhead. Recently, the old idea of BTB-directed instruction prefetching has been revived to address the limitations of temporal instruction prefetchers. While this approach leads to prefetchers with low area overhead, it requires significant changes to the frontend of a processor. Moreover, as this approach relies on the BTB content for prefetching, BTB misses stall the prefetcher and are likely to lead to costly instruction misses. Because instruction misses are usually more expensive than BTB misses, the dependence of instruction prefetching on the BTB content is especially harmful to workloads with very large instruction footprints. Furthermore, BTB-directed instruction prefetchers, as proposed in prior work, cannot be applied to variable-length ISAs. In this work, we showcase the harmful effects of making instruction prefetchers depend on the BTB content. We then divide the frontend bottleneck into three categories and use a divide-and-conquer approach to propose simple and effective solutions for each one: sequential misses are covered by an accurate and timely sequential prefetcher named SN4L, discontinuity misses are eliminated by a lightweight discontinuity prefetcher named Dis, and BTB misses are reduced by pre-decoding the prefetched blocks. We also discuss how our proposal can be used for variable-length ISAs with low storage overhead. Our proposal, SN4L+Dis+BTB, imposes the same area overhead as the state-of-the-art BTB-directed prefetcher and, at the same time, outperforms it by 5% on average and by up to 16%. © 2020 IEEE
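The distinction the abstract draws between sequential and discontinuity misses, and the reason a plain next-line prefetcher leaves the latter uncovered, can be illustrated with a toy model. The sketch below is not the paper's SN4L or Dis design. It assumes a small fully associative LRU instruction cache, an idealized next-line prefetcher with perfect timeliness, and a working definition in which a miss is "sequential" when the missed block directly follows the previously fetched block. The function name `classify_misses`, its parameters, and the sample trace are all hypothetical.

```python
from collections import OrderedDict


def _insert(cache, block, capacity):
    """Insert a block into the LRU cache, evicting the oldest entries if full."""
    cache[block] = None
    cache.move_to_end(block)
    while len(cache) > capacity:
        cache.popitem(last=False)


def classify_misses(block_trace, cache_blocks=8, next_line=False):
    """Count hits, sequential misses, and discontinuity misses in a trace.

    block_trace  -- iterable of instruction cache-block addresses (integers)
    cache_blocks -- capacity of the toy fully associative LRU cache
    next_line    -- if True, every access also prefetches block + 1
                    (idealized: the prefetch completes instantly)

    Assumption for illustration: a miss is 'sequential' if the missed block
    immediately follows the previously fetched block, else 'discontinuity'.
    """
    cache = OrderedDict()
    counts = {"hit": 0, "sequential": 0, "discontinuity": 0}
    prev = None
    for block in block_trace:
        if block in cache:
            cache.move_to_end(block)  # refresh LRU position on a hit
            counts["hit"] += 1
        else:
            is_seq = prev is not None and block == prev + 1
            counts["sequential" if is_seq else "discontinuity"] += 1
            _insert(cache, block, cache_blocks)
        if next_line:
            _insert(cache, block + 1, cache_blocks)
        prev = block
    return counts


if __name__ == "__main__":
    # Two straight-line runs separated by a taken branch, then a return.
    trace = [10, 11, 12, 40, 41, 10, 11]
    print(classify_misses(trace))
    print(classify_misses(trace, next_line=True))
```

In this toy trace the idealized next-line prefetcher removes every sequential miss, while the two discontinuity misses (the first fetch and the jump to block 40) remain. That residue is the category the abstract assigns to a separate discontinuity prefetcher.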
Keywords:
Frontend bottleneck ; Instruction and BTB prefetching ; Divide and conquer ; Divide-and-conquer approach ; Effective solution ; Instruction prefetching ; Performance degradation ; Server processors ; State of the art ; Three categories ; Computer architecture
Source: 47th ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2020, 30 May 2020 through 3 June 2020 ; Volume 2020-May , 2020 , Pages 65-78
URL: https://ieeexplore.ieee.org/document/9138943