    Proposing a Scalable and Energy-aware Architecture for Register File of GPUs

    Ph.D. Dissertation, Sharif University of Technology. Sadrosadati, Mohammad (Author); Sarbazi-Azad, Hamid (Supervisor)
    Abstract
    Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. In this thesis, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low access latency through a two-level hierarchical structure. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp’s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register...
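
    The two-level idea can be made concrete with a toy timing model. The sketch below is a minimal illustration under our own assumptions (latencies, interval contents, and register numbers are invented, not taken from the thesis): the slow main register file is charged only for prefetching each interval's compiler-estimated working set, after which every register access in that interval hits the fast level.

```cuda
#include <cstdio>
#include <set>
#include <vector>

// Illustrative latencies and structures; not figures from the thesis.
const int MRF_LATENCY = 8;   // large, slow main register file (cycles)
const int RFC_LATENCY = 1;   // small, fast register cache (cycles)

// One compiler-derived interval of a warp's execution.
struct Interval {
    std::set<int> working_set;   // registers the interval is estimated to touch
    std::vector<int> accesses;   // dynamic register accesses inside it
};

int main() {
    // A two-interval toy program (register numbers invented).
    std::vector<Interval> program = {
        {{0, 1, 2}, {0, 1, 0, 2, 1, 0}},
        {{2, 3},    {3, 2, 3, 3}},
    };

    long cycles = 0;
    for (const Interval &iv : program) {
        // At an interval boundary, prefetch the estimated working set
        // from the slow level into the register cache. LTRF's premise
        // is that this bulk transfer can be overlapped with the
        // execution of other warps, hiding most of its latency.
        cycles += (long)iv.working_set.size() * MRF_LATENCY;

        // Every register access within the interval then hits the
        // fast level.
        cycles += (long)iv.accesses.size() * RFC_LATENCY;
    }
    printf("modelled register-access cycles: %ld\n", cycles);
    return 0;
}
```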

    Facilitating Data Exchange Between Streaming Processors in GPUs

    Ph.D. Dissertation, Sharif University of Technology. Nematollahizadeh Mahani, Negin Sadat (Author); Sarbazi-Azad, Hamid (Supervisor)
    Abstract
    GPUs are used today to accelerate a wide range of general-purpose applications (e.g., regular stencil applications and irregular graph-processing applications). Due to the significant growth of thread-level parallelism (TLP) in GPGPU applications, the need for data sharing between different threads has become more apparent. There are a variety of mechanisms for reusing data on the GPU (e.g., on-chip caches and shuffle instructions), each with its own drawbacks and limited scope (e.g., shared memory is restricted to a thread block, and shuffle instructions to a warp). Among standard general-purpose applications, the L1 data cache is a major GPU solution for reusing data among...
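
    The scope limits mentioned above can be seen in plain CUDA. The sketch below is a minimal illustration (kernel names and sizes are ours): shuffle instructions exchange data only between lanes of one 32-thread warp, while shared memory is visible only within one thread block.

```cuda
#include <cstdio>

// Warp-scoped exchange: __shfl_down_sync moves data only between
// lanes of the same 32-thread warp.
__global__ void warp_sum(const float *in, float *out) {
    float val = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if (threadIdx.x == 0) *out = val;   // lane 0 holds the warp total
}

// Block-scoped exchange: shared memory is visible to every thread of
// one thread block, but to no other block.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    buf[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                    // writes become visible block-wide
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) *out = buf[0];
}

int main() {
    const int N = 32;
    float h_in[N], h_out = 0.0f;
    for (int i = 0; i < N; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    warp_sum<<<1, 32>>>(d_in, d_out);       // one warp suffices for shuffles
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum  = %.1f\n", h_out);    // expect 32.0

    block_sum<<<1, 256>>>(d_in, d_out, N);  // one block, shared memory
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("block sum = %.1f\n", h_out);    // expect 32.0

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```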

    ASHA: An adaptive shared-memory sharing architecture for multi-programmed GPUs

    Article, Microprocessors and Microsystems, Volume 46, 2016, Pages 264-273; ISSN 0141-9331. Abbasitabar, H.; Samavatian, M. H.; Sarbazi-Azad, H.; Sharif University of Technology
    Elsevier B.V., 2016
    Abstract
    Spatial multi-programming is one of the most efficient multi-programming methods on Graphics Processing Units (GPUs). This scheme produces diverse resource requirements across streaming multiprocessors (SMs) and creates opportunities for sharing the unused portion of each SM's resources with other SMs. Although this approach drastically improves GPU performance, in some cases it leads to performance degradation due to the shortage of resources allocated to each program. Considering shared memory as one of the main bottlenecks of thread-level parallelism (TLP), in this paper we propose an adaptive shared-memory sharing architecture, called ASHA. ASHA enhances spatial...
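
    A rough sketch of the kind of allocation decision such an architecture faces is given below; the two-pass lend-out policy and all sizes are our own illustrative guesses, not the mechanism from the paper. SMs with surplus shared memory cover the deficits of over-subscribed SMs, which is the opportunity spatial multi-programming creates.

```cuda
#include <cstdio>
#include <vector>

// Hypothetical per-SM state under spatial multi-programming.
struct SM {
    int demand_kb;    // shared memory the resident program wants
    int local_kb;     // physical shared memory on this SM
    int granted_kb;   // what it ends up with after sharing
};

int main() {
    // Each SM hosts a different program with different shared-memory
    // pressure (all sizes invented for illustration).
    std::vector<SM> sms = {
        {16, 48, 0}, {96, 48, 0}, {8, 48, 0}, {72, 48, 0},
    };

    // Pass 1: satisfy each SM locally and count leftover capacity.
    int surplus = 0;
    for (SM &sm : sms) {
        sm.granted_kb = sm.demand_kb < sm.local_kb ? sm.demand_kb
                                                   : sm.local_kb;
        surplus += sm.local_kb - sm.granted_kb;
    }

    // Pass 2: lend unused capacity to over-subscribed SMs.
    for (SM &sm : sms) {
        int deficit = sm.demand_kb - sm.granted_kb;
        if (deficit > 0 && surplus > 0) {
            int loan = deficit < surplus ? deficit : surplus;
            sm.granted_kb += loan;
            surplus -= loan;
        }
    }

    for (size_t i = 0; i < sms.size(); ++i)
        printf("SM%zu: demand %d KB -> granted %d KB\n",
               i, sms[i].demand_kb, sms[i].granted_kb);
    return 0;
}
```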

    Effective cache bank placement for GPUs

    Article, 20th Design, Automation and Test in Europe (DATE 2017), 27-31 March 2017, Pages 31-36; ISBN 9783981537093. Sadrosadati, M.; Mirhosseini, A.; Roozkhosh, S.; Bakhishi, H.; Sarbazi-Azad, H.; ACM Special Interest Group on Design Automation (ACM SIGDA); Electronic System Design Alliance (ESDA); et al.; European Design and Automation Association (EDAA); European Electronic Chips and Systems Design Initiative (ECSI); IEEE Council on Electronic Design Automation (CEDA); Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc., 2017
    Abstract
    The placement of the Last Level Cache (LLC) banks in the GPU on-chip network can significantly affect the performance of memory-intensive workloads. In this paper, we offer a placement methodology for the LLC banks that maximizes the performance of the on-chip network connecting the LLC banks to the streaming multiprocessors in GPUs. We argue that an efficient placement needs to be derived from a novel metric that accounts for the latency-hiding capability of GPUs through thread-level parallelism. To this end, we propose a throughput-aware metric, called Effective Latency Impact (ELI). Moreover, we define an optimization problem to formulate our placement approach based on the...
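
    Since the abstract does not spell out ELI, the sketch below uses a stand-in cost of our own devising: on a 4x4 mesh it counts only the SM-to-bank network latency that a fixed amount of TLP-based latency hiding cannot cover, and compares two invented bank placements.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>

// Toy comparison of LLC-bank placements on a 4x4 mesh. The cost
// function is a stand-in of our own, loosely "throughput-aware": it
// charges only the latency that warp-level parallelism cannot hide.
// ELI's real definition is in the paper.
const int DIM = 4;       // mesh dimension (invented)
const int HIDABLE = 3;   // cycles assumed hidden by TLP (invented)

// Manhattan hop distance between two tiles of the mesh.
int hops(int a, int b) {
    return abs(a / DIM - b / DIM) + abs(a % DIM - b % DIM);
}

// Exposed (non-hidable) latency summed over all SM-to-bank pairs,
// as if addresses were interleaved evenly across the banks.
long cost(const std::vector<int> &banks) {
    long total = 0;
    for (int sm = 0; sm < DIM * DIM; ++sm)
        for (int b : banks) {
            int exposed = hops(sm, b) - HIDABLE;
            total += exposed > 0 ? exposed : 0;
        }
    return total;
}

int main() {
    // Two invented candidate placements of four banks (tile indices).
    std::vector<std::vector<int>> candidates = {
        {0, 3, 12, 15},   // banks in the corners
        {1, 5, 9, 13},    // banks along one column
    };
    for (const std::vector<int> &c : candidates)
        printf("placement {%d,%d,%d,%d}: exposed latency = %ld\n",
               c[0], c[1], c[2], c[3], cost(c));
    return 0;
}
```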