Search for: processing-elements

    ISP: Using idle SMs in hardware-based prefetching

    , Article Proceedings - 17th CSI International Symposium on Computer Architecture and Digital Systems, CADS 2013 ; October , 2013 , Pages 3-8 ; 9781479905621 (ISBN) Falahati, H ; Abdi, M ; Baniasadi, A ; Hessabi, S ; Computer Society of Iran; IPM ; Sharif University of Technology
    IEEE Computer Society  2013
    Abstract
    The Graphics Processing Unit (GPU) is the most promising candidate platform for a faster rate of improvement in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general purpose computing. However, supporting non-graphics computing on graphics processors requires addressing several architectural challenges. In this paper, we focus on improving performance by better hiding the long waiting time for transferring data from the slow global memory. To that end, we study an effective low-overhead prefetching mechanism that utilizes idle processing elements. Our results show that we can potentially improve...

    Power-efficient prefetching on GPGPUs

    , Article Journal of Supercomputing ; Volume 71, Issue 8 , August , 2015 , pp. 2808-2829 ; ISSN: 09208542 Falahati, H ; Hessabi, S ; Abdi, M ; Baniasadi, A ; Sharif University of Technology
    Abstract
    The graphics processing unit (GPU) is the most promising candidate platform for achieving faster improvements in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general purpose computing. However, supporting non-graphics computing on graphics processors requires addressing several architectural challenges. In this paper, we focus on improving performance by better hiding long waiting time for transferring data from the slow global memory. Furthermore, we show that the proposed method can reduce power and energy. Reduction in access time to off-chip data has a noticeable role in reducing... 

    Reduced communications fault tolerant task scheduling algorithm for multiprocessor systems

    , Article Procedia Engineering ; Volume 29 , 2012 , Pages 3820-3825 ; 18777058 (ISSN) Tabbaa, N ; Entezari Maleki, R ; Movaghar, A ; Sharif University of Technology
    Abstract
    Multiprocessor systems have been widely used for the execution of parallel applications. Task scheduling is crucial for the correct operation of multiprocessor systems, where the aim is to shorten the length of schedules. Fault tolerance is becoming a necessary attribute in multiprocessor systems as the number of processing elements grows. This paper presents a fault tolerant scheduling algorithm for task graph applications in multiprocessor systems. The algorithm is an extension of a previously proposed algorithm with a reduced communications scheme. Simulation results show the efficiency of the proposed algorithm despite its simplicity.

    A parallel clustering algorithm on the star graph and its performance

    , Article Mathematical and Computer Modelling ; Volume 58, Issue 3-4 , 2013 , Pages 880-891 ; 08957177 (ISSN) Sarbazi Azad, H ; Zarandi, H. R ; Fazeli, M ; Sharif University of Technology
    Abstract
    In this paper, a parallel algorithm is presented for data clustering on a multicomputer with star topology. This algorithm is fast and requires a small amount of memory per processing element, which makes it even suitable for SIMD implementation. The proposed parallel algorithm completes in O(K + S^2 - T^2) steps for a clustering problem of N data patterns with M features per pattern and K clusters, where S and T are the minimum numbers such that NM ≤ S! and KM ≤ T!, on the S-dimensional star graph.
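
    To make the bound above concrete, the following minimal Python sketch finds the smallest S and T satisfying NM ≤ S! and KM ≤ T! and evaluates the expression K + S^2 - T^2. It only illustrates the arithmetic in the abstract and is not the authors' clustering algorithm. The function names and the sample values of N, M and K are hypothetical, and the returned number is the expression inside the O(...), not an exact step count.

```python
from math import factorial


def min_factorial_dim(x: int) -> int:
    """Return the smallest n >= 1 such that n! >= x."""
    n = 1
    while factorial(n) < x:
        n += 1
    return n


def star_graph_bound(n_patterns: int, m_features: int, k_clusters: int):
    """Return (S, T, K + S^2 - T^2) for the given problem size.

    S is the smallest integer with N*M <= S!, and T the smallest
    with K*M <= T!, as stated in the abstract. Constant factors
    hidden by the O(...) notation are ignored.
    """
    s = min_factorial_dim(n_patterns * m_features)
    t = min_factorial_dim(k_clusters * m_features)
    return s, t, k_clusters + s * s - t * t


if __name__ == "__main__":
    # Hypothetical problem size: 1000 patterns, 5 features, 4 clusters.
    s, t, bound = star_graph_bound(1000, 5, 4)
    print(f"S={s}, T={t}, K + S^2 - T^2 = {bound}")
```

    For this sample input, NM = 5000 needs S = 7 (since 6! = 720 and 7! = 5040) and KM = 20 needs T = 4 (since 3! = 6 and 4! = 24), so the expression evaluates to 4 + 49 - 16 = 37.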

    Parallel clustering on the star graph

    , Article 6th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP, Melbourne, 2 October 2005 through 3 October 2005 ; Volume 3719 LNCS , 2005 , Pages 287-292 ; 03029743 (ISSN); 3540292357 (ISBN); 9783540292357 (ISBN) Fazeli, M ; Sarbazi Azad, H ; Farivar, R ; Sharif University of Technology
    2005
    Abstract
    In this paper, a parallel algorithm for data clustering is presented on a multi-computer with star topology. This algorithm is fast and requires a small amount of memory per processing element, which makes it even suitable for SIMD implementation. The proposed parallel algorithm completes in O(K + S^2 - T^2) steps for a clustering problem of N data patterns with M features per pattern and K clusters, where N.M = S!, K.M = T!, and M = R!, on an S-star interconnection network. © Springer-Verlag Berlin Heidelberg 2005

    Accelerating Perfect and Imperfect Loops Using Reconfigurable Architectures

    , M.Sc. Thesis Sharif University of Technology Tanhaee, Effat (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    With the widespread use of mobile, multimedia and telecommunications applications, speed of execution has become important. The computation-intensive portions of applications, i.e., loops, account for a significant percentage of their execution time. Thus, in this thesis, a new method is introduced which greatly increases the execution speed of loops. Loops are often mapped onto coarse-grained reconfigurable architectures (CGRAs) for acceleration; CGRAs are a promising class of architecture offering high performance and high power efficiency in comparison to FPGAs. In this regard, to reduce the execution time of two-level nested loops, if there are several innermost loops, first, we fuse them, then...

    Temperature control in three-network on chips using task migration

    , Article IET Computers and Digital Techniques ; Vol. 7, issue. 6 , November , 2013 , pp. 274-281 ; 1751-861X (online) Hassanpour, N ; Hessabi, H ; Hamedani, P. K ; Sharif University of Technology
    Abstract
    Combination of three-dimensional (3D) IC technology and network on chip (NoC) is an effective solution to increase system scalability and also alleviate the interconnect problem in large-scale integrated circuits. However, because of the increased power density in 3D NoC systems and the destructive effect of high temperatures on chip reliability, applying thermal management solutions becomes crucial in such circuits. In this study, the authors propose a runtime distributed migration algorithm based on game theory to balance the heat dissipation among processing elements (PEs) in a 3D NoC chip multiprocessor. The objective of this algorithm is to minimise the 3D NoC system's peak temperature,... 

    A novel algorithm in a linear phased array system for side lobe and grating lobe level reduction with large element spacing

    , Article Analog Integrated Circuits and Signal Processing ; Volume 104, Issue 3 , 13 March , 2020 , Pages 265-275 Khalilpour, J ; Ranjbar, J ; Karami, P ; Sharif University of Technology
    Springer  2020
    Abstract
    Phased array antennas are generally used for their inherent flexibility in electronic beamforming and null-steering. In phased arrays, the side lobe level (SLL) is the main problem: it wastes energy or saturates the receiver in the presence of strong spatial blockers. In this paper, a weighting method was first used to reduce the SLL. However, this method increased the beam width and reduced resolution, which is not suitable for tracking applications. In the next step, to increase the resolution, the distance between the antennas was increased. But in this way, grating lobes appeared in the final beam. In fact, the main idea of the article is to solve...

    Schedule swapping: A technique for temperature management of distributed embedded systems

    , Article Proceedings - IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2010, 11 December 2010 through 13 December 2010, Hong Kong ; 2010 , Pages 1-6 ; 9780769543222 (ISBN) Samie Ghahfarokhi, F ; Ejlali, A ; Sharif University of Technology
    2010
    Abstract
    A distributed embedded system consists of different processing elements (PEs) communicating via communication links. PEs have various power characteristics and, in turn, different thermal profiles. With new technologies, processor power density has increased dramatically, which results in high temperatures. This alarming trend underscores the importance of temperature management methods in system design. The majority of techniques proposed to address thermal issues impose severe penalties on performance and reliability. We present Schedule Swapping, a technique for reducing peak temperature in distributed embedded systems while satisfying real-time constraints. Contrary to many other...

    NoC design methodologies for heterogeneous architecture

    , Article Proceedings - 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2020, 11 March 2020 through 13 March 2020 ; 2020 , Pages 299-306 Alhubail, L ; Jasemi, M ; Bagherzadeh, N ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2020
    Abstract
    Fused CPU-GPU architectures that utilize the powerful features of both processors are common nowadays. Using a homogeneous interconnect for such heterogeneous processors can result in performance degradation and increased power. This paper explores the optimization of heterogeneous NoC design to connect a heterogeneous CPU-GPU architecture in terms of NoC performance and power. This involves solving four different NoC design sub-problems simultaneously: processing element (PE) mapping, buffer size and virtual channel assignment, and link bandwidth determination. Heuristic-based optimization methods were proposed to obtain a near-optimal heterogeneous NoC design, and formal models were used...