Search for: graphics-processing-unit
Total 34 records

    A method for real-time safe navigation in noisy environments

    , Article 2013 18th International Conference on Methods and Models in Automation and Robotics, MMAR 2013, Miedzyzdroje ; 2013 , Pages 329-333 ; 9781467355063 (ISBN) Neyshabouri, S. A. S ; Kamali, E ; Niknezhad, M. R ; Monfared, S. S. M. S ; Sharif University of Technology
    2013
    Abstract
    The challenge of finding an optimized and reliable path dates back to the emergence of mobile robots. Several approaches have been developed that have partially answered this need. Satisfactory results in previous implementations have led to increased use of sampling-based motion planning algorithms in recent years, especially for high degree-of-freedom (DOF) systems in fast-evolving environments. Another advantage of these algorithms is their probabilistic completeness, which guarantees delivery of a path in sufficient time, if one exists. On the other hand, sampling-based motion planners say nothing about the safety of the planned path. This paper suggests biasing the Rapidly-exploring Random Trees... 
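    The safety-biasing idea can be illustrated with a very small sketch: a plain 2D RRT whose sampler rejects candidate points that come too close to (circular) obstacles. This is a generic, hedged illustration of biasing RRT growth toward safe regions, not the paper's actual algorithm; the obstacle model, step size and clearance threshold are all assumptions.

```cpp
// Minimal 2D RRT with a clearance-biased sampler (illustrative sketch only).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

struct Pt { double x, y; };
struct Circle { Pt c; double r; };            // circular obstacle

static double dist(Pt a, Pt b) { return std::hypot(a.x - b.x, a.y - b.y); }

// Clearance = distance to the nearest obstacle boundary (negative if inside).
static double clearance(Pt p, const std::vector<Circle>& obs) {
    double best = 1e9;
    for (const auto& o : obs) best = std::min(best, dist(p, o.c) - o.r);
    return best;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    std::vector<Circle> obstacles = {{{0.5, 0.5}, 0.15}, {{0.3, 0.8}, 0.1}};
    const Pt start{0.05, 0.05}, goal{0.95, 0.95};
    const double step = 0.05, goalTol = 0.05, minClearance = 0.05;

    std::vector<Pt> nodes = {start};
    std::vector<int> parent = {-1};

    for (int it = 0; it < 20000; ++it) {
        // Safety bias: resample until the point keeps some clearance from obstacles.
        Pt sample;
        do { sample = {uni(rng), uni(rng)}; } while (clearance(sample, obstacles) < minClearance);

        // Nearest existing tree node.
        int near = 0;
        for (size_t i = 1; i < nodes.size(); ++i)
            if (dist(nodes[i], sample) < dist(nodes[near], sample)) near = (int)i;

        // Steer a fixed step toward the sample, keeping the new node in safe space.
        double d = dist(nodes[near], sample);
        if (d < 1e-9) continue;
        Pt fresh{nodes[near].x + step * (sample.x - nodes[near].x) / d,
                 nodes[near].y + step * (sample.y - nodes[near].y) / d};
        if (clearance(fresh, obstacles) < minClearance) continue;

        nodes.push_back(fresh);
        parent.push_back(near);

        if (dist(fresh, goal) < goalTol) {        // reached: walk parents back to the start
            printf("path found after %zu nodes\n", nodes.size());
            for (int i = (int)nodes.size() - 1; i != -1; i = parent[i])
                printf("(%.2f, %.2f)\n", nodes[i].x, nodes[i].y);
            return 0;
        }
    }
    printf("no path found\n");
    return 0;
}
```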

    Modeling the effect of process variations on the delay and power of the digital circuit using fast simulators

    , Article 2013 21st Iranian Conference on Electrical Engineering, ICEE 2013 ; 2013 , 14-16 May ; 9781467356343 (ISBN) Amirsoleimani, A ; Soleimani, H ; Ahmadi, A ; Bavandpour, M ; Zwolinski, M ; Sharif University of Technology
    2013
    Abstract
    Process variation has an increasingly dramatic effect on delay and power as process geometries shrink. Even if the amount of variation remains the same as in previous generations, it accounts for a greater percentage of process geometries as they get smaller. An accurate prediction of path delay and power variability for real digital circuits in current technologies is therefore very important; its main drawback, however, is the high runtime cost. In this paper, we present a new fast EDA tool which accelerates Monte Carlo-based statistical static timing analysis (SSTA) for complex digital circuits. Parallel platforms such as the Message Passing Interface (MPI), POSIX Threads and the GPU-based CUDA... 
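    As a rough illustration of the kind of workload being accelerated, the sketch below runs a Monte Carlo estimate of one path's delay on the GPU, with each CUDA thread drawing an independent sample of Gaussian per-gate delay variation. It is not the paper's EDA tool: the gate count, nominal delays and sigmas are invented, and a real SSTA flow would model correlations and whole netlists.

```cpp
// Monte Carlo sampling of one path's delay under Gaussian gate-delay variation.
// One CUDA thread draws one Monte Carlo sample (illustrative sketch only).
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void mcPathDelay(const float* nominal, const float* sigma, int gates,
                            float* samples, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState st;
    curand_init(seed, i, 0, &st);
    float delay = 0.0f;
    for (int g = 0; g < gates; ++g)               // sum per-gate delays along the path
        delay += nominal[g] + sigma[g] * curand_normal(&st);
    samples[i] = delay;
}

int main() {
    const int gates = 8, n = 1 << 20;
    std::vector<float> hNom(gates, 10.0f), hSig(gates, 1.0f);   // made-up values, in ps

    float *dNom, *dSig, *dSamples;
    cudaMalloc(&dNom, gates * sizeof(float));
    cudaMalloc(&dSig, gates * sizeof(float));
    cudaMalloc(&dSamples, n * sizeof(float));
    cudaMemcpy(dNom, hNom.data(), gates * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dSig, hSig.data(), gates * sizeof(float), cudaMemcpyHostToDevice);

    mcPathDelay<<<(n + 255) / 256, 256>>>(dNom, dSig, gates, dSamples, n, 1234ULL);

    std::vector<float> h(n);
    cudaMemcpy(h.data(), dSamples, n * sizeof(float), cudaMemcpyDeviceToHost);
    double sum = 0.0, sq = 0.0;
    for (float v : h) { sum += v; sq += double(v) * v; }
    double mean = sum / n, var = sq / n - mean * mean;
    printf("path delay: mean %.2f ps, sigma %.2f ps\n", mean, std::sqrt(var));
    cudaFree(dNom); cudaFree(dSig); cudaFree(dSamples);
    return 0;
}
```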

    A GPU based simulation platform for adaptive frequency Hopf oscillators

    , Article ICEE 2012 - 20th Iranian Conference on Electrical Engineering ; 2012 , Pages 884-888 ; 9781467311489 (ISBN) Soleimani, H ; Maleki, M. A ; Ahmadi, A ; Bavandpour, M ; Maharatna, K ; Zwolinski, M ; Sharif University of Technology
    2012
    Abstract
    In this paper we demonstrate a dynamical system simulator that runs on a single GPU. The model (running on an NVIDIA GT325M with 1 GB of memory) is up to 50 times faster than a CPU version when more than 10 million adaptive Hopf oscillators are simulated. The simulation shows that the oscillators tune to the correct frequencies for both discrete and continuous spectra. Due to its dynamic nature, the system is also capable of tracking non-stationary spectra. With the help of this model, the frequency spectrum of an ECG signal (as a non-stationary signal) was obtained, and it was shown that the frequency-domain representation of the signal (i.e., its FFT) is the same as the one MATLAB generates  
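    A minimal sketch of what such a simulator computes is shown below: each CUDA thread integrates one adaptive-frequency Hopf oscillator with explicit Euler steps while being driven by a sinusoidal teaching signal, so its frequency state drifts toward the input frequency. Feeding every oscillator the raw teaching signal (rather than the residual used in the full adaptive-Hopf learning scheme), as well as all the constants, are assumptions made only for illustration.

```cpp
// Euler integration of many adaptive-frequency Hopf oscillators, one per CUDA thread.
// Simplified relative to the full learning scheme: every oscillator sees the raw input.
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void adaptiveHopf(float* x, float* y, float* omega, int n,
                             float gamma, float mu, float eps,
                             float dt, int steps, float teachFreq) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i], yi = y[i], wi = omega[i];
    for (int s = 0; s < steps; ++s) {
        float t  = s * dt;
        float F  = sinf(teachFreq * t);                 // teaching signal
        float r2 = xi * xi + yi * yi;
        float r  = sqrtf(r2) + 1e-9f;
        float dx = gamma * (mu - r2) * xi - wi * yi + eps * F;
        float dy = gamma * (mu - r2) * yi + wi * xi;
        float dw = -eps * F * yi / r;                   // frequency adaptation term
        xi += dt * dx;  yi += dt * dy;  wi += dt * dw;
    }
    x[i] = xi; y[i] = yi; omega[i] = wi;
}

int main() {
    const int n = 1 << 16;
    std::vector<float> hx(n, 1.0f), hy(n, 0.0f), hw(n);
    for (int i = 0; i < n; ++i) hw[i] = 5.0f + 0.001f * i;   // spread of initial frequencies

    float *dx, *dy, *dw;
    cudaMalloc(&dx, n * sizeof(float)); cudaMalloc(&dy, n * sizeof(float)); cudaMalloc(&dw, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dw, hw.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    adaptiveHopf<<<(n + 255) / 256, 256>>>(dx, dy, dw, n, 8.0f, 1.0f, 0.9f, 0.001f, 200000, 30.0f);

    cudaMemcpy(hw.data(), dw, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("oscillator 0: omega after integration = %.2f rad/s (teaching frequency 30 rad/s)\n", hw[0]);
    cudaFree(dx); cudaFree(dy); cudaFree(dw);
    return 0;
}
```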

    GPU implementation of split-field finite difference time-domain method for Drude-Lorentz dispersive media

    , Article Progress in Electromagnetics Research ; Volume 125 , 2012 , Pages 55-77 ; 10704698 (ISSN) Shahmansouri, A ; Rashidian, B ; Sharif University of Technology
    2012
    Abstract
    The split-field finite-difference time-domain (SF-FDTD) method can overcome the limitation of ordinary FDTD in analyzing periodic structures under oblique incidence. On the other hand, the huge run time of 3D SF-FDTD is a major practical burden in its use for the analysis and design of nanostructures, particularly when dispersive media are involved. Here, details of a parallel implementation of the 3D SF-FDTD method for dispersive media, combined with the total-field/scattered-field (TF/SF) method for injecting an oblique plane wave, are discussed. A graphics processing unit (GPU) has been used for this purpose, and very large speed-up factors have been achieved. Also a previously reported formulation of SF-FDTD based... 
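    The split-field and dispersive-media machinery is beyond a short sketch, but the way FDTD maps onto a GPU can be shown with an ordinary (non-split-field, non-dispersive) 1D Yee update, where each thread updates one grid cell per kernel launch. The grid size, Courant factor and hard Gaussian source are illustrative assumptions, not the paper's setup.

```cpp
// Standard 1D Yee FDTD update kernels, only to illustrate how per-cell field
// updates map onto GPU threads (not split-field, not dispersive).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void updateH(float* hy, const float* ez, int n, float ch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1) hy[i] += ch * (ez[i + 1] - ez[i]);
}

__global__ void updateE(float* ez, const float* hy, int n, float ce) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) ez[i] += ce * (hy[i] - hy[i - 1]);
}

int main() {
    const int n = 4096, steps = 2000, src = n / 2;
    float *ez, *hy;
    cudaMalloc(&ez, n * sizeof(float)); cudaMalloc(&hy, n * sizeof(float));
    cudaMemset(ez, 0, n * sizeof(float)); cudaMemset(hy, 0, n * sizeof(float));

    const float courant = 0.5f;                       // normalized update coefficient
    dim3 block(256), grid((n + 255) / 256);
    for (int t = 0; t < steps; ++t) {
        updateH<<<grid, block>>>(hy, ez, n, courant);
        updateE<<<grid, block>>>(ez, hy, n, courant);
        // Hard Gaussian source injected at the midpoint.
        float pulse = std::exp(-((t - 30.0f) * (t - 30.0f)) / 100.0f);
        cudaMemcpy(ez + src, &pulse, sizeof(float), cudaMemcpyHostToDevice);
    }

    std::vector<float> h(n);
    cudaMemcpy(h.data(), ez, n * sizeof(float), cudaMemcpyDeviceToHost);
    float peak = 0.0f;
    for (float v : h) peak = std::max(peak, std::fabs(v));
    printf("max |Ez| after %d steps: %g\n", steps, peak);
    cudaFree(ez); cudaFree(hy);
    return 0;
}
```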

    Adaptive sparse matrix representation for efficient matrix–vector multiplication

    , Article Journal of Supercomputing ; November , 2015 , Pages 1-21 ; 09208542 (ISSN) Zardoshti, P ; Khunjush, F ; Sarbazi Azad, H ; Sharif University of Technology
    Springer New York LLC  2015
    Abstract
    A wide range of applications in engineering and scientific computing are based on sparse matrix computation. There exist a variety of data representations to keep the non-zero elements in sparse matrices, and each representation favors some matrices while not working well for others. Existing studies tend to process all types of applications, e.g., the most popular one, matrix–vector multiplication, with different sparse matrix structures using a fixed representation. While Graphics Processing Units (GPUs) have evolved into a very attractive platform for general-purpose computations, most of the existing works on sparse matrix–vector multiplication (SpMV, for... 
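    For context, the simplest of the representations such adaptive schemes choose between is CSR with one thread per row, sketched below; the tiny 3x3 matrix is only there to make the kernel runnable end to end, and this is not the paper's adaptive method.

```cpp
// CSR sparse matrix-vector multiply, one CUDA thread per row.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void spmvCsr(int rows, const int* rowPtr, const int* colIdx,
                        const float* vals, const float* x, float* y) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float acc = 0.0f;
    for (int j = rowPtr[r]; j < rowPtr[r + 1]; ++j)   // walk the non-zeros of row r
        acc += vals[j] * x[colIdx[j]];
    y[r] = acc;
}

int main() {
    // Tiny 3x3 example:  [10 0 2; 0 3 0; 1 0 4]
    std::vector<int>   rowPtr = {0, 2, 3, 5};
    std::vector<int>   colIdx = {0, 2, 1, 0, 2};
    std::vector<float> vals   = {10, 2, 3, 1, 4};
    std::vector<float> x      = {1, 1, 1};
    const int rows = 3, nnz = 5;

    int *dRp, *dCi; float *dV, *dX, *dY;
    cudaMalloc(&dRp, (rows + 1) * sizeof(int)); cudaMalloc(&dCi, nnz * sizeof(int));
    cudaMalloc(&dV, nnz * sizeof(float)); cudaMalloc(&dX, rows * sizeof(float)); cudaMalloc(&dY, rows * sizeof(float));
    cudaMemcpy(dRp, rowPtr.data(), (rows + 1) * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dCi, colIdx.data(), nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, vals.data(), nnz * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, x.data(), rows * sizeof(float), cudaMemcpyHostToDevice);

    spmvCsr<<<1, 32>>>(rows, dRp, dCi, dV, dX, dY);

    std::vector<float> y(rows);
    cudaMemcpy(y.data(), dY, rows * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);     // expect [12 3 5]
    cudaFree(dRp); cudaFree(dCi); cudaFree(dV); cudaFree(dX); cudaFree(dY);
    return 0;
}
```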

    Power Reduction in GPUs through Intra-Warp Instruction Execution Reordering

    , M.Sc. Thesis Sharif University of Technology Aghilinasab, Homa (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    As technology shrinks, static power consumption is getting worse. Moreover, considering the heavy use of General-Purpose Graphics Processing Units (GPGPUs), reducing their static power is becoming an important issue. Execution units are among the most power-hungry units in GPGPUs and play an essential role in their total power consumption. On the other hand, power gating is a promising method to reduce static power consumption. In this project, we propose a novel method to implement power gating for execution units with negligible performance and power overheads. We utilize out-of-order execution within a warp to keep the power-gated resources in the off state for more than... 

    Cluster-based approach for improving graphics processing unit performance by inter streaming multiprocessors locality

    , Article IET Computers and Digital Techniques ; Volume 9, Issue 5 , August , 2015 , Pages 275-282 ; 17518601 (ISSN) Keshtegar, M. M ; Falahati, H ; Hessabi, S ; Sharif University of Technology
    Institution of Engineering and Technology  2015
    Abstract
    As a new platform for high-performance and general-purpose computing, the graphics processing unit (GPU) is one of the most promising candidates for rapid improvement in peak processing speed, low latency and high performance. As GPUs employ multithreading to hide latency, there is a small private data cache in each single-instruction multiple-thread (SIMT) core. Hence, in many applications these cores communicate through the global memory. Access to this public memory takes a long time and consumes a large amount of power. Moreover, the memory bandwidth is limited, which is quite challenging for parallel processing. Memory requests that miss in the last-level cache and are followed by accesses... 

    ASHA: An adaptive shared-memory sharing architecture for multi-programmed GPUs

    , Article Microprocessors and Microsystems ; Volume 46 , 2016 , Pages 264-273 ; 01419331 (ISSN) Abbasitabar, H ; Samavatian, M. H ; Sarbazi Azad, H ; Sharif University of Technology
    Elsevier B.V  2016
    Abstract
    Spatial multi-programming is one of the most efficient multi-programming methods on Graphics Processing Units (GPUs). This multi-programming scheme creates variety in the resource requirements of streaming multiprocessors (SMs) and opens opportunities for sharing unused portions of each SM's resources with other SMs. Although this approach drastically improves GPU performance, in some cases it leads to performance degradation due to the shortage of resources allocated to each program. Considering shared memory as one of the main bottlenecks of thread-level parallelism (TLP), in this paper, we propose an adaptive shared-memory sharing architecture, called ASHA. ASHA enhances spatial... 

    Adaptive sparse matrix representation for efficient matrix–vector multiplication

    , Article Journal of Supercomputing ; Volume 72, Issue 9 , 2016 , Pages 3366-3386 ; 09208542 (ISSN) Zardoshti, P ; Khunjush, F ; Sarbazi Azad, H ; Sharif University of Technology
    Springer New York LLC  2016
    Abstract
    A wide range of applications in engineering and scientific computing are based on sparse matrix computation. There exist a variety of data representations to keep the non-zero elements in sparse matrices, and each representation favors some matrices while not working well for others. Existing studies tend to process all types of applications, e.g., the most popular one, matrix–vector multiplication, with different sparse matrix structures using a fixed representation. While Graphics Processing Units (GPUs) have evolved into a very attractive platform for general-purpose computations, most of the existing works on sparse matrix–vector multiplication (SpMV, for... 

    Improving Performance of GPGPU Considering Reliability Requirements

    , M.Sc. Thesis Sharif University of Technology Motallebi, Maryam (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    In recent years, GPUs have become ideal candidates for processing a variety of high-performance applications. By relying on thousands of concurrent threads and the computational power of large numbers of computing units, GPGPUs provide high efficiency and throughput. To realize the potential computational power of GPGPUs in broader types of applications, we need to apply some modifications. By understanding the features and properties of applications, we can execute them in a more suitable way on GPUs. Therefore, considering applications’ behavior, we define five categories for them. Each category has its own definition, and we change the configuration of the GPU... 

    BiNoCHS: bimodal network-on-chip for CPU-GPU heterogeneous systems

    , Article 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, 19 October 2017 through 20 October 2017 ; 2017 ; 9781450349840 (ISBN) Mirhosseini, A ; Sadrosadati, M ; Soltani, B ; Sarbazi Azad, H ; Wenisch, T. F ; Sharif University of Technology
    Abstract
    CPU-GPU heterogeneous systems are emerging as architectures of choice for high-performance, energy-efficient computing. Designing on-chip interconnects for such systems is challenging; CPUs typically benefit greatly from optimizations that reduce latency, but rarely saturate bandwidth or queueing resources. In contrast, GPUs generate intense traffic that produces local congestion, harming CPU performance. Congestion-optimized interconnects can mitigate this problem through larger virtual and physical channel resources. However, when there is little traffic, such networks become suboptimal due to higher unloaded packet latencies and critical path delays. We argue for a reconfigurable network... 

    Design and Hardware Implementation of Optical Character Recognition

    , M.Sc. Thesis Sharif University of Technology Dezfuli, Sina (Author) ; Hashemi, Matin (Supervisor)
    Abstract
    The objective of OCR systems is to retrieve machine-encoded text from a raster image. Despite the abundance of powerful OCR algorithms for English, there are not many for Farsi. Our proposed algorithm comprises pre-processing, line detection, sub-word detection and segmentation, feature extraction and classification. Furthermore, the hardware implementation and acceleration of this system on a GPGPU are presented. The algorithm was tested on five fonts (Titr, Lotus, Yekan, Koodak and Nazanin), and an average accuracy above 90% was achieved  

    A Reconfigurable and Adaptive Shared-memory Architecture for GPUs

    , M.Sc. Thesis Sharif University of Technology Abbasitabar, Hamed (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    The importance of shared memory (scratchpad memory) in GPGPU programming, the memory size limits of GPGPUs and the influence of shared memory on the overall performance of the GPGPU have led to efforts to optimize its use. Moreover, the design trend of new GPGPUs shows that the ratio of shared memory to processing elements is shrinking. As a result, the limited capacity of shared memory becomes a bottleneck for a GPU to host a high number of thread blocks, limiting the otherwise available thread-level parallelism (TLP). In this thesis we introduce a reconfigurable and adaptive shared-memory architecture for GPGPUs, based on resource sharing, which can be exploited for throughput improvement... 

    3-point RANSAC for fast vision based rotation estimation using GPU technology

    , Article IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, 9 February 2017 ; 2017 , Pages 212-217 ; 9781467397087 (ISBN) Kamran, D ; Manzuri, M. T ; Marjovi, A ; Karimian, M ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2017
    Abstract
    In many sensor fusion algorithms, the vision-based RANdom SAmple Consensus (RANSAC) method is used for estimating motion parameters of autonomous robots. Usually, such algorithms estimate translation and rotation parameters together, which makes them inefficient for rotation-only estimation. This paper presents a novel 3-point RANSAC algorithm for estimating only the rotation parameters between two camera frames, which can be utilized as a high-rate source of information for a camera-IMU sensor fusion system. The main advantage of our proposed approach is that it performs fewer computations and requires fewer iterations to achieve the best result. Despite many... 

    Neda: supporting direct inter-core neighbor data exchange in GPUs

    , Article IEEE Computer Architecture Letters ; Volume 17, Issue 2 , 2018 , Pages 225-229 ; 15566056 (ISSN) Nematollahi, N ; Sadrosadati, M ; Falahati, H ; Barkhordar, M ; Sarbazi Azad, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2018
    Abstract
    Image processing applications employ various filters for several purposes, such as enhancing images and extracting features. Recent studies show that filters take a substantial amount of the execution time of image processing applications, and it is crucial to boost their performance to improve overall application performance. Image processing filters require a significant amount of data sharing among the threads in charge of filtering neighboring pixels. Graphics Processing Units (GPUs) attempt to satisfy this demand for data sharing by providing scratch-pad memory, shuffle instructions, and on-chip caches. However, we observe that these mechanisms... 
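    The data-sharing pattern described here is conventionally coded with a shared-memory tile plus halo, as in the hedged sketch below: a plain 3x3 box filter with clamped borders. The tile size and filter are arbitrary choices for illustration, and this is the baseline software approach rather than the Neda mechanism itself.

```cpp
// 3x3 box filter with the input tile (plus a one-pixel halo) staged in shared
// memory, so neighboring threads share pixel loads instead of refetching globally.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16

__global__ void box3x3(const float* in, float* out, int w, int h) {
    __shared__ float tile[TILE + 2][TILE + 2];
    int bx = blockIdx.x * TILE, by = blockIdx.y * TILE;
    int gx = bx + threadIdx.x,  gy = by + threadIdx.y;

    // Cooperative load of the tile plus halo, clamped at the image borders.
    for (int dy = threadIdx.y; dy < TILE + 2; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2; dx += TILE) {
            int sx = min(max(bx + dx - 1, 0), w - 1);
            int sy = min(max(by + dy - 1, 0), h - 1);
            tile[dy][dx] = in[sy * w + sx];
        }
    __syncthreads();

    if (gx < w && gy < h) {
        float s = 0.0f;
        for (int fy = 0; fy < 3; ++fy)
            for (int fx = 0; fx < 3; ++fx)
                s += tile[threadIdx.y + fy][threadIdx.x + fx];
        out[gy * w + gx] = s / 9.0f;
    }
}

int main() {
    const int w = 256, h = 256;
    std::vector<float> img(w * h, 1.0f);
    float *dIn, *dOut;
    cudaMalloc(&dIn, w * h * sizeof(float)); cudaMalloc(&dOut, w * h * sizeof(float));
    cudaMemcpy(dIn, img.data(), w * h * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    box3x3<<<grid, block>>>(dIn, dOut, w, h);

    cudaMemcpy(img.data(), dOut, w * h * sizeof(float), cudaMemcpyDeviceToHost);
    printf("filtered[0] = %g\n", img[0]);             // a constant image stays 1.0
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```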

    Viterbi Decoder Implementation on GPGPU

    , M.Sc. Thesis Sharif University of Technology Mohammadidoost, Alireza (Author) ; Hashemi, Matin (Supervisor)
    Abstract
    In this project, a method is employed to implement a Viterbi decoder on a GPGPU. The method is based on combining all steps of the algorithm. This combination poses some challenges that stem from the differences between the algorithm's steps. In this project, solutions are found to handle these challenges, and a high-throughput Viterbi decoder is achieved  
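    For reference, the sequential steps that such a GPU implementation fuses are shown in the small CPU sketch below: a hard-decision Viterbi decoder for a rate-1/2, constraint-length-3 convolutional code (generator polynomials 111 and 101). The code parameters and the single injected bit error are illustrative assumptions, not details from the thesis.

```cpp
// CPU reference Viterbi decoder for a rate-1/2, K=3 convolutional code
// (generators 111 and 101), hard-decision Hamming branch metric.
#include <array>
#include <cstdio>
#include <vector>

static const int STATES = 4;                   // state = previous two input bits

static int parity(int v) { v ^= v >> 2; v ^= v >> 1; return v & 1; }

// Encoder step: given the current state and an input bit, return {out0, out1, nextState}.
static std::array<int, 3> step(int state, int bit) {
    int reg = (bit << 2) | state;              // register [b_t, b_{t-1}, b_{t-2}]
    return {parity(reg & 7), parity(reg & 5), (bit << 1) | (state >> 1)};
}

static std::vector<int> encode(const std::vector<int>& bits) {
    std::vector<int> out; int s = 0;
    for (int b : bits) { auto r = step(s, b); out.push_back(r[0]); out.push_back(r[1]); s = r[2]; }
    return out;
}

static std::vector<int> viterbi(const std::vector<int>& rx) {
    const int T = (int)rx.size() / 2, INF = 1 << 20;
    std::vector<int> metric(STATES, INF); metric[0] = 0;
    std::vector<std::vector<int>> survState(T, std::vector<int>(STATES));
    std::vector<std::vector<int>> survBit(T, std::vector<int>(STATES));

    for (int t = 0; t < T; ++t) {              // add-compare-select over the trellis
        std::vector<int> next(STATES, INF);
        for (int s = 0; s < STATES; ++s) {
            if (metric[s] >= INF) continue;
            for (int b = 0; b < 2; ++b) {
                auto r = step(s, b);
                int bm = (r[0] != rx[2 * t]) + (r[1] != rx[2 * t + 1]);
                if (metric[s] + bm < next[r[2]]) {
                    next[r[2]] = metric[s] + bm;
                    survState[t][r[2]] = s;
                    survBit[t][r[2]] = b;
                }
            }
        }
        metric = next;
    }
    int best = 0;                              // trace back from the best final state
    for (int s = 1; s < STATES; ++s) if (metric[s] < metric[best]) best = s;
    std::vector<int> bits(T);
    for (int t = T - 1; t >= 0; --t) { bits[t] = survBit[t][best]; best = survState[t][best]; }
    return bits;
}

int main() {
    std::vector<int> msg = {1, 0, 1, 1, 0, 0, 1, 0, 0, 0};   // trailing zeros flush the encoder
    std::vector<int> coded = encode(msg);
    coded[3] ^= 1;                                           // inject one channel bit error
    std::vector<int> decoded = viterbi(coded);
    printf("decoded matches message: %s\n", decoded == msg ? "yes" : "no");
    return 0;
}
```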

    LTRF: enabling high-capacity register files for GPUs via hardware/software cooperative register prefetching

    , Article 23rd International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, 24 March 2018 through 28 March 2018 ; 2018 , Pages 489-502 ; 9781450349116 (ISBN) Sadrosadati, M ; Mirhosseini, A ; Ehsani, S. B ; Sarbazi Azad, H ; Drumond, M ; Falsafi, B ; Ausavarungnirun, R ; Mutlu, O ; Sharif University of Technology
    Association for Computing Machinery  2018
    Abstract
    Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical... 

    Energy Reduction in GPGPUs

    , Ph.D. Dissertation Sharif University of Technology Falahati, Hajar (Author) ; Hessabi, Shahin (Supervisor) ; Baniasadi, Amirali (Co-Advisor)
    Abstract
    The number of transistors on a single chip is growing exponentially, which results in a huge increase in power consumption and temperature. Parallel processing is a solution which concentrates on increasing the number of cores instead of improving single-thread performance. Graphics Processing Units (GPUs) are parallel accelerators categorized as manycore systems. However, recent research shows that their power and energy consumption is increasing. In this research, we aim to propose methods to make GPGPUs energy efficient. To this end, we evaluated the detailed power consumption of GPGPUs. Our results show that the memory sub-system is a critical bottleneck in terms of performance and... 

    Data Sharing Aware Scheduling for Reducing Memory Accesses in GPGPUs

    , M.Sc. Thesis Sharif University of Technology Saber Latibari, Banafsheh (Author) ; Hesabi, Shahin (Supervisor)
    Abstract
    Access to global memory is one of the bottlenecks for performance and energy in GPUs. Graphics processors use multithreading in streaming multiprocessors to hide memory access latency. However, due to the high number of concurrent memory requests, the bandwidth of the lower-level memories and the interconnection network is quickly saturated. Recent research suggests that adjacent thread blocks share a significant number of data blocks. If adjacent thread blocks are assigned to the same streaming multiprocessor, shared data blocks can be reused by these thread blocks. However, the thread block scheduler assigns adjacent thread blocks to different streaming multiprocessors, which increases... 

    Unifying L1 Data Cache and Shared Memory in GPUs

    , M.Sc. Thesis Sharif University of Technology Yousefzadeh-Asl-Miandoab, Ehsan (Author) ; Sarbazi Azad, Hamid (Supervisor)
    Abstract
    Graphics Processing Units (GPUs) employ a scratch-pad memory (a.k.a. shared memory) in each streaming multiprocessor to accelerate data sharing among the threads in a thread block and to provide a software-managed cache for programmers. However, we observe that about 60% of the GPU workloads in several well-known benchmark suites do not use shared memory. Moreover, among those workloads that do use shared memory, about 42% of the shared memory is not utilized, on average. On the other hand, we observe that many general-purpose GPU applications suffer from the low hit rate and limited bandwidth of the L1 data cache. We aim to use the shared memory space and its corresponding bandwidth to improve the L1 data cache,...