Search for: general-purpose-computations

    Accelerating the Rijndael algorithm using custom instructions capability of Nios II in ODYSSEY

    Article: Proceedings - 2006 International Conference on Design and Test of Integrated Systems in Nanoscale Technology, IEEE DTIS 2006; 2006; Pages 69-73; ISBN 0780397266; ISBN 9780780397262. Iraji, R.; Hessabi, S.; Moghadam, E. K.; Sharif University of Technology
    IEEE Computer Society, 2006
    Abstract
    The ODYSSEY design methodology is an object-oriented design methodology that models a system in terms of its constituent objects and their corresponding method calls. Some of these method calls are implemented in hardware functional units, while others are simply executed by a general-purpose processor. Because the functional units must communicate with each other and with the processor core, there is a communication overhead. In this paper we utilize the custom instruction capability of the Nios II processor to enhance the performance of our ASIP. Since these instructions reside in the processor itself, using them incurs no communication overhead. We analyze the performance of the... 
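    A minimal C sketch of what invoking such a custom instruction can look like is given below. Only the __builtin_custom_* intrinsic family is real Nios II GCC machinery; the opcode AES_COLUMN_OP, the function names, and the single-word Rijndael step they stand for are illustrative assumptions, not the paper's actual instruction set:

    #include <stdint.h>

    #define AES_COLUMN_OP 5   /* hypothetical opcode (0..255); the real value comes from the ASIP design */

    /* Software stand-in for one 32-bit Rijndael column step (illustrative only). */
    static uint32_t aes_column_sw(uint32_t w)
    {
        return (w << 8) | (w >> 24);   /* placeholder rotation, not real AES */
    }

    uint32_t aes_column(uint32_t w)
    {
    #ifdef __nios2__
        /* The operation executes inside the CPU pipeline, so there is no bus
         * traffic to a separate functional unit: the communication saving the
         * paper targets.  __builtin_custom_ini = int result, one int operand;
         * the opcode must be a compile-time constant. */
        return (uint32_t)__builtin_custom_ini(AES_COLUMN_OP, (int)w);
    #else
        return aes_column_sw(w);       /* portable software fallback for host builds */
    #endif
    }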

    Adaptive sparse matrix representation for efficient matrix–vector multiplication

    , Article Journal of Supercomputing ; November , 2015 , Pages 1-21 ; 09208542 (ISSN) Zardoshti, P ; Khunjush, F ; Sarbazi Azad, H ; Sharif University of Technology
    Springer New York LLC  2015
    Abstract
    A wide range of applications in engineering and scientific computing are based on the sparse matrix computation. There exist a variety of data representations to keep the non-zero elements in sparse matrices, and each representation favors some matrices while not working well for some others. The existing studies tend to process all types of applications, e.g., the most popular application which is matrix–vector multiplication, with different sparse matrix structures using a fixed representation. While Graphics Processing Units (GPUs) have evolved into a very attractive platform for general purpose computations, most of the existing works on sparse matrix–vector multiplication (SpMV, for... 

    Adaptive sparse matrix representation for efficient matrix–vector multiplication

    Article: Journal of Supercomputing; Volume 72, Issue 9; 2016; Pages 3366-3386; ISSN 0920-8542. Zardoshti, P.; Khunjush, F.; Sarbazi Azad, H.; Sharif University of Technology
    Springer New York LLC, 2016
    Abstract
    A wide range of applications in engineering and scientific computing are based on sparse matrix computation. A variety of data representations exist for storing the non-zero elements of sparse matrices, and each representation favors some matrices while working poorly for others. Existing studies tend to process all types of applications, e.g., the most popular one, matrix–vector multiplication, over different sparse matrix structures using a single fixed representation. While Graphics Processing Units (GPUs) have evolved into a very attractive platform for general-purpose computations, most of the existing works on sparse matrix–vector multiplication (SpMV, for... 
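    As a concrete reference point, the following self-contained C sketch shows SpMV over the CSR (compressed sparse row) format, one of the standard representations an adaptive scheme like the one above can choose among. The 3x3 matrix and all names are illustrative, and a GPU kernel would parallelize the row loop across threads rather than run it sequentially:

    #include <stdio.h>

    /* CSR keeps only non-zeros: their values, column indices, and per-row offsets. */
    static void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                         const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;   /* on a GPU, each row (or block of rows) maps to threads */
        }
    }

    int main(void)
    {
        /* A = [[4 0 1], [0 3 0], [2 0 5]] stored in CSR form. */
        int    row_ptr[] = {0, 2, 3, 5};
        int    col_idx[] = {0, 2, 1, 0, 2};
        double val[]     = {4, 1, 3, 2, 5};
        double x[]       = {1, 2, 3};
        double y[3];

        spmv_csr(3, row_ptr, col_idx, val, x, y);
        for (int i = 0; i < 3; i++)
            printf("y[%d] = %g\n", i, y[i]);   /* expects 7, 6, 17 */
        return 0;
    }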

    Cluster-based approach for improving graphics processing unit performance by inter streaming multiprocessors locality

    Article: IET Computers and Digital Techniques; Volume 9, Issue 5; August 2015; Pages 275-282; ISSN 1751-8601. Keshtegar, M. M.; Falahati, H.; Hessabi, S.; Sharif University of Technology
    Institution of Engineering and Technology, 2015
    Abstract
    As a new platform for high-performance and general-purpose computing, the graphics processing unit (GPU) is one of the most promising candidates for rapid improvement in peak processing speed, low latency, and high performance. Because GPUs employ multithreading to hide latency, each single instruction multiple thread (SIMT) core has only a small private data cache. Hence, in many applications these cores communicate through the global memory. Access to this shared memory takes a long time and consumes a large amount of power. Moreover, the memory bandwidth is limited, which is quite challenging for parallel processing. The memory requests that miss in the last-level cache and are followed by accesses...
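    To make the latency argument concrete, here is a back-of-the-envelope C model in the style of an average-memory-access-time calculation: if some fraction of last-level-cache misses can be served from a neighbouring SIMT core's cache inside the same cluster instead of from DRAM, the mean access latency drops. Every hit rate and cycle count below is an assumed, illustrative number, not a figure from the paper:

    #include <stdio.h>

    int main(void)
    {
        const double l1_hit   = 0.60;  /* assumed private-cache hit rate        */
        const double llc_hit  = 0.20;  /* assumed last-level-cache hit rate     */
        const double peer_hit = 0.15;  /* assumed fraction of LLC misses found
                                          in a neighbouring SM's cache (cluster) */
        const double t_l1 = 4, t_llc = 40, t_peer = 60, t_dram = 400; /* cycles */

        double miss = 1.0 - l1_hit - llc_hit;

        /* Baseline: every last-level-cache miss pays the global-memory latency. */
        double base = l1_hit * t_l1 + llc_hit * t_llc + miss * t_dram;

        /* Clustered: some misses are served by a peer SM at much lower cost. */
        double clus = l1_hit * t_l1 + llc_hit * t_llc
                    + miss * (peer_hit * t_peer + (1.0 - peer_hit) * t_dram);

        printf("avg latency, baseline : %.1f cycles\n", base);
        printf("avg latency, clustered: %.1f cycles\n", clus);
        return 0;
    }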