Search for: program-processors
Total 61 records
Article 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, 6 April 2003 through 10 April 2003 ; Volume 6 , 2003 , Pages 401-404 ; 15206149 (ISSN) ; Kasaei, S ; Sharif University of Technology
Conventional implementation of a multi-dimensional wavelet transform (e.g., the 3-D wavelet) requires either a large amount of "in access" memory or continual access to the slow memory of a processor, which makes it infeasible for most applications. In this paper, we propose a novel algorithm for computing an n-D discrete wavelet transform (DWT) based on the lifting scheme. In addition to the benefits of the lifting scheme (which causes a major reduction in computational complexity and performs all computations in the time domain), our real-time approach computes the coefficients for all kinds of 1st- and 2nd-generation wavelets with short delay and optimized utilization of the slow and fast memories of a...
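As a point of reference for the lifting scheme named in this abstract, the following is a minimal sketch of one level of a 1-D Haar transform written as predict and update lifting steps. It is not the paper's n-D algorithm; the function names are hypothetical, and the input is assumed to have even length.

```python
def haar_lifting_forward(signal):
    """One level of a 1-D Haar transform via lifting (even-length input assumed)."""
    even = signal[0::2]
    odd = signal[1::2]
    # Predict step: each odd sample is predicted by its even neighbour.
    detail = [o - e for e, o in zip(even, odd)]
    # Update step: adjust the even samples so they hold the pairwise averages.
    approx = [e + d / 2 for e, d in zip(even, detail)]
    return approx, detail


def haar_lifting_inverse(approx, detail):
    """Undo the lifting steps in reverse order to recover the signal."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    signal = []
    for e, o in zip(even, odd):
        signal.extend([e, o])
    return signal
```

Because each lifting step is inverted by flipping its sign, the inverse needs no separate filter design, which is one reason lifting is attractive for in-place computation.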
Article IEEE International Conference on Computer Systems and Applications, 2006, Sharjah, 8 March 2006 through 8 March 2006 ; Volume 2006 , 2006 , Pages 673-679 ; 1424402123 (ISBN); 9781424402120 (ISBN) ; Sarbazi Azad, H ; Sharif University of Technology
IEEE Computer Society 2006
In this paper, we show that by considering the factor of usage in instruction bundles in VLIW processors and using the slots filled with NOPs in bundles, we can improve the overall performance by reducing the total execution time of the program. By our proposed scheme, Combined Bundle Scheduling (CBS), we have gained better performance compared to that for the PDT scheme (Predicted Decision Tree scheduling) which is the best scheduling strategy known so far. © 2006 IEEE
Article 17th 2005 International Conference on Microelectronics, ICM 2005, Islamabad, 13 December 2005 through 15 December 2005 ; Volume 2005 , 2005 , Pages 310-317 ; 0780392620 (ISBN); 9780780392625 (ISBN) ; Hessabi, S ; Goudarzi, M ; Sharif University of Technology
In this paper, we present an MPEG-2 video decoder implemented in our ODYSSEY design methodology. We start with an ASIP tailored to the JPEG decompression algorithm. We extend that ASIP by required software routines such that the extended ASIP can now perform MPEG2 decoding while still benefiting from hardware units common between JPEG and MPEG2. This demonstrates the ability of our approach in extending an already manufactured ASIP, which was tailored to a given application, such that it implements new, yet related applications. The implementation platform is a VirtexII-Pro FPGA. The hardware part is implemented in VHDL, and the software runs on a PowerPC processor. Experimental results show...
Article Journal of Electronic Testing: Theory and Applications (JETTA) ; Volume 24, Issue 1-3 , 2008 , Pages 21-33 ; 09238174 (ISSN) ; Farivar, R ; Miremadi, S. G ; Sharif University of Technology
This paper presents a behavior-based error detection technique called Control Flow Checking using Branch Trace Exceptions for PowerPC processors family (CFCBTE). This technique is based on the branch trace exception feature available in the PowerPC processors family for debugging purposes. This technique traces the target addresses of program branches at run-time and compares them with reference target addresses to detect possible violations caused by transient faults. The reference target addresses are derived by a preprocessor from the source program. To enhance the error detection coverage, three other mechanisms, i.e., Machine Check Exception, System Trap Instructions and Work Load Timer...
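The comparison of traced branch targets against reference targets described in this abstract can be sketched as follows. This is an illustration only, not the CFCBTE implementation: the function name, the trace format, and the addresses are hypothetical, and the reference table is assumed to have been produced beforehand by a preprocessor.

```python
def check_control_flow(trace, reference_targets):
    """Return the first (branch_address, target) pair that violates the
    reference table, or None if every traced branch is legal.

    trace: iterable of (branch_address, target_address) pairs (assumed format).
    reference_targets: dict mapping a branch address to its set of legal targets.
    """
    for branch_addr, target in trace:
        allowed = reference_targets.get(branch_addr)
        # An unknown branch or an unexpected target is reported as an error.
        if allowed is None or target not in allowed:
            return (branch_addr, target)
    return None


# Hypothetical reference table: a conditional branch with two legal targets
# and an unconditional branch with one.
reference = {0x100: {0x104, 0x200}, 0x204: {0x100}}
```

For example, `check_control_flow([(0x100, 0x200), (0x204, 0x100)], reference)` reports no violation, while a trace containing `(0x100, 0x300)` is flagged.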
Article 6th International Conference on Networking, ICN'07, Sainte-Luce, Martinique, 22 April 2007 through 28 April 2007 ; 2007 , Pages 40-45 ; 0769528058 (ISBN); 9780769528052 (ISBN) ; Bidmeshki, M. M ; Miremadi, S. G ; Sharif University of Technology
IEEE Computer Society 2007
When failures occur in computer networks, routing protocols are triggered to update the routing and forwarding tables. Because the tables are invalid during the update, transient loops may occur, and the packet-drop rate and end-to-end delay increase, which means that the quality of service decreases. This paper proposes a parallel architecture for a router to recalculate and update the routing table. Simulation results show that with a dual-processor architecture, this update time can be improved by up to 40%, depending on the network topology and the size of the tables. This paper also studies the effect of this speed-up on networks' performability, i.e., the ability to deliver services at a predefined level. A...
Article 8th International Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN 2005, Las Vegas, NV, 7 December 2005 through 9 December 2005 ; Volume 2005 , 2005 , Pages 40-45 ; 0769525091 (ISBN); 9780769525099 (ISBN) ; Sarbazi Azad, H ; Sharif University of Technology
We study a class of interconnection networks for multiprocessors, called the Necklace-G network that is based on the base graph G by attaching an array of processors to each two adjacent nodes of G. One of the interesting features of the proposed topology is its scalability while preserving most of the desirable properties of the underlying base network G. We conduct a general study on the topological properties of necklace networks. We first obtain their basic topological parameters, and then present optimal routing and broadcasting algorithms. We also present a unified approach to obtain the topological properties and the VLSI-layout of an arbitrary necklace network based on the properties...
Article Proceedings - 17th CSI International Symposium on Computer Architecture and Digital Systems, CADS 2013 ; October , 2013 , Pages 3-8 ; 9781479905621 (ISBN) ; Abdi, M ; Baniasadi, A ; Hessabi, S ; Computer Society of Iran; IPM ; Sharif University of Technology
IEEE Computer Society 2013
The Graphics Processing Unit (GPU) is the most promising candidate platform for a faster rate of improvement in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general-purpose computing. However, supporting non-graphics computing on graphics processors requires addressing several architectural challenges. In this paper, we focus on improving performance by better hiding the long waiting time for data transfers from the slow global memory. To this end, we study an effective low-overhead prefetching mechanism that utilizes idle processing elements. Our results show that we can potentially improve...
Article Microprocessors and Microsystems ; Volume 46 , 2016 , Pages 264-273 ; 01419331 (ISSN) ; Samavatian, M. H ; Sarbazi Azad, H ; Sharif University of Technology
Elsevier B.V 2016
Spatial multi-programming is one of the most efficient multi-programming methods on Graphics Processing Units (GPUs). This multi-programming scheme generates variety in the resource requirements of stream multiprocessors (SMs) and creates opportunities for sharing unused portions of each SM resource with other SMs. Although this approach drastically improves GPU performance, in some cases it leads to performance degradation due to a shortage of the resources allocated to each program. Considering shared memory as one of the main bottlenecks of thread-level parallelism (TLP), in this paper, we propose an adaptive shared-memory sharing architecture, called ASHA. ASHA enhances spatial...
LEXACT: low energy n-modular redundancy using approximate computing for real-time multicore processors, Article IEEE Transactions on Emerging Topics in Computing ; 2017 ; 21686750 (ISSN) ; Ghassem Miremadi, S ; Sharif University of Technology
Multicore processors are becoming popular in safety-critical applications. A number of these applications comprise kernels in which inexact computations may still produce results of sufficient quality, but for which reliability should stay at the maximum possible level. Intrinsic core-level redundancy in multicore processors can be leveraged to achieve the desired reliability level in the form of N-modular redundancy (NMR). While NMR provides a proactive means of reliability for critical systems, it has two main drawbacks: increases in area and energy consumption, both of which are limiting factors in embedded systems. This paper presents a software-based method to...
Article 20th Design, Automation and Test in Europe, DATE 2017, 27 March 2017 through 31 March 2017 ; 2017 , Pages 31-36 ; 9783981537093 (ISBN) ; Mirhosseini, A ; Roozkhosh, S ; Bakhishi, H ; Sarbazi Azad, H ; ACM Special Interest Group on Design Automation (ACM SIGDA); Electronic System Design Alliance (ESDA); et al.; European Design and Automation Association (EDAA); European Electronic Chips and Systems Design Initiative (ECSI); IEEE Council on Electronic Design Automation (CEDA) ; Sharif University of Technology
Institute of Electrical and Electronics Engineers Inc 2017
The placement of the Last Level Cache (LLC) banks in the GPU on-chip network can significantly affect the performance of memory-intensive workloads. In this paper, we attempt to offer a placement methodology for the LLC banks to maximize the performance of the on-chip network connecting the LLC banks to the streaming multiprocessors in GPUs. We argue that an efficient placement needs to be derived based on a novel metric that considers the latency hiding capability of the GPUs through thread level parallelism. To this end, we propose a throughput aware metric, called Effective Latency Impact (ELI). Moreover, we define an optimization problem to formulate our placement approach based on the...
LTRF: enabling high-capacity register files for GPUs via hardware/software cooperative register prefetching, Article 23rd International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, 24 March 2018 through 28 March 2018 ; 2018 , Pages 489-502 ; 9781450349116 (ISBN) ; Mirhosseini, A ; Ehsani, S. B ; Sarbazi Azad, H ; Drumond, M ; Falsafi, B ; Ausavarungnirun, R ; Mutlu, O ; Sharif University of Technology
Association for Computing Machinery 2018
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical...
Article SPAA'07: 19th Annual Symposium on Parallelism in Algorithms and Architectures, San Diego, CA, 9 June 2007 through 11 June 2007 ; 2007 , Pages 46-54 ; 159593667X (ISBN); 9781595936677 (ISBN) ; Ghodsi, M ; Hajiaghayi, M. T ; Sayedi Roshkhar, A. S ; Zadimoghaddam, M ; Sharif University of Technology
This paper considers scheduling tasks while minimizing the power consumption of one or more processors, each of which can go to sleep at a fixed cost α. There are two natural versions of this problem, both considered extensively in recent work: minimize the total power consumption (including computation time), or minimize the number of "gaps" in execution. For both versions in a multiprocessor system, we develop a polynomial-time algorithm based on sophisticated dynamic programming. In a generalization of the power-saving problem, where each task can execute in any of a specified set of time intervals, we develop a (1 + (2/3)α)-approximation, and show that dependence on α is necessary....
Article 17th Great Lakes Symposium on VLSI, GLSVLSI'07, Stresa-Lago Maggiore, 11 March 2007 through 13 March 2007 ; 2007 , Pages 329-334 ; 159593605X (ISBN); 9781595936059 (ISBN) ; Najafvand, M ; Hessabi, S ; Goudarzi, M ; Sharif University of Technology
In this paper, we present a JPEG decoder implemented in our ODYSSEY design methodology. We start with an object-oriented JPEG decoder model. The total operation from modeling to implementation is done automatically by our EDA tool-set in about 10 hours. The resultant system is a JPEG decoder ASIP whose hardware part is implemented on FPGA logic blocks and software part runs on a MicroBlaze processor. This ASIP can be extended by software routines to implement the motion JPEG or MPEG2 decoding algorithms. We implemented our system on ML402 FPGA-based prototype board. Experimental results show that our ASIP implementation is comparable to other approaches while our approach enables quick and...
Article 2006 Canadian Conference on Electrical and Computer Engineering, CCECE'06, Ottawa, ON, 7 May 2006 through 10 May 2006 ; 2006 , Pages 959-962 ; 08407789 (ISSN); 1424400384 (ISBN); 9781424400386 (ISBN) ; Hessabi, S ; Goudarzi, M ; Sharif University of Technology
Institute of Electrical and Electronics Engineers Inc 2006
A reconfigurable cache architecture for object-oriented application-specific instruction set processors (ASIP) is presented in this paper. The embedded ASIPs we follow in this research are specifically designed to suit object-oriented applications and are synthesized from an object-oriented high-level specification. The ASIPs are composed of a processor core along with a number of hardware functional units. In order to support concurrent execution of the functional units, we propose a cache architecture which is virtually divided into a number of partitions. The partition sizes can be dynamically changed depending on the run-time behavior of the application. Partitioning the cache not only...
Article Microelectronics Reliability ; Volume 46, Issue 1 , 2006 , Pages 124-133 ; 00262714 (ISSN) ; Miremadi, S. G ; Sharif University of Technology
This paper presents a software-based error detection scheme called enhanced committed instructions counting (ECIC) for embedded and real-time systems using commercial off-the-shelf (COTS) processors. The scheme uses the internal performance monitoring features of a processor, which provides the ability to count the number of committed instructions in a program. To evaluate the ECIC scheme, 6000 software induced faults are injected into a 32-bit Pentium® processor. The results show that the error detection coverage varies between 90.52% and 98.18%, for different workloads. © 2004 Elsevier Ltd. All rights reserved
Article Third IEEE International Workshop on Electronic Design, Test and Applications, DELTA 2006, Kuala Lumpur, 17 January 2006 through 19 January 2006 ; Volume 2006 , 2006 , Pages 249-254 ; 0769525008 (ISBN); 9780769525006 (ISBN) ; Hessabi, S ; Gudarzi, M ; Sharif University of Technology
A table-based implementation of an application-specific data prefetching approach is presented in this paper. This approach is proposed to improve the performance of the application-specific instruction-set processors (ASIP) that we develop, customized to an object-oriented application. In this approach, the cache controller prefetches all data fields of an object required by a class method when the class method is invoked. In the proposed table-based implementation, the cache controller monitors the class method calls and records the index of the object data members that each method accessed. This information is used to prefetch the data items needed by a class method on next invocations of that...
Article ACM Transactions on Computer Systems ; Volume 37, Issue 1-4 , 2021 ; 07342071 (ISSN) ; Mirhosseini, A ; Hajiabadi, A ; Ehsani, S. B ; Falahati, H ; Sarbazi Azad, H ; Drumond, M ; Falsafi, B ; Ausavarungnirun, R ; Mutlu, O ; Sharif University of Technology
Association for Computing Machinery 2021
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce the register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical...
Article 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, DFT 2005, Monterey, CA, 3 October 2005 through 5 October 2005 ; 2005 , Pages 266-274 ; 15505774 (ISSN) ; Farivar, R ; Miremadi, S. G ; Aitken R ; Ito H ; Metra C ; Park N ; Sharif University of Technology
This paper presents a behavior-based error detection technique called Control Flow Checking using Branch Trace Exceptions for PowerPC processors family (CFCBTE). This technique is based on the branch trace exception feature available in the PowerPC processors family for debugging purposes. This technique traces the target addresses of program branches at run-time and compares them with reference target addresses to detect possible violations caused by transient faults. The reference target addresses are derived by a preprocessor from the source program. The proposed technique is experimentally evaluated on a 32-bit PowerPC microcontroller using software implemented fault injection (SWIFI)....
Article 11th Pacific Rim International Symposium on Dependable Computing, PRDC 2005, Changsha, Hunan, 12 December 2005 through 14 December 2005 ; Volume 2005 , 2005 , Pages 83-90 ; 0769524923 (ISBN); 9780769524924 (ISBN) ; Miremadi, S. G ; Sharif University of Technology
To enhance the error detection capability in commercial off-the-shelf (COTS)-based design of safety-critical systems, a new hardware-based control flow checking (CFC) technique is presented. This technique, Control Flow Checking by Execution Tracing (CFCET), employs the internal execution tracing features available in COTS processors and an external watchdog processor (WDP) to monitor the addresses of taken branches in a program. This is done without any modification of application programs; therefore, the program overhead is zero. The external hardware overhead is about 3.5% using an Altera Flex 10K30 FPGA. For different workload programs, the execution time overhead and the error...
Article 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005, Denver, CO, 4 April 2005 through 8 April 2005 ; Volume 2005 , 2005 ; 0769523129 (ISBN); 9780769523125 (ISBN) ; Sharif University of Technology
In this paper, a parallel algorithm for computing the roots of a given polynomial of degree n on a ring of processors is proposed. The algorithm implements the Durand-Kerner method and consists of two phases: initialization and iteration. In the initialization phase, all the necessary preparation steps are carried out to start the parallel computation. It includes register initialization and initial approximation of the roots, requiring 3n-2 communications, 2 exponentiations, one multiplication, 6 divisions, and 4n-3 additions. In the iteration phase, these initial approximated roots are corrected repeatedly and converge to their accurate values. The iteration phase is composed of some iteration...
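The two phases named in this abstract can be illustrated with a minimal sequential sketch of the Durand-Kerner method. It does not reproduce the paper's ring-of-processors communication scheme; the function name, the starting guess, the iteration limit, and the tolerance are assumptions chosen for illustration. Every root is updated from the previous iterate only, so the per-root corrections are independent of one another.

```python
def durand_kerner(coeffs, iterations=200, tol=1e-12):
    """Approximate all roots of a polynomial (coefficients highest degree first)."""
    n = len(coeffs) - 1
    lead = coeffs[0]
    monic = [c / lead for c in coeffs]  # normalise to a monic polynomial

    # Initialization phase: a common choice of distinct starting points,
    # powers of a complex number that is neither real nor a root of unity.
    roots = [(0.4 + 0.9j) ** k for k in range(n)]

    def evaluate(x):
        # Horner's rule for the monic polynomial.
        result = 0
        for c in monic:
            result = result * x + c
        return result

    # Iteration phase: correct every approximation using the other ones.
    for _ in range(iterations):
        new_roots = []
        max_change = 0.0
        for i, r in enumerate(roots):
            denom = 1
            for j, s in enumerate(roots):
                if j != i:
                    denom *= r - s
            delta = evaluate(r) / denom
            new_roots.append(r - delta)
            max_change = max(max_change, abs(delta))
        roots = new_roots
        if max_change < tol:
            break
    return roots
```

For instance, `durand_kerner([1, -3, 2])` approximates the roots of x^2 - 3x + 2, returned as complex numbers.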