Reliability Improvement of STT-MRAM Memories in Data Storage Systems

Cheshmikhani, Elham | 2020

  1. Type of Document: Ph.D. Dissertation
  2. Language: Farsi
  3. Document No: 52905 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Asadi, Hossein; Farbeh, Hamed
  7. Abstract:
  8. Spin-Transfer Torque Magnetic RAM (STT-MRAM) is known as the most promising replacement for SRAM technology in cache memories. Despite its major advantages of high density, non-volatility, near-zero leakage power, and immunity to radiation-induced particle strikes, STT-MRAM-based cache memory suffers from high error rates, mainly due to retention failure, read disturbance, and write failure. These errors are the major reliability challenge in STT-MRAM caches. Existing studies are limited to estimating the rate of only one or two of these error types in STT-MRAM caches. However, none of the previous studies has offered an estimate of the overall vulnerability of STT-MRAM caches, which is essential for designing cost-efficient reliable caches. Meanwhile, all existing reliability improvement schemes for STT-MRAM caches address only one or two error types, and the majority of them adversely affect the other error types. In this dissertation, we first propose a system-level framework for reliability exploration and characterization of error behavior in STT-MRAM caches. To this end, we formulate the cache vulnerability considering the inter-correlation of the error types, including retention failure, read disturbance, and write failure, as well as the dependency of error rates on workload behavior and Process Variations (PVs). Then, we investigate the effect of temperature on the STT-MRAM cache error rate and demonstrate that heat accumulation increases the error rate by 110.9 percent. We also illustrate that this heat accumulation is mainly due to the locality of committed write operations in the cache.
In addition, we demonstrate that a) extra read accesses to the data and tag arrays, which are imposed to improve the cache access time, significantly increase the read disturbance error rate; and b) the diversity in the number of '1's and in switching among the codewords of a data block significantly degrades the protection capability of error-correcting codes (ECCs). We also propose a new cache architecture, called Reliability-Optimized STT-MRAM Memory (ROSTAM), which customizes different parts of the cache structure for reliability enhancement. ROSTAM consists of four components: 1) a simple yet effective replacement policy, called TA-LRW, to prevent heat accumulation in the cache and reduce the rate of all three error types; 2) a novel tag array structure, called 3RSeT, to reduce the error rate by eliminating a significant portion of tag reads; 3) an effective scheme, called REAP-Cache, to prevent the accumulation of read disturbance in cache blocks and completely eliminate the adverse effect of concealed reads on cache reliability; and 4) a new ECC configuration, called ROBIN, to uniformly distribute the transitions among the codewords and maximize the ECC correction capability. We compare the proposed architecture with an 8-way L2 cache protected by SEC-DED(72,64) and using the LRU policy. The experimental results, obtained with the gem5 full-system simulator and a comprehensive set of multi-programmed workloads from the SPEC CPU2006 benchmark suite on a quad-core processor, show that: 1) the read disturbance error rate is reduced by 4966.1x, which is achieved by integrating TA-LRW, 3RSeT, ROBIN, and REAP-Cache; 2) the write failure rate is reduced by 3.7x, which is the effect of TA-LRW and ROBIN; 3) the retention failure rate is reduced by 8.1x because of the TA-LRW and REAP-Cache operations; and 4) the total error rate considering all error types is reduced by 10x.
This significant reliability enhancement is achieved at the cost of less than a 2.7% increase in energy consumption, less than 1% area overhead, and an average performance degradation of 2.3%.
  9. Keywords:
  10. Cache Memory ; Reliability ; Read Disturbance ; Spin Transfer Torque-Magnetic (STT-MRAM) ; Write Failure ; Retention Failure
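
The abstract's observation that diversity in the number of '1's across the codewords of a block degrades ECC protection can be illustrated with a small numerical sketch. This is not the dissertation's vulnerability model or the ROBIN scheme. It assumes a simplified setting that the record does not state: a block made of eight SEC-DED(72,64) codewords, bit flips that affect only cells storing '1', independent flips, and a hypothetical per-cell flip probability `P_FLIP`. All function names are illustrative.

```python
from math import prod


def codeword_failure_prob(ones: int, p_flip: float) -> float:
    """Probability that a SEC-DED codeword suffers two or more bit flips.

    Assumption (not from the source): only cells storing '1' can flip,
    each independently with probability p_flip. SEC-DED corrects a single
    error, so two or more flips are treated as uncorrectable.
    """
    if ones < 2:
        return 0.0
    p_no_flip = (1.0 - p_flip) ** ones
    p_one_flip = ones * p_flip * (1.0 - p_flip) ** (ones - 1)
    return 1.0 - p_no_flip - p_one_flip


def block_failure_prob(ones_per_codeword, p_flip: float) -> float:
    """A block fails if at least one of its codewords is uncorrectable."""
    return 1.0 - prod(
        1.0 - codeword_failure_prob(k, p_flip) for k in ones_per_codeword
    )


P_FLIP = 1e-4  # hypothetical per-cell flip probability, for illustration only

# Both layouts hold the same total number of '1's (288) in eight codewords.
uniform = [36] * 8             # '1's spread evenly across codewords
skewed = [60] * 4 + [12] * 4   # '1's concentrated in half of the codewords

print("uniform:", block_failure_prob(uniform, P_FLIP))
print("skewed: ", block_failure_prob(skewed, P_FLIP))
```

Under these assumptions the skewed layout has the higher block failure probability, because the chance of a double error grows roughly with the square of the number of '1's in a codeword. This is consistent with the motivation the abstract gives for distributing transitions uniformly among codewords.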
