Designs for increasing reliability while reducing energy and increasing lifetime
ColaboratorCristal Kestelman, Adrián; Unsal, Osman; ; Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
Document typeDoctoral thesis
PublisherUniversitat Politècnica de Catalunya
Rights accessOpen Access
In the last decades, the computing technology experienced tremendous developments. For instance, transistors' feature size shrank to half at every two years as consistently from the first time Moore stated his law. Consequently, number of transistors and core count per chip doubles at each generation. Similarly, petascale systems that have the capability of processing more than one billion calculation per second have been developed. As a matter of fact, exascale systems are predicted to be available at year 2020. However, these developments in computer systems face a reliability wall. For instance, transistor feature sizes are getting so small that it becomes easier for high-energy particles to temporarily flip the state of a memory cell from 1-to-0 or 0-to-1. Also, even if we assume that fault-rate per transistor stays constant with scaling, the increase in total transistor and core count per chip will significantly increase the number of faults for future desktop and exascale systems. Moreover, circuit ageing is exacerbated due to increased manufacturing variability and thermal stresses, therefore, lifetime of processor structures are becoming shorter. On the other side, due to the limited power budget of the computer systems such that mobile devices, it is attractive to scale down the voltage. However, when the voltage level scales to beyond the safe margin especially to the ultra-low level, the error rate increases drastically. Nevertheless, new memory technologies such as NAND flashes present only limited amount of nominal lifetime, and when they exceed this lifetime, they can not guarantee storing of the data correctly leading to data retention problems. Due to these issues, reliability became a first-class design constraint for contemporary computing in addition to power and performance. Moreover, reliability even plays increasingly important role when computer systems process sensitive and life-critical information such as health records, financial information, power regulation, transportation, etc. In this thesis, we present several different reliability designs for detecting and correcting errors occurring in processor pipelines, L1 caches and non-volatile NAND flash memories due to various reasons. We design reliability solutions in order to serve three main purposes. Our first goal is to improve the reliability of computer systems by detecting and correcting random and non-predictable errors such as bit flips or ageing errors. Second, we aim to reduce the energy consumption of the computer systems by allowing them to operate reliably at ultra-low voltage level. Third, we target to increase the lifetime of new memory technologies by implementing efficient and low-cost reliability schemes.
- Tesis - TDX-UPC