DRAM errors in the field: a statistical approach
Document type: Conference report
Publisher: Association for Computing Machinery (ACM)
Rights access: Open Access
European Commission's project: EuroEXA - Co-designed Innovation and System for Resilient Exascale Computing in Europe: From Applications to Silicon (EC-H2020-754337)
This paper summarizes our two-year study of corrected and uncorrected errors on the MareNostrum 3 supercomputer, covering 2000 billion MB-hours of DRAM in the field. The study analyzes 4.5 million corrected and 71 uncorrected DRAM errors and compares the reliability of DIMMs from all three major memory manufacturers, built in three different technologies. Our work makes two sets of contributions. First, we illustrate the complexity of in-field DRAM error analysis and demonstrate the limitations of various widely used methods and metrics. For example, we show that average error rates, errors per MB-hour and mean time between failures can give volatile and unreliable results even after long periods of error logging, leading to incorrect conclusions about DRAM reliability. Second, we present formal statistical methods that overcome many of the limitations of the current approaches. These methods are simple to understand and implement, reliable, and widely accepted in the statistical community. Overall, our study alerts the community to the need, first, to question the current practice in quantifying DRAM reliability and, second, to select a proper analysis approach for future studies. We strongly recommend focusing on metrics with practical value that can be easily related to system reliability, and selecting methods that provide stable results, ideally supported by statistical significance.
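The abstract's point about volatile averages can be illustrated with a minimal sketch. It first computes the aggregate errors per MB-hour from the figures quoted above, then compares bootstrap intervals for two metrics on synthetic per-DIMM error counts. Everything beyond the three quoted figures is an assumption made for illustration: the heavy-tailed synthetic data, the helper names, and the choice of a percentile bootstrap. None of it is taken from the paper, and this is not the paper's method.

```python
import random
import statistics

# Aggregate figures quoted in the abstract.
CORRECTED_ERRORS = 4.5e6
UNCORRECTED_ERRORS = 71
MB_HOURS = 2000e9  # 2000 billion MB-hours


def errors_per_mb_hour(errors, mb_hours):
    """Aggregate error rate: total errors divided by total exposure."""
    return errors / mb_hours


def synthetic_error_counts(n_dimms, rng):
    """Hypothetical per-DIMM counts: most DIMMs log no errors, a few log many.

    The 95% zero share and the Pareto shape are illustrative assumptions.
    """
    counts = []
    for _ in range(n_dimms):
        if rng.random() < 0.95:
            counts.append(0)
        else:
            counts.append(int(rng.paretovariate(0.8)))
    return counts


def fraction_with_errors(counts):
    """Share of DIMMs that logged at least one error."""
    return sum(1 for c in counts if c > 0) / len(counts)


def bootstrap_interval(data, statistic, rng, n_resamples=500, level=0.95):
    """Percentile bootstrap interval for an arbitrary statistic."""
    estimates = sorted(
        statistic(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lower = estimates[int((1 - level) / 2 * n_resamples)]
    upper = estimates[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


rng = random.Random(0)  # fixed seed so the sketch is repeatable
counts = synthetic_error_counts(1000, rng)

corrected_rate = errors_per_mb_hour(CORRECTED_ERRORS, MB_HOURS)
uncorrected_rate = errors_per_mb_hour(UNCORRECTED_ERRORS, MB_HOURS)
mean_interval = bootstrap_interval(counts, statistics.fmean, rng)
fraction_interval = bootstrap_interval(counts, fraction_with_errors, rng)

print("corrected errors per MB-hour:", corrected_rate)
print("uncorrected errors per MB-hour:", uncorrected_rate)
print("95% interval, mean errors per DIMM:", mean_interval)
print("95% interval, fraction of DIMMs with errors:", fraction_interval)
```

With heavy-tailed counts like these, a handful of DIMMs typically dominates the mean, so its interval tends to be wide relative to the estimate, while the fraction of DIMMs with at least one error is bounded and usually more stable. Reporting an interval alongside each metric is one way to make that difference visible.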
Citation: Zivanovic, D. [et al.]. DRAM errors in the field: a statistical approach. A: International Symposium on Memory Systems. "MEMSYS 2019: proceedings of the International Symposium on Memory Systems: Washington DC, September 30–October 3, 2019". New York: Association for Computing Machinery (ACM), 2019, p. 69-84.