DRAM errors in the field: a statistical approach
dc.contributor.author | Živanovič, Darko |
dc.contributor.author | Esmaili Dokht, Pouya |
dc.contributor.author | Moré, Sergi |
dc.contributor.author | Bartolomé, Javier |
dc.contributor.author | Carpenter, Paul Matthew |
dc.contributor.author | Radojković, Petar |
dc.contributor.author | Ayguadé Parra, Eduard |
dc.contributor.other | Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors |
dc.contributor.other | Barcelona Supercomputing Center |
dc.date.accessioned | 2020-05-06T13:38:50Z |
dc.date.available | 2020-05-06T13:38:50Z |
dc.date.issued | 2019 |
dc.identifier.citation | Zivanovic, D. [et al.]. DRAM errors in the field: a statistical approach. In: International Symposium on Memory Systems. "MEMSYS 2019: proceedings of the International Symposium on Memory Systems: Washington DC, September 30–October 3, 2019". New York: Association for Computing Machinery (ACM), 2019, p. 69-84. |
dc.identifier.isbn | 978-1-4503-7206-0 |
dc.identifier.uri | http://hdl.handle.net/2117/186553 |
dc.description.abstract | This paper summarizes our two-year study of corrected and uncorrected errors on the MareNostrum 3 supercomputer, covering 2000 billion MB-hours of DRAM in the field. The study analyzes 4.5 million corrected and 71 uncorrected DRAM errors and it compares the reliability of DIMMs from all three major memory manufacturers, built in three different technologies. Our work has two sets of contributions. First, we illustrate the complexity of in-field DRAM error analysis and demonstrate the limitations of various widely-used methods and metrics. For example, we show that average error rates, errors per MB-hour and mean time between failures can provide volatile and unreliable results even after long periods of error logging, leading to incorrect conclusions about DRAM reliability. Second, we present formal statistical methods that overcome many of the limitations of the current approaches. The methods that we present are simple to understand and implement, reliable and widely accepted in the statistical community. Overall, our study alerts the community about the need to, firstly, question the current practice in quantifying DRAM reliability and, secondly, to select a proper analysis approach for future studies. Our strong recommendations are to focus on metrics with a practical value that could be easily related to system reliability, and to select methods that provide stable results, ideally supported with statistical significance. |
dc.description.sponsorship | This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, Spanish Government through Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR1051 and 2014-SGR-1272). This work has also received funding from the European Union’s Horizon 2020 research and innovation programme under EuroEXA project (grant agreement No 754337). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain. |
dc.format.extent | 16 p. |
dc.language.iso | eng |
dc.publisher | Association for Computing Machinery (ACM) |
dc.subject | Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors |
dc.subject.lcsh | Hardware -- Reliability |
dc.subject.lcsh | High performance computing |
dc.subject.lcsh | Supercomputers |
dc.subject.other | Memory |
dc.subject.other | Large-scale systems |
dc.subject.other | Statistical analysis |
dc.subject.other | MareNostrum 3 |
dc.title | DRAM errors in the field: a statistical approach |
dc.type | Conference report |
dc.subject.lemac | Ordinadors -- Fiabilitat |
dc.subject.lemac | Càlcul intensiu (Informàtica) |
dc.subject.lemac | Superordinadors |
dc.contributor.group | Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors |
dc.contributor.group | Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions |
dc.identifier.doi | 10.1145/3357526.3357558 |
dc.relation.publisherversion | https://dl.acm.org/doi/10.1145/3357526.3357558 |
dc.rights.access | Open Access |
local.identifier.drac | 27852356 |
dc.description.version | Postprint (author's final draft) |
dc.relation.projectid | info:eu-repo/grantAgreement/EC/H2020/754337/EU/Co-designed Innovation and System for Resilient Exascale Computing in Europe: From Applications to Silicon/EuroEXA |
dc.relation.projectid | info:eu-repo/grantAgreement/MINECO//TIN2015-65316-P/ES/COMPUTACION DE ALTAS PRESTACIONES VII/ |
dc.relation.projectid | info:eu-repo/grantAgreement/AGAUR/V PRI/2014 SGR 1051 |
dc.relation.projectid | info:eu-repo/grantAgreement/AGAUR/V PRI/2014 SGR 1272 |
dc.relation.projectid | info:eu-repo/grantAgreement/MINECO//SEV-2015-0493/ES/BARCELONA SUPERCOMPUTING CENTER - CENTRO. NACIONAL DE SUPERCOMPUTACION/ |
dc.relation.projectid | info:eu-repo/grantAgreement/MINECO//SVP-2014-068501/ES/SVP-2014-068501/ |
local.citation.author | Zivanovic, D.; Esmaili, P.; Moré, S.; Bartolomé, J.; Carpenter, P.; Radojkovic, P.; Ayguadé, E. |
local.citation.contributor | International Symposium on Memory Systems |
local.citation.pubplace | New York |
local.citation.publicationName | MEMSYS 2019: proceedings of the International Symposium on Memory Systems: Washington DC, September 30–October 3, 2019 |
local.citation.startingPage | 69 |
local.citation.endingPage | 84 |