
dc.contributor.author: Živanovič, Darko
dc.contributor.author: Esmaili Dokht, Pouya
dc.contributor.author: Moré, Sergi
dc.contributor.author: Bartolomé, Javier
dc.contributor.author: Carpenter, Paul Matthew
dc.contributor.author: Radojković, Petar
dc.contributor.author: Ayguadé Parra, Eduard
dc.contributor.other: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.contributor.other: Barcelona Supercomputing Center
dc.date.accessioned: 2020-05-06T13:38:50Z
dc.date.available: 2020-05-06T13:38:50Z
dc.date.issued: 2019
dc.identifier.citation: Zivanovic, D. [et al.]. DRAM errors in the field: a statistical approach. In: International Symposium on Memory Systems. "MEMSYS 2019: proceedings of the International Symposium on Memory Systems: Washington DC, September 30–October 3, 2019". New York: Association for Computing Machinery (ACM), 2019, p. 69-84.
dc.identifier.isbn: 978-1-4503-7206-0
dc.identifier.uri: http://hdl.handle.net/2117/186553
dc.description.abstract: This paper summarizes our two-year study of corrected and uncorrected errors on the MareNostrum 3 supercomputer, covering 2000 billion MB-hours of DRAM in the field. The study analyzes 4.5 million corrected and 71 uncorrected DRAM errors and it compares the reliability of DIMMs from all three major memory manufacturers, built in three different technologies. Our work has two sets of contributions. First, we illustrate the complexity of in-field DRAM error analysis and demonstrate the limitations of various widely-used methods and metrics. For example, we show that average error rates, errors per MB-hour and mean time between failures can provide volatile and unreliable results even after long periods of error logging, leading to incorrect conclusions about DRAM reliability. Second, we present formal statistical methods that overcome many of the limitations of the current approaches. The methods that we present are simple to understand and implement, reliable and widely accepted in the statistical community. Overall, our study alerts the community about the need to, firstly, question the current practice in quantifying DRAM reliability and, secondly, to select a proper analysis approach for future studies. Our strong recommendations are to focus on metrics with a practical value that could be easily related to system reliability, and to select methods that provide stable results, ideally supported with statistical significance.
dc.description.sponsorship: This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, Spanish Government through Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through TIN2015-65316-P project and by the Generalitat de Catalunya (contracts 2014-SGR1051 and 2014-SGR-1272). This work has also received funding from the European Union’s Horizon 2020 research and innovation programme under EuroEXA project (grant agreement No 754337). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain.
dc.format.extent: 16 p.
dc.language.iso: eng
dc.publisher: Association for Computing Machinery (ACM)
dc.subject: UPC subject areas::Computer science::Computer architecture
dc.subject.lcsh: Hardware -- Reliability
dc.subject.lcsh: High performance computing
dc.subject.lcsh: Supercomputers
dc.subject.other: Memory
dc.subject.other: Large-scale systems
dc.subject.other: Statistical analysis
dc.subject.other: MareNostrum 3
dc.title: DRAM errors in the field: a statistical approach
dc.type: Conference report
dc.subject.lemac: Computers -- Reliability
dc.subject.lemac: High performance computing (Computer science)
dc.subject.lemac: Supercomputers
dc.contributor.group: Universitat Politècnica de Catalunya. ARCO - Microarquitectura i Compiladors
dc.contributor.group: Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
dc.identifier.doi: 10.1145/3357526.3357558
dc.relation.publisherversion: https://dl.acm.org/doi/10.1145/3357526.3357558
dc.rights.access: Open Access
local.identifier.drac: 27852356
dc.description.version: Postprint (author's final draft)
dc.relation.projectid: info:eu-repo/grantAgreement/EC/H2020/754337/EU/Co-designed Innovation and System for Resilient Exascale Computing in Europe: From Applications to Silicon/EuroEXA
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO//TIN2015-65316-P/ES/COMPUTACION DE ALTAS PRESTACIONES VII/
dc.relation.projectid: info:eu-repo/grantAgreement/AGAUR/V PRI/2014 SGR 1051
dc.relation.projectid: info:eu-repo/grantAgreement/AGAUR/V PRI/2014 SGR 1272
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO//SEV-2015-0493/ES/BARCELONA SUPERCOMPUTING CENTER - CENTRO. NACIONAL DE SUPERCOMPUTACION/
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO//SVP-2014-068501/ES/SVP-2014-068501/
local.citation.author: Zivanovic, D.; Esmaili, P.; Moré, S.; Bartolomé, J.; Carpenter, P.; Radojkovic, P.; Ayguadé, E.
local.citation.contributor: International Symposium on Memory Systems
local.citation.pubplace: New York
local.citation.publicationName: MEMSYS 2019: proceedings of the International Symposium on Memory Systems: Washington DC, September 30–October 3, 2019
local.citation.startingPage: 69
local.citation.endingPage: 84

