Runtime-guided ECC protection using online estimation of memory vulnerability
Cite as:
hdl:2117/340654
Document type: Conference report
Defense date: 2020
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial
property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public
communication or transformation of this work are prohibited without permission of the copyright holder.
Project: COMPUTACION DE ALTAS PRESTACIONES VII (MINECO-TIN2015-65316-P)
ROMOL - Riding on Moore's Law (EC-FP7-321253)
Abstract
Diminishing reliability of semiconductor technologies and decreasing power budgets per component hinder designing next-generation high performance computing (HPC) systems. Both constraints strongly impact memory subsystems, as DRAM main memory accounts for up to 30 to 50 percent of a node’s overall power consumption, and is the subsystem that is most subject to faults. Improving reliability requires stronger error correcting codes (ECCs), which incur additional power and storage costs. It is critical to develop strategies to uphold memory reliability while minimising these costs, with the goal of improving the power efficiency of computing machines.
We introduce a methodology to dynamically estimate the vulnerability of data, and adjust ECC protection accordingly. Our methodology relies on information readily available to runtime systems in task-based dataflow programming models, and the existing Virtualized Error Correcting Code (VECC) schemes to provide adaptable protection. Guiding VECC using vulnerability estimates offers a wide range of reliability-redundancy trade-offs, as reliable as using expensive offline profiling for guidance and up to 25% safer than VECC without guidance. Runtime-guided VECC is more efficient than a stronger uniform ECC, reducing DIMM lifetime failure from 1.84% down to 1.26% while increasing DRAM energy consumption by only 1.03×.
Citation: Jaulmes, L. [et al.]. Runtime-guided ECC protection using online estimation of memory vulnerability. In: International Conference for High Performance Computing, Networking, Storage and Analysis. "Proceedings of SC20: The International Conference for High Performance Computing, Networking, Storage and Analysis: Virtual Event, November 9-19, 2020". Institute of Electrical and Electronics Engineers (IEEE), 2020, p. 1-14. ISBN 978-1-7281-9998-6. DOI 10.1109/SC41405.2020.00080.
ISBN: 978-1-7281-9998-6
Publisher version: https://ieeexplore.ieee.org/document/9355313
Collections
- Doctorat en Arquitectura de Computadors - Ponències/Comunicacions de congressos [311]
- Computer Sciences - Ponències/Comunicacions de congressos [597]
- CAP - Grup de Computació d'Altes Prestacions - Ponències/Comunicacions de congressos [784]
- Departament d'Arquitectura de Computadors - Ponències/Comunicacions de congressos [1.976]
Files | Description | Size | Format | View |
---|---|---|---|---|
runtime_guided_ecc_protection_sc_20.pdf | | 741,6Kb | PDF | View/Open |