Towards Ad Hoc Recovery for Soft Errors

Cite as:
hdl:2117/133802
Document type: Conference paper
Publication date: 2018-12-06
Publisher: IEEE
Access conditions: Open access
Project: DURO - DURO: Deep-memory Ubiquity, Reliability and Optimization (EC-H2020-708566)
BES-2014-068066 (MINECO-BES-2014-068066)
Abstract
The coming exascale era is a great opportunity for high performance computing (HPC) applications. However, the high failure rates expected on these systems will jeopardize the successful completion of application runs. Bit-flip errors in dynamic random access memory (DRAM) account for a noticeable share of the failures in supercomputers. Hardware mechanisms such as error correcting codes (ECC) can detect and correct single-bit errors and can detect some multi-bit errors, while others can go undetected. Unfortunately, detected multi-bit errors will most of the time force the termination of the application and lead to a global restart. Thus, other strategies at the software level are needed to tolerate these types of faults more efficiently and to avoid a global restart. In this work, we extend the FTI checkpointing library to facilitate the implementation of custom recovery strategies for MPI applications, minimizing the overhead introduced when coping with soft errors. The new functionalities are evaluated by implementing local forward recovery on three HPC benchmarks with different reliability requirements. Our results demonstrate a reduction in recovery times of up to 14%.
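The ECC behavior mentioned in the abstract (single-bit errors corrected, some multi-bit errors detected, others missed) can be illustrated with a toy extended Hamming (8,4) code. This is only a sketch for intuition: it is not taken from the paper, it is not the code used in real DRAM modules (which typically protect much wider words), and the function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 form a Hamming(7,4)
    codeword with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble) for a possibly corrupted codeword."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits;
    # it is zero for any valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2 == 1

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: assume exactly one and repair it.
        # A zero syndrome here points at the overall parity bit (index 0).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "detected_uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)

    single = list(word)
    single[5] ^= 1
    print(decode(single))   # one flip: repaired

    double = list(word)
    double[2] ^= 1
    double[6] ^= 1
    print(decode(double))   # two flips: detected, not repairable

    triple = list(word)
    for pos in (1, 2, 3):
        triple[pos] ^= 1
    print(decode(triple))   # three flips: reported as corrected, data wrong
```

The double-flip case corresponds to the detected multi-bit errors that, according to the abstract, usually force a global restart. The triple-flip case is an example of an error that goes undetected, since the decoder reports a successful correction while returning the wrong data.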
Citation: Losada, N. [et al.]. Towards Ad Hoc Recovery for Soft Errors. In: "2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)". IEEE, 2018, p. 1-10.
ISBN: 978-1-7281-0222-1
Publisher version: https://ieeexplore.ieee.org/document/8564482
Files | Description | Size | Format | View |
---|---|---|---|---|
Towards Ad Hoc Recovery For Soft Errors.pdf | | 363.6 Kb | PDF | View/Open |
All rights reserved. This work is protected by intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication, or transformation is prohibited without the authorization of the rights holder.