Understanding soft error sensitivity of deep learning models and frameworks through checkpoint alteration
DOI: 10.1109/Cluster48925.2021.00045
Cite as: hdl:2117/364744
Document type: Conference paper
Publication date: 2021
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Access conditions: Open access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication, or transformation is prohibited without the authorization of the rights holder.
Abstract
The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances in discovery that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips at different locations in a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with minimal impact on accuracy convergence.
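To make the notion of a bit-flip in stored model weights concrete, the following is a minimal sketch and not the authors' tool. It assumes weights are stored as little-endian IEEE 754 float32 values in a raw byte buffer, and it does not read or write HDF5 files. The function names `flip_bit_float32` and `inject_bit_flip` are hypothetical and chosen only for illustration.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value`, encoded as float32, with a single bit flipped.

    Bit 0 is the least significant mantissa bit; bit 31 is the sign bit.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 31]")
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-bit mask toggles exactly that bit.
    flipped = as_int ^ (1 << bit)
    (result,) = struct.unpack("<f", struct.pack("<I", flipped))
    return result


def inject_bit_flip(buffer: bytearray, index: int, bit: int) -> None:
    """Flip one bit of the `index`-th float32 stored in `buffer`, in place.

    Assumption for this sketch: `buffer` is a flat array of little-endian
    float32 values, standing in for the weights held in a checkpoint.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 31]")
    offset = index * 4
    # Work on the raw 32-bit pattern so that no float conversion occurs.
    (as_int,) = struct.unpack_from("<I", buffer, offset)
    struct.pack_into("<I", buffer, offset, as_int ^ (1 << bit))


if __name__ == "__main__":
    weights = bytearray(struct.pack("<3f", 0.5, 1.0, -2.0))
    inject_bit_flip(weights, 1, 31)  # flip the sign bit of the second weight
    print(struct.unpack("<3f", weights))
```

The position of the flipped bit matters a great deal: in this sketch, flipping the sign bit of 1.0 yields -1.0, while flipping the most significant exponent bit (bit 30) of 1.0 yields infinity. Low-order mantissa bits, by contrast, change the value only slightly.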
Citation: Rojas, E. [et al.]. Understanding soft error sensitivity of deep learning models and frameworks through checkpoint alteration. In: IEEE International Conference on Cluster Computing. "2021 IEEE International Conference on Cluster Computing (CLUSTER); Portland, OR, USA, 7-10 Sept. 2021: proceedings". Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 492-503. ISBN 978-1-7281-9666-4. DOI 10.1109/Cluster48925.2021.00045.
ISBN: 978-1-7281-9666-4
ISSN: 2168-9253
Publisher version: https://ieeexplore.ieee.org/document/9556041
Files | Description | Size | Format | View |
---|---|---|---|---|
Understanding_S ... _Checkpoint_Alteration.pdf | | 545.4 KB | | View/Open |