Understanding soft error sensitivity of deep learning models and frameworks through checkpoint alteration

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication, or transformation is prohibited without the authorization of the rights holder.

Abstract

The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances and discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different parts of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, so long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with minimal impact on accuracy convergence.
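To make the notion of a bit-flip in a stored weight concrete, the following is a minimal sketch, not the authors' tool. It assumes weights are stored as IEEE 754 32-bit floats, and the names `flip_bit` and `inject_bit_flips` are hypothetical. Reading and writing the HDF5 checkpoint itself is omitted; the sketch operates on a plain list of values that would, under this assumption, have been loaded from one checkpoint dataset.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, encoded as a float32, with one bit inverted.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit index must be in [0, 31]")
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # Invert the chosen bit and reinterpret the result as a float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def inject_bit_flips(weights, n_flips, seed=0):
    """Return a copy of `weights` with `n_flips` random single-bit flips.

    Hypothetical helper: positions and bits are drawn uniformly, which is
    an assumption of this sketch, not a description of the paper's policy.
    """
    rng = random.Random(seed)
    corrupted = list(weights)
    for _ in range(n_flips):
        idx = rng.randrange(len(corrupted))
        bit = rng.randrange(32)
        corrupted[idx] = flip_bit(corrupted[idx], bit)
    return corrupted
```

The position of the flipped bit matters a great deal: `flip_bit(1.0, 31)` only changes the sign and yields `-1.0`, while `flip_bit(1.0, 30)` sets the top exponent bit and yields infinity.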

Citation: Rojas, E. [et al.]. Understanding soft error sensitivity of deep learning models and frameworks through checkpoint alteration. In: IEEE International Conference on Cluster Computing. "2021 IEEE International Conference on Cluster Computing (CLUSTER); Portland, OR, USA, 7-10 Sept. 2021: proceedings". Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 492-503. ISBN 978-1-7281-9666-4. DOI 10.1109/Cluster48925.2021.00045.

URI: http://hdl.handle.net/2117/364744

DOI: 10.1109/Cluster48925.2021.00045

ISBN: 978-1-7281-9666-4

ISSN: 2168-9253

Publisher version: https://ieeexplore.ieee.org/document/9556041

Collections

Computer Sciences - Conference papers/presentations


Files	Description	Size	Format	View
Understanding_S ... _Checkpoint_Alteration.pdf		545.4 KB	PDF	View/Open

UPCommons. Open knowledge portal of the UPC
