FPGA checkpointing for scientific computing
DOI: 10.1109/IOLTS52814.2021.9486693
Cite as: hdl:2117/374272
Document type: Conference paper
Publication date: 2021
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Access conditions: Open access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, its reproduction, distribution, public communication or transformation without the authorization of the rights holder is prohibited.
Project: EuroEXA - Co-designed Innovation and System for Resilient Exascale Computing in Europe: From Applications to Silicon (EC-H2020-754337)
eProcessor - European, extendable, energy-efficient, energetic, embedded, extensible, Processor Ecosystem (EC-H2020-956702)
Abstract
The use of FPGAs in computational workloads is becoming increasingly popular due to the flexibility of these devices compared to ASICs and their low power consumption compared to GPUs and CPUs. However, scientific applications run for long periods of time, and the hardware is always subject to failures caused by either soft or hard errors. It is therefore important to protect these long-running jobs with fault tolerance mechanisms. Checkpoint-Restart is a popular technique in high-performance computing that allows large-scale applications to cope with frequent failures. In this work we approach the fault tolerance of CPU-FPGA heterogeneous applications from a high level by using the OmpSs@FPGA environment and a multi-level checkpointing library. We analyse the performance of several different applications to understand what kind of overheads can be expected from checkpointing computational workloads running on FPGAs. Our results demonstrate overheads as low as 0.16% and 0.66% when checkpointing very frequently, indicating that this technique is efficient and does not add significant overhead to the system. In addition, we showcase a proof of concept for checkpointing partial data of the FPGA task itself. This can prove useful for workloads in which most data is offloaded to the FPGA memory at once and which do not constantly move all the data between the accelerator and the CPU.
Citation: Perelló Bacardit, M.; Bautista Gomez, L.; Unsal, O.S. FPGA checkpointing for scientific computing. In: IEEE Symposium on On-Line Testing (IOLTS). "2021 IEEE 27th International Symposium on On-Line Testing and Robust System Design (IOLTS): 28-30 June 2021, Torino, Italy: proceedings". Institute of Electrical and Electronics Engineers (IEEE), 2021, ISBN 978-1-6654-3370-9. DOI 10.1109/IOLTS52814.2021.9486693.
ISBN: 978-1-6654-3370-9
ISSN: 1942-9401
Publisher version: https://ieeexplore.ieee.org/document/9486693
Files | Description | Size | Format |
---|---|---|---|
FPGA_Checkpointing_for_Scientific_Computing.pdf | | 377.9 KB | PDF |