A directive-based approach to perform persistent checkpoint/restart
Document type: Conference report
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Restricted access - publisher's policy
Exascale platforms require support for resilience capabilities due to increasing numbers of components and associated error rates. In this paper, we present a new directive-based approach to perform application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to OpenMP, that allows users to easily specify the application state that has to be saved and restored. This leaves the tedious and error-prone serialization and deserialization activities to our library, which relies on SCR/FTI to perform scalable and efficient I/O operations. Our results, based on several benchmarks and two large applications, show no additional overhead compared to the direct use of the FTI and SCR checkpoint/restart libraries. In addition, our portable approach significantly improves programmability, reducing the number of lines of code required to perform checkpoint/restart by an average of about 82% and 94% for FTI and SCR, respectively.
Citation: Maroñas, M., Mateo, S., Beltran, V., Ayguade, E. A directive-based approach to perform persistent checkpoint/restart. In: International Conference on High Performance Computing and Simulation. "HPCS 2017: 2017 International Conference on High Performance Computing & Simulation: proceedings: 17-21 July 2017: Genoa, Italy". Genoa: Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 442-451.
Files: A Directive-Bas ... Persistent Checkpoint.pdf (306.6 KB, restricted access)