A checkpoint/restart directive-based approach
Covenantee: Barcelona Supercomputing Centre
Document type: Master thesis
Rights access: Open Access
Exascale platforms require programming models with built-in support for resilience, since the huge number of components they are expected to have will increase the number of errors. Checkpoint/restart is a widely used resilience technique because of its robustness and low overhead compared to other techniques. Several solutions already implement this technique, such as FTI or SCR, and focus mainly on providing advanced I/O capabilities to minimize checkpoint/restart time. However, application developers are still in charge of: (1) manually serializing and deserializing the application state using a low-level API; (2) modifying the natural flow of the application depending on whether the current execution is a restart or not; and (3) reimplementing their checkpoint/restart code whenever they have to change the backend library. We present a new directive-based approach to performing application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to those of OpenMP, that allows users to easily specify the state of the application that has to be saved and restored, leaving the tedious and error-prone serialization and deserialization activities to our intermediate library, which relies on a backend library (FTI/SCR) to perform scalable and efficient I/O operations. Our results, including several benchmarks and two large applications, reveal no extra overhead compared to the direct use of the FTI/SCR checkpoint/restart libraries while significantly reducing the effort required from application developers.
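To make burdens (1) and (2) concrete, the following is a minimal sketch of hand-written application-level checkpoint/restart in plain C. It does not use FTI, SCR, or the thesis's library: the `State` struct, the functions `save_state`, `load_state`, and `run`, and the checkpoint file layout are all hypothetical names introduced for illustration, and plain `fwrite`/`fread` stand in for a backend's I/O. The pragma shown in the comments is likewise an assumed, illustrative syntax, not the directive set defined in the thesis.

```c
#include <assert.h>
#include <stdio.h>

#define N 4

/* Hypothetical application state: an iteration counter and a small array. */
typedef struct {
    int iter;
    double data[N];
} State;

/* Burden (1): the developer serializes the state by hand. */
static int save_state(const char *path, const State *s) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0) return -1;
    return written == 1 ? 0 : -1;
}

/* ...and deserializes it by hand. Returns -1 if no usable checkpoint exists. */
static int load_state(const char *path, State *s) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t read = fread(s, sizeof *s, 1, f);
    fclose(f);
    return read == 1 ? 0 : -1;
}

/* Iterates until `stop_at`, checkpointing every `every` iterations.
 * The final state is copied to `out`. */
static void run(const char *path, int stop_at, int every, State *out) {
    State s;
    assert(every > 0);

    /* Burden (2): control flow branches on whether this is a restart. */
    if (load_state(path, &s) != 0) {
        s.iter = 0;                      /* fresh start */
        for (int i = 0; i < N; i++) s.data[i] = 0.0;
    }
    /* With a directive-based approach, the branch above might collapse to
     * a single annotation (hypothetical syntax, for illustration only):
     *   #pragma chk load(s)
     */

    while (s.iter < stop_at) {
        for (int i = 0; i < N; i++) s.data[i] += 1.0;  /* the "work" */
        s.iter++;
        if (s.iter % every == 0) {
            save_state(path, &s);
            /* Hypothetical directive equivalent:
             *   #pragma chk store(s)
             */
        }
    }
    *out = s;
}
```

Calling `run("state.ckpt", 5, 2, &s)` and then `run("state.ckpt", 10, 2, &s)` mimics a failure after iteration 5: the second call resumes from the checkpoint written at iteration 4 rather than from zero. Burden (3) shows up as soon as `save_state` and `load_state` have to be rewritten against a different backend's API, which is the coupling an intermediate library is meant to hide.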