A checkpoint/restart directive-based approach

Cite as:
hdl:2117/100567
Carried out at/with: Barcelona Supercomputing Center
Document type: Official Master's Final Project (Projecte Final de Màster Oficial)
Date: 2017
Access conditions: Open access
All rights reserved. This work
is protected by intellectual and industrial property rights. Without prejudice to existing legal
exemptions, its reproduction, distribution, public communication, or transformation is prohibited without the
authorization of the rights holder.
Abstract
Exascale platforms require programming models with built-in support for resilience,
because the huge number of components they are expected to contain will
increase the number of errors.
Checkpoint/restart is a widely used resilience technique due to its robustness and low
overhead compared to other techniques. Several solutions already implement
this technique, such as FTI or SCR, which focus mainly on providing advanced
I/O capabilities to minimize checkpoint/restart time. However, application developers
are still in charge of: (1) manually serializing and deserializing the application state using
a low-level API; (2) modifying the natural flow of the application depending on whether the
current execution is a restart or not; and (3) reimplementing their checkpoint/restart code
whenever they have to change the backend library.
We present a new directive-based approach to performing application-level checkpoint/restart
in a simplified and portable way. We propose a solution based on compiler
directives, similar to those of OpenMP, that allows users to easily specify the state of the
application that has to be saved and restored, leaving the tedious and error-prone serialization
and deserialization activities to our intermediate library, which relies on a
backend library (FTI/SCR) to perform scalable and efficient I/O operations.
Our results, including several benchmarks and two large applications, reveal no extra
overhead compared to the direct use of FTI/SCR checkpoint/restart libraries while
significantly reducing the effort required by the application developers.
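
To make burdens (1) and (2) from the abstract concrete, the following is a minimal sketch in C of hand-written checkpoint/restart. It uses plain `stdio` file I/O in place of FTI or SCR, so it shows none of their APIs. The `app_state` struct, the function names, the fixed checkpoint interval of two iterations, and the `stop_at` parameter (used to simulate a failure) are all illustrative assumptions and are not taken from the thesis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical application state: an iteration counter and a small array. */
typedef struct {
    int    iter;
    double data[4];
} app_state;

/* Burden (1): the developer serializes the state by hand.
 * Returns 0 on success, -1 on failure. */
static int save_checkpoint(const char *path, const app_state *s)
{
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0) return -1;
    return written == 1 ? 0 : -1;
}

/* Burden (1) again: the developer deserializes the state by hand.
 * Returns 0 if a complete checkpoint was read, -1 otherwise. */
static int load_checkpoint(const char *path, app_state *s)
{
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;            /* no checkpoint file: fresh start */
    size_t got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}

/* Burden (2): the control flow branches on "restart or fresh start".
 * Runs until total_iters is reached or until stop_at (a simulated failure),
 * checkpointing every second iteration. Returns the last iteration reached. */
static int run(const char *path, int total_iters, int stop_at, app_state *s)
{
    if (load_checkpoint(path, s) != 0) {
        /* Fresh start: initialize the state explicitly. */
        s->iter = 0;
        for (int i = 0; i < 4; i++) s->data[i] = 0.0;
    }
    while (s->iter < total_iters && s->iter < stop_at) {
        for (int i = 0; i < 4; i++) s->data[i] += 1.0;  /* stand-in for real work */
        s->iter++;
        if (s->iter % 2 == 0) save_checkpoint(path, s);
    }
    return s->iter;
}
```

Calling `run` once with a small `stop_at` simulates a failure; calling it again with the same path resumes from the last saved iteration instead of starting over. This hand-written save, load, and branching logic is the kind of code that the directive-based approach aims to hand over to the intermediate library.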
Subjects: Programming (Mathematics), Software engineering
Degree: MÀSTER UNIVERSITARI EN INNOVACIÓ I RECERCA EN INFORMÀTICA (Pla 2012)
Files | Description | Size | Format |
---|---|---|---|
123240.pdf | | 2,452Mb | PDF |