A checkpoint/restart directive-based approach

Maroñas Bravo, Marcos


Tutor / director: Beltran Querol, Vicenç; Ayguadé Parra, Eduard

Carried out at/with: Barcelona Supercomputing Center

Document type: Official Master's Final Project

Date: 2017

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication or transformation without the authorization of the rights holder is prohibited.

Abstract

Exascale platforms require programming models that support resilience, since the huge number of components they are expected to contain will increase the number of errors. Checkpoint/restart is a widely used resilience technique because of its robustness and low overhead compared to other techniques. Several solutions already implement this technique, such as FTI and SCR, and they focus mainly on providing advanced I/O capabilities to minimize checkpoint/restart time. However, application developers are still in charge of: (1) manually serializing and deserializing the application state using a low-level API; (2) modifying the natural flow of the application depending on whether the current execution is a restart or not; and (3) reimplementing their checkpoint/restart code whenever they have to change the backend library. We present a new directive-based approach to performing application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, like those of OpenMP, that allows users to easily specify the state of the application that has to be saved and restored, leaving the tedious and error-prone serialization and deserialization activities to our intermediate library, which relies on a backend library (FTI/SCR) to perform scalable and efficient I/O operations. Our results, including several benchmarks and two large applications, reveal no extra overhead compared to the direct use of the FTI/SCR checkpoint/restart libraries while significantly reducing the effort required from application developers.
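To make the three developer burdens listed in the abstract concrete, the following is a minimal sketch of hand-written application-level checkpoint/restart. It is only an illustration under stated assumptions: Python is used for brevity, `FileBackend`, `run`, and the `fail_at` parameter are hypothetical names invented here, and nothing in it reproduces the FTI or SCR APIs or the directive syntax proposed in the thesis. The comments mark where manual serialization (1) and the restart branch (2) intrude into the application loop, and the backend class stands for the code that would have to be rewritten on a library change (3).

```python
import os
import pickle
import tempfile


class FileBackend:
    """Hypothetical stand-in for a backend library; it only manages one local file."""

    def __init__(self, path):
        self.path = path

    def has_checkpoint(self):
        return os.path.exists(self.path)

    def write(self, blob):
        # Write to a temporary file and rename it, so an interrupted
        # write never replaces a good checkpoint with a partial one.
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(blob)
        os.replace(tmp, self.path)

    def read(self):
        with open(self.path, "rb") as f:
            return f.read()


def run(backend, total_steps, checkpoint_every, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `checkpoint_every` steps.

    `fail_at` (hypothetical, for demonstration) simulates a crash just
    before the given step is processed.
    """
    # (2) The natural flow is altered: the program must first find out
    # whether this execution is a restart.
    if backend.has_checkpoint():
        state = pickle.loads(backend.read())  # (1) manual deserialization
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            backend.write(pickle.dumps(state))  # (1) manual serialization
    return state["acc"]


if __name__ == "__main__":
    backend = FileBackend(os.path.join(tempfile.mkdtemp(), "ckpt.bin"))
    try:
        run(backend, total_steps=10, checkpoint_every=3, fail_at=7)
    except RuntimeError:
        pass  # first execution crashed; a checkpoint is left on disk
    # Second execution resumes from the last checkpoint instead of step 0.
    print(run(backend, total_steps=10, checkpoint_every=3))
```

In the approach the abstract describes, the points marked (1) and (2) would instead be expressed through directives and carried out by the intermediate library, and (3) would be absorbed by that library rather than by the application; the sketch does not attempt to show that directive syntax.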

Subjects: Programming (Mathematics), Software engineering

Degree: Màster Universitari en Innovació i Recerca en Informàtica (Master in Innovation and Research in Informatics), 2012 plan

URI: http://hdl.handle.net/2117/100567

Collections

Official master's degrees - Master in Innovation and Research in Informatics - MIRI [454]

Files	Description	Size	Format
123240.pdf		2.452 MB	PDF

UPCommons. UPC's open knowledge portal
