A checkpoint/restart directive-based approach
dc.contributor | Beltran Querol, Vicenç |
dc.contributor | Ayguadé Parra, Eduard |
dc.contributor.author | Maroñas Bravo, Marcos |
dc.contributor.other | Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors |
dc.date.accessioned | 2017-02-03T15:43:43Z |
dc.date.available | 2017-02-03T15:43:43Z |
dc.date.issued | 2017 |
dc.identifier.uri | http://hdl.handle.net/2117/100567 |
dc.description.abstract | Exascale platforms require programming models that incorporate support for resilience, since the huge number of components they are expected to have will increase the number of errors. Checkpoint/restart is a widely used resilience technique due to its robustness and low overhead compared to other techniques. Several solutions already implement this technique, such as FTI and SCR, which focus mainly on providing advanced I/O capabilities to minimize checkpoint/restart time. However, application developers are still in charge of: (1) manually serializing and deserializing the application state using a low-level API; (2) modifying the natural flow of the application depending on whether the current execution is a restart or not; and (3) reimplementing their checkpoint/restart code whenever they have to change the backend library. We present a new directive-based approach to performing application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, like those of OpenMP, that allows users to easily specify the state of the application that has to be saved and restored, leaving the tedious and error-prone serialization and deserialization activities to our intermediate library, which relies on a backend library (FTI/SCR) to perform scalable and efficient I/O operations. Our results, including several benchmarks and two large applications, reveal no extra overhead compared to the direct use of the FTI/SCR checkpoint/restart libraries, while significantly reducing the effort required from application developers. |
dc.language.iso | eng |
dc.publisher | Universitat Politècnica de Catalunya |
dc.subject | UPC subject areas::Computer science |
dc.subject.lcsh | Programming (Mathematics) |
dc.subject.lcsh | Software engineering |
dc.subject.other | checkpoint |
dc.subject.other | resiliència |
dc.subject.other | Tolerància a fallades |
dc.subject.other | models de programació |
dc.subject.other | resiliency |
dc.subject.other | fault tolerance |
dc.subject.other | programming models |
dc.title | A checkpoint/restart directive-based approach |
dc.title.alternative | OmpSs persistent checkpoint/restart: a directive-based approach |
dc.type | Master thesis |
dc.subject.lemac | Programació (Matemàtica) |
dc.subject.lemac | Enginyeria de programari |
dc.identifier.slug | 123240 |
dc.rights.access | Open Access |
dc.date.updated | 2017-02-02T15:42:35Z |
dc.audience.educationlevel | Master's |
dc.audience.mediator | Facultat d'Informàtica de Barcelona |
dc.audience.degree | MÀSTER UNIVERSITARI EN INNOVACIÓ I RECERCA EN INFORMÀTICA (Pla 2012) |
dc.contributor.covenantee | Barcelona Supercomputing Center |