dc.contributor: Beltran Querol, Vicenç
dc.contributor: Ayguadé Parra, Eduard
dc.contributor.author: Maroñas Bravo, Marcos
dc.contributor.other: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.date.accessioned: 2017-02-03T15:43:43Z
dc.date.available: 2017-02-03T15:43:43Z
dc.date.issued: 2017
dc.identifier.uri: http://hdl.handle.net/2117/100567
dc.description.abstract: Exascale platforms require programming models that support resilience, because the huge number of components these platforms are expected to contain will increase the number of errors. Checkpoint/restart is a widely used resilience technique due to its robustness and low overhead compared to other techniques. Several solutions already implement this technique, such as FTI or SCR, and they focus mainly on providing advanced I/O capabilities to minimize checkpoint/restart time. However, application developers are still in charge of: (1) manually serializing and deserializing the application state using a low-level API; (2) modifying the natural flow of the application depending on whether the current execution is a restart or not; and (3) reimplementing their checkpoint/restart code whenever they have to change the backend library. We present a new directive-based approach to performing application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to those of OpenMP, that allows users to easily specify the application state that has to be saved and restored. The tedious and error-prone serialization and deserialization work is left to our intermediate library, which relies on a backend library (FTI/SCR) to perform scalable and efficient I/O operations. Our results, covering several benchmarks and two large applications, show no extra overhead compared to the direct use of the FTI/SCR checkpoint/restart libraries, while significantly reducing the effort required from application developers.
dc.language.iso: eng
dc.publisher: Universitat Politècnica de Catalunya
dc.subject: Àrees temàtiques de la UPC::Informàtica / UPC subject areas::Computer science
dc.subject.lcsh: Programming (Mathematics)
dc.subject.lcsh: Software engineering
dc.subject.other: checkpoint
dc.subject.other: resiliència / resilience
dc.subject.other: Tolerància a fallades / fault tolerance
dc.subject.other: models de programació / programming models
dc.subject.other: resiliency
dc.subject.other: fault tolerance
dc.subject.other: programming models
dc.title: A checkpoint/restart directive-based approach
dc.title.alternative: OmpSs persistent checkpoint/restart: a directive-based approach
dc.type: Master thesis
dc.subject.lemac: Programació (Matemàtica) / Programming (Mathematics)
dc.subject.lemac: Enginyeria de programari / Software engineering
dc.identifier.slug: 123240
dc.rights.access: Open Access
dc.date.updated: 2017-02-02T15:42:35Z
dc.audience.educationlevel: Màster / Master's
dc.audience.mediator: Facultat d'Informàtica de Barcelona
dc.audience.degree: MÀSTER UNIVERSITARI EN INNOVACIÓ I RECERCA EN INFORMÀTICA (Pla 2012) / Master's degree in Innovation and Research in Informatics (2012 curriculum)
dc.contributor.covenantee: Barcelona Supercomputing Center
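
The burdens listed in the abstract, especially (1) manual serialization and (2) branching on whether a run is a restart, can be illustrated with a minimal sketch. It assumes Python, which has no compiler directives, so a hypothetical `Checkpointer` object stands in for the role the abstract gives to the directives plus the intermediate library: the application names its state in one place and the helper handles serialization and restart detection. `pickle` stands in for the FTI/SCR backend. The names `Checkpointer`, `restore`, `store`, `run` and `crash_at` are illustrative only; they are not the thesis's interface, nor FTI's or SCR's.

```python
import os
import pickle
import tempfile


class Checkpointer:
    """Hypothetical helper (not the thesis library, FTI, or SCR): it owns
    serialization and decides whether this execution is a restart."""

    def __init__(self, path):
        self.path = path

    def restore(self, default_state):
        """Return the saved state if a checkpoint file exists, otherwise
        default_state, so fresh starts and restarts share one code path."""
        if not os.path.exists(self.path):
            return default_state
        with open(self.path, "rb") as f:
            return pickle.load(f)

    def store(self, state):
        """Write state to a temporary file, then rename it over the old
        checkpoint so a crash mid-write never leaves a torn checkpoint."""
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, self.path)
        except BaseException:
            os.unlink(tmp)  # the rename did not happen, so tmp still exists
            raise


def run(ckpt, total_steps, interval, crash_at=None):
    """Toy iterative solver. crash_at (hypothetical) simulates a failure by
    raising when the loop reaches that step, before executing it."""
    state = ckpt.restore({"step": 0, "data": [0.0] * 4})
    while state["step"] < total_steps:
        if state["step"] == crash_at:
            raise RuntimeError("simulated failure")
        state["data"] = [x + 1.0 for x in state["data"]]
        state["step"] += 1
        if state["step"] % interval == 0:
            ckpt.store(state)  # checkpoint every `interval` steps
    return state
```

For example, `run(ck, 10, 3, crash_at=7)` fails after checkpointing at step 6; calling `run(ck, 10, 3)` again resumes from step 6 rather than step 0 and finishes at step 10. Because the solver never checks whether it is restarting and never touches `pickle` directly, replacing the backend would only change `Checkpointer`, which loosely mirrors the portability argument in point (3).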