dc.contributor.author: Maroñas, Marcos
dc.contributor.author: Mateo, Sergi
dc.contributor.author: Beltran Querol, Vicenç
dc.contributor.author: Ayguadé Parra, Eduard
dc.contributor.other: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.contributor.other: Barcelona Supercomputing Center
dc.date.accessioned: 2017-09-22T13:52:38Z
dc.date.issued: 2017
dc.identifier.citation: Maroñas, M., Mateo, S., Beltran, V., Ayguade, E. A directive-based approach to perform persistent checkpoint/restart. A: International Conference on High Performance Computing and Simulation. "HPCS 2017: 2017 International Conference on High Performance Computing & Simulation: proceedings: 17-21 July 2017: Genoa, Italy". Genoa: Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 442-451.
dc.identifier.isbn: 978-1-5386-3249-9
dc.identifier.uri: http://hdl.handle.net/2117/107925
dc.description.abstract: Exascale platforms require support for resilience capabilities due to increasing numbers of components and associated error rates. In this paper, we present a new directive-based approach to perform application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to OpenMP, that allows users to easily specify the state of the application that has to be saved and restored. This leaves the tedious and error-prone serialization and deserialization activities to our library, which relies on SCR/FTI to perform scalable and efficient I/O operations. Our results, based on several benchmarks and two large applications, reveal no additional overhead compared to the direct use of the FTI and SCR checkpoint/restart libraries. In addition, our portable approach significantly improves programmability, reducing the number of lines of code required to perform checkpoint/restart by an average of ~82% and ~94% for FTI and SCR, respectively.
dc.description.sponsorship: The research leading to these results has received funding from the European Community Seventh Framework Programme (FP7/2007-2013) via the DEEP-ER project under Grant Agreement number 610476. This work has also been supported by the Spanish Ministry of Science and Innovation (contract TIN2012-34557) and by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272).
dc.format.extent: 10 p.
dc.language.iso: eng
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.subject: Àrees temàtiques de la UPC::Informàtica
dc.subject.lcsh: Fault-tolerant computing
dc.subject.lcsh: High performance computing
dc.subject.other: Libraries
dc.subject.other: Fault tolerant systems
dc.subject.other: Checkpointing
dc.subject.other: Redundancy
dc.subject.other: Tools
dc.subject.other: Resilience
dc.subject.other: Checkpoint/restart
dc.subject.other: Resiliency
dc.subject.other: Fault tolerance
dc.subject.other: Exascale
dc.subject.other: Programmability
dc.subject.other: Programming models
dc.title: A directive-based approach to perform persistent checkpoint/restart
dc.type: Conference report
dc.subject.lemac: Tolerància als errors (Informàtica)
dc.subject.lemac: Càlcul intensiu (Informàtica)
dc.contributor.group: Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
dc.identifier.doi: 10.1109/HPCS.2017.72
dc.description.peerreviewed: Peer Reviewed
dc.relation.publisherversion: http://ieeexplore.ieee.org/abstract/document/8035111/
dc.rights.access: Restricted access - publisher's policy
local.identifier.drac: 21548835
dc.description.version: Postprint (published version)
dc.date.lift: 10000-01-01
local.citation.author: Maroñas, M.; Mateo, S.; Beltran, V.; Ayguade, E.
local.citation.contributor: International Conference on High Performance Computing and Simulation
local.citation.pubplace: Genoa
local.citation.publicationName: HPCS 2017: 2017 International Conference on High Performance Computing & Simulation: proceedings: 17-21 July 2017: Genoa, Italy
local.citation.startingPage: 442
local.citation.endingPage: 451

