A directive-based approach to perform persistent checkpoint/restart
dc.contributor.author | Maroñas, Marcos |
dc.contributor.author | Mateo, Sergi |
dc.contributor.author | Beltran Querol, Vicenç |
dc.contributor.author | Ayguadé Parra, Eduard |
dc.contributor.other | Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors |
dc.contributor.other | Barcelona Supercomputing Center |
dc.date.accessioned | 2017-09-22T13:52:38Z |
dc.date.issued | 2017 |
dc.identifier.citation | Maroñas, M., Mateo, S., Beltran, V., Ayguade, E. A directive-based approach to perform persistent checkpoint/restart. A: International Conference on High Performance Computing and Simulation. "HPCS 2017: 2017 International Conference on High Performance Computing & Simulation: proceedings: 17-21 July 2017: Genoa, Italy". Genoa: Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 442-451. |
dc.identifier.isbn | 978-1-5386-3249-9 |
dc.identifier.uri | http://hdl.handle.net/2117/107925 |
dc.description.abstract | Exascale platforms require support for resilience capabilities due to increasing numbers of components and associated error rates. In this paper, we present a new directive-based approach to perform application-level checkpoint/restart in a simplified and portable way. We propose a solution based on compiler directives, similar to OpenMP, that allows users to easily specify the state of the application that has to be saved and restored. This leaves the tedious and error-prone serialization and deserialization activities to our library, which relies on SCR/FTI to perform scalable and efficient I/O operations. Our results, based on several benchmarks and two large applications, reveal no additional overhead compared to the direct use of the FTI and SCR checkpoint/restart libraries. In addition, our portable approach significantly improves programmability, reducing the number of code lines required to perform checkpoint/restart by an average of ~82% and ~94% for FTI and SCR, respectively. |
dc.description.sponsorship | The research leading to these results has received funding from the European Community Seventh Framework Programme (FP7/2007-2013) via the DEEP-ER project under Grant Agreement number 610476. This work has also been supported by the Spanish Ministry of Science and Innovation (contract TIN2012-34557) and by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). |
dc.format.extent | 10 p. |
dc.language.iso | eng |
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) |
dc.subject | Àrees temàtiques de la UPC::Informàtica |
dc.subject.lcsh | Fault-tolerant computing |
dc.subject.lcsh | High performance computing |
dc.subject.other | Libraries |
dc.subject.other | Fault tolerant systems |
dc.subject.other | Checkpointing |
dc.subject.other | Redundancy |
dc.subject.other | Tools |
dc.subject.other | Resilience |
dc.subject.other | Checkpoint/restart |
dc.subject.other | Resiliency |
dc.subject.other | Fault tolerance |
dc.subject.other | Exascale |
dc.subject.other | Programmability |
dc.subject.other | Programming models |
dc.title | A directive-based approach to perform persistent checkpoint/restart |
dc.type | Conference report |
dc.subject.lemac | Tolerància als errors (Informàtica) |
dc.subject.lemac | Càlcul intensiu (Informàtica) |
dc.contributor.group | Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions |
dc.identifier.doi | 10.1109/HPCS.2017.72 |
dc.description.peerreviewed | Peer Reviewed |
dc.relation.publisherversion | http://ieeexplore.ieee.org/abstract/document/8035111/ |
dc.rights.access | Restricted access - publisher's policy |
local.identifier.drac | 21548835 |
dc.description.version | Postprint (published version) |
dc.date.lift | 10000-01-01 |
local.citation.author | Maroñas, M.; Mateo, S.; Beltran, V.; Ayguade, E. |
local.citation.contributor | International Conference on High Performance Computing and Simulation |
local.citation.pubplace | Genoa |
local.citation.publicationName | HPCS 2017: 2017 International Conference on High Performance Computing & Simulation: proceedings: 17-21 July 2017: Genoa, Italy |
local.citation.startingPage | 442 |
local.citation.endingPage | 451 |