Unified fault-tolerance framework for hybrid task-parallel message-passing applications

Subasi, Omer; Martsinkevich, Tatiana; Zyulkyarov, Ferad; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Cappello, Franck

doi:10.1177/1094342016669416

dc.contributor.author	Subasi, Omer
dc.contributor.author	Martsinkevich, Tatiana
dc.contributor.author	Zyulkyarov, Ferad
dc.contributor.author	Unsal, Osman Sabri
dc.contributor.author	Labarta Mancho, Jesús José
dc.contributor.author	Cappello, Franck
dc.contributor.other	Barcelona Supercomputing Center
dc.date.accessioned	2016-10-19T13:17:56Z
dc.date.available	2017-10-03T00:30:37Z
dc.date.issued	2016-09-26
dc.identifier.citation	Subasi, Omer [et al.]. Unified fault-tolerance framework for hybrid task-parallel message-passing applications. "International Journal of High Performance Computing Applications", 26 September 2016.
dc.identifier.issn	1094-3420
dc.identifier.uri	http://hdl.handle.net/2117/90874
dc.description.abstract	We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.
dc.description.sponsorship	The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and TIN2015-65316-P.
dc.format.extent	17 p.
dc.language.iso	eng
dc.publisher	SAGE Publications
dc.subject	UPC subject areas::Electronic engineering
dc.subject.lcsh	Fault-tolerant computing
dc.subject.lcsh	Mathematical modeling and computation
dc.subject.other	Fault-tolerance
dc.subject.other	Message logging
dc.subject.other	Checkpoint/restart
dc.subject.other	Task-based programming model
dc.subject.other	Optimal checkpointing interval
dc.title	Unified fault-tolerance framework for hybrid task-parallel message-passing applications
dc.type	Article
dc.subject.lemac	Mathematical models
dc.subject.lemac	Computer programming
dc.identifier.doi	10.1177/1094342016669416
dc.description.peerreviewed	Peer Reviewed
dc.relation.publisherversion	http://hpc.sagepub.com/content/early/2016/09/26/1094342016669416.abstract
dc.rights.access	Open Access
local.identifier.drac	23515513
dc.description.version	Postprint (author's final draft)
dc.relation.projectid	info:eu-repo/grantAgreement/MINECO//TIN2015-65316-P/ES/COMPUTACION DE ALTAS PRESTACIONES VII/
dc.relation.projectid	info:eu-repo/grantAgreement/EC/FP7/610402/EU/Mont-Blanc 2, European scalable and power efficient HPC platform based on low-power embedded technology/MONT-BLANC 2
local.citation.publicationName	International Journal of High Performance Computing Applications
local.citation.volume	32
local.citation.number	5
local.citation.startingPage	641
local.citation.endingPage	657

Files in this item

Name: Unified fault-tolerance framework ...
Size: 801.3 KB
Format: PDF

This item appears in the following collections

Journal articles [318]
