Unified fault-tolerance framework for hybrid task-parallel message-passing applications
dc.contributor.author | Subasi, Omer |
dc.contributor.author | Martsinkevich, Tatiana |
dc.contributor.author | Zyulkyarov, Ferad |
dc.contributor.author | Unsal, Osman Sabri |
dc.contributor.author | Labarta Mancho, Jesús José |
dc.contributor.author | Cappello, Franck |
dc.contributor.other | Barcelona Supercomputing Center |
dc.date.accessioned | 2016-10-19T13:17:56Z |
dc.date.available | 2017-10-03T00:30:37Z |
dc.date.issued | 2016-09-26 |
dc.identifier.citation | Subasi, Omer [et al.]. Unified fault-tolerance framework for hybrid task-parallel message-passing applications. "International Journal of High Performance Computing Applications", 26 September 2016. |
dc.identifier.issn | 1094-3420 |
dc.identifier.uri | http://hdl.handle.net/2117/90874 |
dc.description.abstract | We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any Message Passing Interface (MPI) calls inside the task. In our experiments, we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Second, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme. |
dc.description.sponsorship | The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-Blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and TIN2015-65316-P. |
dc.format.extent | 17 p. |
dc.language.iso | eng |
dc.publisher | SAGE Publications |
dc.subject | UPC subject areas::Electronic engineering |
dc.subject.lcsh | Fault-tolerant computing |
dc.subject.lcsh | Mathematical modeling and computation |
dc.subject.other | Fault-tolerance |
dc.subject.other | Message logging |
dc.subject.other | Checkpoint/restart |
dc.subject.other | Task-based programming model |
dc.subject.other | Optimal checkpointing interval |
dc.title | Unified fault-tolerance framework for hybrid task-parallel message-passing applications |
dc.type | Article |
dc.subject.lemac | Mathematical models |
dc.subject.lemac | Computer programming |
dc.identifier.doi | 10.1177/1094342016669416 |
dc.description.peerreviewed | Peer Reviewed |
dc.relation.publisherversion | http://hpc.sagepub.com/content/early/2016/09/26/1094342016669416.abstract |
dc.rights.access | Open Access |
local.identifier.drac | 23515513 |
dc.description.version | Postprint (author's final draft) |
dc.relation.projectid | info:eu-repo/grantAgreement/MINECO//TIN2015-65316-P/ES/COMPUTACION DE ALTAS PRESTACIONES VII/ |
dc.relation.projectid | info:eu-repo/grantAgreement/EC/FP7/610402/EU/Mont-Blanc 2, European scalable and power efficient HPC platform based on low-power embedded technology/MONT-BLANC 2 |
local.citation.publicationName | International Journal of High Performance Computing Applications |
local.citation.volume | 32 |
local.citation.number | 5 |
local.citation.startingPage | 641 |
local.citation.endingPage | 657 |
This item appears in the following collections
- Journal articles [318]