Managing failures in task-based parallel workflows in distributed computing environments
DOI: 10.1007/978-3-030-57675-2_26
Cite as:
hdl:2117/328312
Document type: Book chapter
Publication date: 2020
Publisher: Springer, Cham
Access conditions: Open access
All rights reserved. This work is protected by the corresponding intellectual and industrial
property rights. Without prejudice to existing legal exemptions, its reproduction, distribution,
public communication, or transformation without the authorization of the rights holder is prohibited.
Project: BioExcel-2 - BioExcel Centre of Excellence for Computational Biomolecular Research (EC-H2020-823830)
BioExcel - Centre of Excellence for Biomolecular Research (EC-H2020-675728)
Abstract
Current scientific workflows are large and complex. They typically run thousands of simulations whose results, combined with search and data analytics algorithms used to infer new knowledge, generate very large amounts of data. To this end, workflows comprise many tasks, some of which may fail. Most work on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (retrying or resubmitting the failed computation on other resources, etc.). However, some failures are caused by the application itself (corrupted data, algorithms that do not converge under certain conditions, etc.), and these fault-tolerance mechanisms are not sufficient to complete the workflow successfully. In these cases, developers have to add code to their applications to prevent and manage possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
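To make the idea of a per-task failure policy concrete, the following is a minimal Python sketch. It is not the interface proposed in the chapter: the `task` decorator, the policy names (`RETRY`, `IGNORE`, `CANCEL_SUCCESSORS`, `FAIL`), and the `CANCELLED` marker are all hypothetical, and the tasks run sequentially rather than on a distributed runtime. It only illustrates how a declared policy could replace hand-written error handling in application code.

```python
import functools

# Hypothetical policy names, chosen for illustration only.
RETRY, IGNORE, CANCEL_SUCCESSORS, FAIL = "RETRY", "IGNORE", "CANCEL_SUCCESSORS", "FAIL"


class _Cancelled:
    """Marker standing in for the output of a task that was cancelled."""

    def __repr__(self):
        return "CANCELLED"


CANCELLED = _Cancelled()


def task(on_failure=FAIL, retries=2, default=None):
    """Wrap a function so that exceptions are handled by a declared policy."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            # A task that depends on a cancelled result is cancelled too.
            if any(arg is CANCELLED for arg in args):
                return CANCELLED
            attempts = retries + 1 if on_failure == RETRY else 1
            for attempt in range(attempts):
                try:
                    return fn(*args)
                except Exception:
                    if attempt + 1 < attempts:
                        continue  # RETRY: try the same task again
                    if on_failure == IGNORE:
                        return default  # continue with a placeholder value
                    if on_failure == CANCEL_SUCCESSORS:
                        return CANCELLED  # skip everything downstream
                    raise  # FAIL, or RETRY with no attempts left

        return wrapper

    return decorator


@task(on_failure=CANCEL_SUCCESSORS)
def simulate(x):
    # Application-level failure: the "algorithm" does not converge for x < 0.
    if x < 0:
        raise ValueError("simulation did not converge")
    return x * 2


@task(on_failure=IGNORE, default=0)
def analyse(y):
    return y + 1


calls = {"n": 0}


@task(on_failure=RETRY, retries=2)
def flaky():
    # Fails twice, then succeeds on the third attempt.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


results = [analyse(simulate(x)) for x in (1, -1, 3)]
completed = [r for r in results if r is not CANCELLED]
flaky_result = flaky()
```

In this sketch the failing simulation for `x = -1` cancels only its own downstream analysis, while the other two branches of the workflow complete normally.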
Citation: Ejarque, J. [et al.]. Managing failures in task-based parallel workflows in distributed computing environments. In: Malawski, M.; Rzadca, K. "Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science, vol 12247". Springer, Cham, 2020, p. 411-425.
ISBN: 978-3-030-57674-5
978-3-030-57675-2
Publisher version: https://link.springer.com/chapter/10.1007/978-3-030-57675-2_26
Files | Description | Size | Format |
---|---|---|---|
Europar_Failure_mangement_CR-1.pdf | | 376.7 KB | PDF |