Resilience design patterns: a structured approach to resilience at extreme scale

Hukerikar, Saurabh

Visualitza/Obre

Resilience_design_patterns.pdf (159,9Kb)

Veure estadístiques d'ús d'UPCommons

Estadístiques de LA Referencia / Recolecta

Cita com:

Mostra el registre d'ítem complet

Hukerikar, Saurabh

Tipus de documentText en actes de congrés

Data publicació2017-09-10

EditorBarcelona Supercomputing Center

Condicions d'accésAccés obert

Attribution-NonCommercial-NoDerivs 3.0 Spain

Llevat que s'hi indiqui el contrari, els continguts d'aquesta obra estan subjectes a la llicència de Creative Commons : Reconeixement-NoComercial-SenseObraDerivada 3.0 Espanya

Abstract

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance & power efficiency. Few of the current approaches are portable to newer architectures and software environments that will be deployed on future systems. In this talk, I will present a structured approach to the management of HPC resilience using the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. Each established solution is described in the form of a pattern that addresses concrete problems. We have developed a complete catalog of resilience design patterns, which provides designers with a collection of such reusable design elements. We have also defined a framework that enhances a designer's understanding of the important constraints and opportunities for the design patterns to be implemented and deployed at various layers of the system stack. This design framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also enables optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems.

URIhttp://hdl.handle.net/2117/108819

Col·leccions

Severo Ochoa Research Seminars at BSC - 3rd Severo Ochoa Research Seminar Lectures at BSC, Barcelona, 2016-2017 [20]

Veure estadístiques d'ús d'UPCommons

Mostra el registre d'ítem complet

Fitxers	Descripció	Mida	Format	Visualitza
Resilience_design_patterns.pdf		159,9Kb	PDF	Visualitza/Obre

UPCommons. Portal del coneixement obert de la UPC

Resilience design patterns: a structured approach to resilience at extreme scale

Visualitza/Obre

Explora