Towards resilient EU HPC systems: A blueprint

Document type: Research report
Defense date: 2020-04
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial
property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public
communication or transformation of this work are prohibited without permission of the copyright holder.
Project: CLERECO - Cross-Layer Early Reliability Evaluation for the Computing cOntinuum (EC-FP7-611404)
RECIPE - REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems (EC-H2020-801137)
EUROLAB4HPC2 - Consolidation of European Research Excellence in Exascale HPC Systems (EC-H2020-800962)
EPI SGA1 - SGA1 (Specific Grant Agreement 1) OF THE EUROPEAN PROCESSOR INITIATIVE (EPI) (EC-H2020-826647)
ExaNoDe - European Exascale Processor Memory Node Design (EC-H2020-671578)
LEGaTO - Low Energy Toolset for Heterogeneous Computing (EC-H2020-780681)
Mont-Blanc 2020 - Mont-Blanc 2020, European scalable, modular and power efficient HPC processor (EC-H2020-779877)
Abstract
This document aims to spearhead a Europe-wide discussion on HPC system resilience and to help the European HPC community define best practices for resilience. We analyse a wide range of state-of-the-art resilience mechanisms and recommend the most effective approaches to employ in large-scale HPC systems. Our guidelines will be useful in allocating available resources and in guiding researchers and research funding towards the enhancement of resilience approaches with the highest priority and utility. Although our work is focused on the needs of next-generation HPC systems in Europe, the principles and evaluations are applicable globally.
Citation: Radojkovic, P. [et al.]. Towards resilient EU HPC systems: A blueprint. 2020.
URL other repository: https://resilienthpc.eu/
Collections
- CAP - Grup de Computació d'Altes Prestacions - Research reports [58]
- Doctorat en Arquitectura de Computadors - Research reports [5]
- VIRTUOS - Virtualisation and Operating Systems - Research reports [2]
- Departament d'Arquitectura de Computadors - Research reports [177]
- Computer Sciences - Research reports [15]
Files | Description | Size | Format | View |
---|---|---|---|---|
Blueprint2020__ ... silient-EU-HPC-Systems.pdf | | 790,5Kb | | View/Open |