
Universitat Politècnica de Catalunya


UPCommons. Global access to UPC knowledge


Towards Ad Hoc Recovery for Soft Errors

Losada, Nuria
Bautista-Gomez, Leonardo
Keller, Kai
Unsal, Osman
Document type: Conference lecture
Defense date: 2018-12-06
Publisher: IEEE
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder
Projects:
  • DURO - DURO: Deep-memory Ubiquity, Reliability and Optimization (EC-H2020-708566)
  • BES-2014-068066 (MINECO-BES-2014-068066)
Abstract
The coming exascale era is a great opportunity for high performance computing (HPC) applications. However, the high failure rates expected on these systems will jeopardize the successful completion of application runs. Bit-flip errors in dynamic random access memory (DRAM) account for a noticeable share of the failures in supercomputers. Hardware mechanisms such as error correcting codes (ECC) can detect and correct single-bit errors and can detect some multi-bit errors, while others go undetected. Unfortunately, a detected multi-bit error will usually force the application to terminate and lead to a global restart. Thus, software-level strategies are needed to tolerate these types of faults more efficiently and to avoid a global restart. In this work, we extend the FTI checkpointing library to facilitate the implementation of custom recovery strategies for MPI applications, minimizing the overhead introduced when coping with soft errors. The new functionalities are evaluated by implementing local forward recovery on three HPC benchmarks with different reliability requirements. Our results demonstrate a reduction in recovery times of up to 14%.
Citation: Losada, N. [et al.]. Towards Ad Hoc Recovery for Soft Errors. In: "2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)". IEEE, 2018, p. 1-10.
URI: http://hdl.handle.net/2117/133802
DOI: 10.1109/FTXS.2018.00004
ISBN: 978-1-7281-0222-1
Publisher version: https://ieeexplore.ieee.org/document/8564482
Collections
  • Computer Sciences - Ponències/Comunicacions de congressos [497]
Files: Towards Ad Hoc Recovery For Soft Errors.pdf (363.6 KB, PDF)
