    • Application-level differential checkpointing for HPC applications with dynamic datasets

      Keller, Kai Rasmus; Bautista Gomez, Leonardo (Institute of Electrical and Electronics Engineers (IEEE), 2019)
      Conference proceedings paper
      Open access
      High-performance computing (HPC) requires resilience techniques such as checkpointing in order to tolerate failures in supercomputers. As the number of nodes and memory in supercomputers keeps on increasing, the size of ...
    • Checkpoint restart support for heterogeneous HPC applications 

      Parasyris, Konstantinos; Keller, Kai Rasmus; Bautista Gomez, Leonardo; Unsal, Osman Sabri (Institute of Electrical and Electronics Engineers (IEEE), 2020)
      Conference proceedings paper
      Open access
      As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increasing number of cores as well as the increased complexity of modern heterogeneous systems result in substantial decrease of ...
    • Design and study of elastic recovery in HPC applications 

      Keller, Kai Rasmus; Parasyris, Konstantinos; Bautista Gomez, Leonardo (Institute of Electrical and Electronics Engineers (IEEE), 2020)
      Conference proceedings paper
      Open access
      The efficient utilization of current supercomputing systems with deep storage hierarchies demands scientific applications that are capable of leveraging such heterogeneous hardware. Fault tolerance, and checkpointing in ...
    • Extending the OpenCHK Model with advanced checkpoint features 

      Maroñas Bravo, Marcos; Mateo Bellido, Sergi; Keller, Kai Rasmus; Bautista Gomez, Leonardo; Ayguadé Parra, Eduard; Beltran Querol, Vicenç (Elsevier, 2020-11)
      Article
      Open access
      One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of faults. Application-level checkpoint/restart (CR) methods provide the best trade-off between productivity, robustness, and ...
    • Resilience for large ensemble computations 

      Keller, Kai Rasmus (Universitat Politècnica de Catalunya, 2022-07-01)
      Thesis
      Open access
      With the increasing power of supercomputers, ever more detailed models of physical systems can be simulated, and ever larger problem sizes can be considered for any kind of numerical system. During the last twenty years ...
    • Towards zero-waste recovery and zero-overhead checkpointing in ensemble data assimilation 

      Keller, Kai Rasmus; Cristal Kestelman, Adrián; Bautista Gomez, Leonardo (Institute of Electrical and Electronics Engineers (IEEE), 2021)
      Conference proceedings paper
      Open access
      Ensemble data assimilation is a powerful tool for increasing the accuracy of climatological states. It is based on combining observations with the results from numerical model simulations. The method comprises two steps, ...