Now showing items 1-8 of 8

    • A runtime heuristic to selectively replicate tasks for application-specific reliability targets 

      Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman Sabri; Labarta Mancho, Jesús José (Institute of Electrical and Electronics Engineers (IEEE), 2016)
      Conference report
      Open Access
      In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require ...
    • CRC-based memory reliability for task-parallel HPC applications 

      Subasi, Omer; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Yalcin, Gulay; Cristal Kestelman, Adrián (Institute of Electrical and Electronics Engineers (IEEE), 2016)
      Conference report
      Restricted access - publisher's policy
      Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and the number of memory devices in Exascale ...
    • Designing and modelling selective replication for fault-tolerant HPC applications 

      Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman Sabri; Labarta Mancho, Jesús José (Institute of Electrical and Electronics Engineers (IEEE), 2017)
      Conference report
      Open Access
      Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However ...
    • Exploring the capabilities of support vector machines in detecting silent data corruptions 

      Subasi, Omer; Di, Sheng; Bautista-Gomez, Leonardo; Balaprakash, Prasanna; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Cristal Kestelman, Adrián; Krishnamoorthy, Sriram; Cappello, Franck (Elsevier, 2018-09)
      Article
      Open Access
      As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions ...
    • Programmer-directed partial redundancy for resilient HPC 

      Subasi, Omer; Arias Moreno, Francisco Javier; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Cristal Kestelman, Adrián (Association for Computing Machinery (ACM), 2015)
      Conference report
      Restricted access - publisher's policy
      In this work we propose partial task replication and check-pointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. As the complete replication of all application tasks can be prohibitive ...
    • Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications 

      Subasi, Omer (Universitat Politècnica de Catalunya, 2016-10-27)
      Doctoral thesis
      Open Access
      As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with rates on the order of hours or days. Furthermore, some studies for future ...
    • Spatial support vector regression to detect silent errors in the exascale era 

      Subasi, Omer; Di, Sheng; Bautista Gómez, Leonardo; Balaprakash, Prasanna; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Cristal Kestelman, Adrián; Cappello, Franck (Institute of Electrical and Electronics Engineers (IEEE), 2016)
      Conference report
      Open Access
      As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions ...
    • Unified fault-tolerance framework for hybrid task-parallel message-passing applications 

      Subasi, Omer; Martsinkevich, Tatiana; Zyulkyarov, Ferad; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Cappello, Franck (SAGE Publications, 2016-09-26)
      Article
      Open Access
      We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the ...