Designing and modelling selective replication for fault-tolerant HPC applications
Document type: Conference report
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Open Access
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs. However, few studies address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user.
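The general idea of selective replication can be illustrated with a minimal sketch. This is not the authors' runtime or selection model; it is a generic illustration under stated assumptions. The `Task` fields `risk` and `cost`, the greedy budget-based selection, the use of `RuntimeError` as a stand-in for a fail-stop error, and the tie-breaking third execution are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Task:
    name: str
    run: Callable[[], int]  # the computation to execute
    risk: float             # hypothetical estimate of how failure-prone the task is
    cost: float             # hypothetical resource cost of running one extra copy


def select_for_replication(tasks: List[Task], budget: float) -> Set[str]:
    """Greedy choice (an assumption): replicate the riskiest tasks that fit the budget."""
    chosen: Set[str] = set()
    spent = 0.0
    for task in sorted(tasks, key=lambda t: t.risk, reverse=True):
        if spent + task.cost <= budget:
            chosen.add(task.name)
            spent += task.cost
    return chosen


def run_once(task: Task, retries: int = 3) -> int:
    """Re-execute after an exception; RuntimeError stands in for a fail-stop error."""
    for _ in range(retries):
        try:
            return task.run()
        except RuntimeError:
            continue
    raise RuntimeError(f"task {task.name} failed {retries} times")


def execute(task: Task, replicated: Set[str]) -> int:
    """Run a task; if it is replicated, compare copies to detect silent corruption."""
    first = run_once(task)
    if task.name not in replicated:
        return first  # unprotected against SDCs: the result is trusted as-is
    second = run_once(task)
    if first == second:
        return first
    # The two copies disagree, so a third execution breaks the tie.
    third = run_once(task)
    if third == first or third == second:
        return third
    raise RuntimeError(f"no majority result for task {task.name}")
```

In this sketch, non-replicated tasks are still covered against fail-stop errors through re-execution, while only the selected tasks pay the extra cost needed to detect SDCs. That is the trade-off between reliability and resource cost that the abstract describes.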
Citation: Subasi, O., Yalcin, G., Zyulkyarov, F., Unsal, O., Labarta, J. Designing and modelling selective replication for fault-tolerant HPC applications. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. "2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing: 14-17 May 2017, Madrid, Spain: proceedings". Madrid: Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 452-457.
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work is prohibited without permission of the copyright holder.