Designing and modelling selective replication for fault-tolerant HPC applications

Document type: Conference report
Defense date: 2017
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Open Access
Abstract
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs, but few address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user.
Citation: Subasi, O., Yalcin, G., Zyulkyarov, F., Unsal, O., Labarta, J. Designing and modelling selective replication for fault-tolerant HPC applications. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. "2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing: 14-17 May 2017, Madrid, Spain: proceedings". Madrid: Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 452-457.
ISBN: 978-1-5090-6610-0
Publisher version: http://ieeexplore.ieee.org/document/7973731/
Files | Description | Size | Format |
---|---|---|---|
Designing+and+Modelling.pdf | | 352,7Kb | PDF |
All rights reserved. This work is protected by the corresponding intellectual and industrial
property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public
communication or transformation of this work are prohibited without permission of the copyright holder.