Show simple item record

dc.contributor.author: Perarnau, Swann
dc.contributor.author: Bautista-Gomez, Leonardo
dc.contributor.other: Barcelona Supercomputing Center
dc.date.accessioned: 2017-04-11T08:11:19Z
dc.date.available: 2017-04-11T08:11:19Z
dc.date.issued: 2017-04-06
dc.identifier.citation: Perarnau, S.; Bautista-Gomez, L. Monitoring strategies for scalable dynamic checkpointing. In: "2016 Seventh International Green and Sustainable Computing Conference (IGSC)". Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 1-8.
dc.identifier.isbn: 978-1-5090-5117-5
dc.identifier.uri: http://hdl.handle.net/2117/103468
dc.description.abstract: Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. The detection of those periods is important in order to adjust the system to new conditions. In this paper we present a monitoring system that listens to hardware events across computing nodes and forwards important events to the fault tolerance runtime so it can react to those regime changes. Our evaluation at scale shows several aspects of this dynamic checkpointing scheme, critical to understanding its applicability on production systems, as well as to identifying possible avenues for future improvements. In particular, we evaluate the ability of our system to monitor as many types of events as possible, measure their importance, and forward them to the resilience runtime.
dc.description.sponsorship: Results presented in this paper were obtained using the Chameleon testbed supported by the National Science Foundation. This material was based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research has also received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-Blanc 2 Project (www.montblancproject.eu), grant agreement No. 610402, and it has been supported in part by the European Union (FEDER funds) under contract TIN2015-65316-P.
dc.format.extent: 8 p.
dc.language.iso: eng
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.subject: UPC subject areas::Electronic engineering
dc.subject.lcsh: Supercomputers
dc.subject.lcsh: Hardware
dc.subject.lcsh: Resilience
dc.subject.lcsh: High performance computing
dc.subject.other: Supercomputers
dc.subject.other: Fault Tolerance
dc.subject.other: Resilience
dc.subject.other: Introspective Systems
dc.subject.other: Failures
dc.subject.other: High-Performance Computing
dc.title: Monitoring strategies for scalable dynamic checkpointing
dc.type: Conference lecture
dc.subject.lemac: Supercomputadors
dc.subject.lemac: Alta tecnologia
dc.description.peerreviewed: Peer Reviewed
dc.relation.publisherversion: http://ieeexplore.ieee.org/document/7892626/
dc.rights.access: Open Access
dc.description.version: Postprint (author's final draft)
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO/1PE/TIN2015-65316-P
upcommons.citation.published: true
upcommons.citation.publicationName: 2016 Seventh International Green and Sustainable Computing Conference (IGSC)
upcommons.citation.startingPage: 1
upcommons.citation.endingPage: 8



All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work is prohibited without permission of the copyright holder.