Monitoring strategies for scalable dynamic checkpointing

Perarnau, Swann; Bautista-Gomez, Leonardo

dc.contributor.author	Perarnau, Swann
dc.contributor.author	Bautista-Gomez, Leonardo
dc.contributor.other	Barcelona Supercomputing Center
dc.date.accessioned	2017-04-11T08:11:19Z
dc.date.available	2017-04-11T08:11:19Z
dc.date.issued	2017-04-06
dc.identifier.citation	Perarnau, S.; Bautista-Gomez, L. Monitoring strategies for scalable dynamic checkpointing. In: "2016 Seventh International Green and Sustainable Computing Conference (IGSC)". Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 1-8.
dc.identifier.isbn	978-1-5090-5117-5
dc.identifier.uri	http://hdl.handle.net/2117/103468
dc.description.abstract	Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. The detection of those periods is important in order to adjust the system to new conditions. In this paper we present a monitoring system that listens to hardware events across computing nodes and forwards important events to the fault tolerance runtime so it can react to those regime changes. Our evaluation at scale shows several aspects of this dynamic checkpointing scheme, critical to understanding its applicability on production systems, as well as to identifying possible avenues for future improvements. In particular, we evaluate the ability of our system to monitor as many types of events as possible, measure their importance, and forward them to the resilience runtime.
dc.description.sponsorship	Results presented in this paper were obtained using the Chameleon testbed supported by the National Science Foundation. This material was based upon work supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research has also received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-Blanc 2 Project (www.montblancproject.eu), grant agreement No. 610402, and it has been supported in part by the European Union (FEDER funds) under contract TIN2015-65316-P.
dc.format.extent	8 p.
dc.language.iso	eng
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.subject	UPC subject areas::Electronic engineering
dc.subject.lcsh	Supercomputers
dc.subject.lcsh	Hardware
dc.subject.lcsh	Resilience
dc.subject.lcsh	High performance computing
dc.subject.other	Supercomputers
dc.subject.other	Fault Tolerance
dc.subject.other	Resilience
dc.subject.other	Introspective Systems
dc.subject.other	Failures
dc.subject.other	High-Performance Computing
dc.title	Monitoring strategies for scalable dynamic checkpointing
dc.type	Conference lecture
dc.subject.lemac	Supercomputers
dc.subject.lemac	High technology
dc.description.peerreviewed	Peer Reviewed
dc.relation.publisherversion	http://ieeexplore.ieee.org/document/7892626/
dc.rights.access	Open Access
dc.description.version	Postprint (author's final draft)
dc.relation.projectid	info:eu-repo/grantAgreement/MINECO//TIN2015-65316-P/ES/COMPUTACION DE ALTAS PRESTACIONES VII/
local.citation.publicationName	2016 Seventh International Green and Sustainable Computing Conference (IGSC)
local.citation.startingPage	1
local.citation.endingPage	8

Files in this item

Name: Monitoring Strategies for Scalable ...
Size: 194.4 KB
Format: PDF

This item appears in the following collections

Conference lectures/papers [574]

UPCommons. Open knowledge portal of the UPC
