Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

Jaulmes, Luc; Casas, Marc; Moretó Planas, Miquel; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José; Valero Cortés, Mateo

doi:10.1145/2807591.2807599

dc.contributor.author	Jaulmes, Luc
dc.contributor.author	Casas, Marc
dc.contributor.author	Moretó Planas, Miquel
dc.contributor.author	Ayguadé Parra, Eduard
dc.contributor.author	Labarta Mancho, Jesús José
dc.contributor.author	Valero Cortés, Mateo
dc.contributor.other	Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.contributor.other	Barcelona Supercomputing Center
dc.date.accessioned	2016-07-05T08:16:54Z
dc.date.available	2016-07-05T08:16:54Z
dc.date.issued	2015
dc.identifier.citation	Jaulmes, L., Casas, M., Moreto, M., Ayguadé, E., Labarta, J., Valero, M. Exploiting asynchrony from exact forward recovery for DUE in iterative solvers. In: International Conference for High Performance Computing, Networking, Storage and Analysis. "Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2015)". Austin, TX: Association for Computing Machinery (ACM), 2015, p. 53:1-53:12.
dc.identifier.isbn	978-1-4503-3723-6
dc.identifier.uri	http://hdl.handle.net/2117/88494
dc.description.abstract	This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE) relying on error detection techniques already available in commodity hardware. Detection operates at the memory page level, which enables the use of simple algorithmic redundancies to correct errors. Such redundancies would be inapplicable under coarse-grained error detection, but become very powerful when the hardware is able to precisely detect errors. Relations straightforwardly extracted from the solver allow lost data to be recovered exactly. This method is free of the overheads of backward recovery schemes such as checkpointing, and does not compromise mathematical convergence properties of the solver as restarting would do. We apply this recovery to three widely used Krylov subspace methods, CG, GMRES and BiCGStab, and their preconditioned versions. We implement our resilience techniques on CG considering scenarios from small (8 cores) to large (1024 cores) scales, and demonstrate very low overheads compared to state-of-the-art solutions. We deploy our recovery techniques either by overlapping them with algorithmic computations or by forcing them to be in the critical path of the application. A trade-off exists between the two approaches depending on the error rate the solver is suffering. Under realistic error rates, overlapping decreases overheads from 5.37% down to 3.59% for a non-preconditioned CG on 8 cores.
dc.description.sponsorship	This work has been partially supported by the European Research Council under the European Union's 7th FP, ERC Advanced Grant 321253, and by the Spanish Ministry of Science and Innovation under grant TIN2012-34557. L. Jaulmes has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2013/06982. M. Moreto has been partially supported by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas has been partially supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Co-fund programme of the Marie Curie Actions of the European Union's 7th FP (contract 2013 BP B 00243).
dc.language.iso	eng
dc.publisher	Association for Computing Machinery (ACM)
dc.subject	UPC subject areas::Computer science::Computer architecture
dc.subject.lcsh	Error-correcting codes (Information theory)
dc.title	Exploiting asynchrony from exact forward recovery for DUE in iterative solvers
dc.type	Conference report
dc.subject.lemac	Codis correctors d'errors (Teoria de la informació)
dc.contributor.group	Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
dc.identifier.doi	10.1145/2807591.2807599
dc.description.peerreviewed	Peer Reviewed
dc.relation.publisherversion	http://dl.acm.org/citation.cfm?doid=2807591.2807599
dc.rights.access	Open Access
local.identifier.drac	17530716
dc.description.version	Postprint (author's final draft)
dc.relation.projectid	info:eu-repo/grantAgreement/EC/FP7/321253/EU/Riding on Moore's Law/ROMOL
local.citation.author	Jaulmes, L.; Casas, M.; Moreto, M.; Ayguadé, E.; Labarta, J.; Valero, M.
local.citation.contributor	International Conference for High Performance Computing, Networking, Storage and Analysis
local.citation.pubplace	Austin, TX
local.citation.publicationName	Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2015)
local.citation.startingPage	53:1
local.citation.endingPage	53:12
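
The abstract's idea of recovering lost data exactly from relations already present in the solver can be illustrated with a small sketch. It assumes the standard CG invariant r = b - A x between the iterate x and the residual r; the abstract does not state which relations the paper actually uses, so this choice is an assumption. The function names, the dense list-of-lists matrix, and the index list standing in for a failed memory page are all hypothetical, and this is not the authors' implementation.

```python
import math

def matvec_rows(A, x, rows):
    """Return the entries of A @ x for the given row indices only."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in rows]

def recover_residual_block(A, b, x, lost):
    """Recompute lost entries of r from r = b - A x (b and x intact)."""
    return [b[i] - ax for i, ax in zip(lost, matvec_rows(A, x, lost))]

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense block."""
    n = len(rhs)
    aug = [row[:] + [v] for row, v in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, n):
            f = aug[k][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[k][c] -= f * aug[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - s) / aug[i][i]
    return sol

def recover_iterate_block(A, b, r, x, lost):
    """Recover lost entries of x from A x = b - r on the lost rows.

    Solves A[lost, lost] x[lost] = (b - r)[lost] - A[lost, kept] x[kept].
    The damaged entries x[lost] are never read.
    """
    lost_set = set(lost)
    kept = [j for j in range(len(x)) if j not in lost_set]
    rhs = [b[i] - r[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    block = [[A[i][j] for j in lost] for i in lost]
    return solve_dense(block, rhs)

# Small symmetric positive definite system and an arbitrary intermediate iterate.
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = [0.1, 0.2, 0.3, 0.4]
r = [bi - ax for bi, ax in zip(b, matvec_rows(A, x, range(4)))]

# Simulate a detected, uncorrected error hitting the "page" holding x[1], x[2].
lost = [1, 2]
x_damaged = x[:]
for i in lost:
    x_damaged[i] = float("nan")

recovered = recover_iterate_block(A, b, r, x_damaged, lost)
print(all(math.isclose(v, x[i], rel_tol=1e-9) for v, i in zip(recovered, lost)))
```

Because the diagonal block of a symmetric positive definite matrix is itself positive definite, the small system for the lost entries is always solvable in this setting. In floating point the recovered values match the originals up to rounding, and no earlier iterations need to be replayed, which is the contrast with checkpoint-based backward recovery drawn in the abstract.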

Files in this item

Name: ljaulmes-SC2015-openaccess-1.pdf
Size: 602.2 KB
Format: PDF

This item appears in the following collections

Conference papers/presentations [574]
Conference papers/presentations [784]
Conference papers/presentations [1,954]

UPCommons. UPC open knowledge portal
