Distributed training of deep neural networks with spark: The MareNostrum experience
Rights access: Restricted access - publisher's policy (embargoed until 2021-04-23)
Deployment of a distributed deep learning technology stack on a large parallel system is a very complex process, involving the integration and configuration of several layers of both general-purpose and custom software. The details of such deployments are rarely described in the literature. This paper presents the experiences observed during the deployment of a technology stack to enable deep learning workloads on MareNostrum, a petascale supercomputer. The components of a layered architecture, based on the use of Apache Spark, are described, and the performance and scalability of the resulting system are evaluated. This is followed by a discussion of the impact of different configurations, including parallelism, storage and networking alternatives, and other aspects related to the execution of deep learning workloads on a traditional HPC setup. The derived conclusions should be useful for guiding similarly complex deployments in the future.
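The abstract does not detail the training scheme, but Spark-based deep learning stacks commonly follow a synchronous data-parallel pattern: each partition computes a gradient on its shard of the data, and the driver averages the results before updating the shared weights. The sketch below illustrates that pattern in plain Python (no Spark dependency); all function and variable names are hypothetical and not taken from the paper.

```python
# Conceptual sketch of synchronous data-parallel training (hypothetical code,
# not from the paper): each "partition" computes a local gradient, the driver
# averages them, and with equal-size shards one step equals a full-batch step.

def gradient(w, shard):
    # Mean-squared-error gradient for a 1-D linear model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def parallel_step(w, shards, lr=0.01):
    # "map": each worker computes its local gradient on its shard;
    # "reduce": the driver averages them (a Spark map/treeReduce, conceptually).
    grads = [gradient(w, s) for s in shards]
    avg = sum(grads) / len(grads)
    return w - lr * avg

data = [(x, 3.0 * x) for x in range(1, 9)]   # ground-truth slope is 3.0
shards = [data[0:4], data[4:8]]              # two equal-size "partitions"

w = 0.0
for _ in range(200):
    w = parallel_step(w, shards)
print(round(w, 3))  # converges toward the true slope 3.0
```

In a real Spark deployment the per-shard gradient would run inside an RDD or DataFrame operation on the executors, and the averaging step is where the networking and storage choices discussed in the paper dominate the scaling behavior.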
Citation: Cruz, L.; Tous, R.; Otero, B. Distributed training of deep neural networks with spark: The MareNostrum experience. "Pattern recognition letters", 1 July 2019, vol. 125, p. 174-178.