Distributed training of deep neural networks with Spark: The MareNostrum experience
DOI: 10.1016/j.patrec.2019.01.020
Cite as: hdl:2117/169362
Document type: Article
Publication date: 2019-07-01
Publisher: Elsevier
Access conditions: Open access
Unless otherwise indicated, the contents of this work are subject to the Creative Commons license: Attribution-NonCommercial-NoDerivs 3.0 Spain
Abstract
Deploying a distributed deep learning technology stack on a large parallel system is a complex process, involving the integration and configuration of several layers of both general-purpose and custom software. The details of such deployments are rarely described in the literature. This paper presents the experiences gathered during the deployment of a technology stack to enable deep learning workloads on MareNostrum, a petascale supercomputer. The components of a layered architecture based on Apache Spark are described, and the performance and scalability of the resulting system are evaluated. This is followed by a discussion of the impact of different configurations, including parallelism, storage, and networking alternatives, and of other aspects related to the execution of deep learning workloads on a traditional HPC setup. The derived conclusions should be useful to guide similarly complex deployments in the future.
Citation: Cruz, L.; Tous, R.; Otero, B. Distributed training of deep neural networks with Spark: The MareNostrum experience. "Pattern Recognition Letters", 1 July 2019, vol. 125, p. 174-178.
ISSN: 0167-8655
Publisher's version: https://www.sciencedirect.com/science/article/abs/pii/S0167865519300145
Files | Description | Size | Format |
---|---|---|---|
PRL2018_upc.pdf | | 154.9 KB | PDF |