Distributed training of deep neural networks with spark: The MareNostrum experience
Rights access: Restricted access - publisher's policy (embargoed until 2021-04-23)
Deployment of a distributed deep learning technology stack on a large parallel system is a very complex process, involving the integration and configuration of several layers of both general-purpose and custom software. The details of such deployments are rarely described in the literature. This paper presents the experiences observed during the deployment of a technology stack to enable deep learning workloads on MareNostrum, a petascale supercomputer. The components of a layered architecture, based on the use of Apache Spark, are described, and the performance and scalability of the resulting system are evaluated. This is followed by a discussion of the impact of different configurations, including parallelism, storage and networking alternatives, and other aspects related to the execution of deep learning workloads on a traditional HPC setup. The derived conclusions should be useful for guiding similarly complex deployments in the future.
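The abstract does not detail the training scheme, but Spark-based deep learning stacks commonly follow a synchronous data-parallel pattern: each partition computes a gradient on its shard of the data, and the driver averages the results before updating the shared weights. The sketch below illustrates that pattern in plain Python (no Spark dependency); all function and variable names are hypothetical and not taken from the paper.

```python
# Conceptual sketch of synchronous data-parallel training (hypothetical code,
# not from the paper): each "partition" computes a local gradient, the driver
# averages them, and with equal-size shards one step equals a full-batch step.

def gradient(w, shard):
    # Mean-squared-error gradient for a 1-D linear model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def parallel_step(w, shards, lr=0.01):
    # "map": each worker computes its local gradient on its shard;
    # "reduce": the driver averages them (a Spark map/treeReduce, conceptually).
    grads = [gradient(w, s) for s in shards]
    avg = sum(grads) / len(grads)
    return w - lr * avg

data = [(x, 3.0 * x) for x in range(1, 9)]   # ground-truth slope is 3.0
shards = [data[0:4], data[4:8]]              # two equal-size "partitions"

w = 0.0
for _ in range(200):
    w = parallel_step(w, shards)
print(round(w, 3))  # converges toward the true slope 3.0
```

In a real Spark deployment the per-shard gradient would run inside an RDD or DataFrame operation on the executors, and the averaging step is where the networking and storage choices discussed in the paper dominate the scaling behavior.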
Citation: Cruz, L.; Tous, R.; Otero, B. Distributed training of deep neural networks with spark: The MareNostrum experience. "Pattern recognition letters", 1 July 2019, vol. 125, p. 174-178.