An oracle for guiding large-scale model/hybrid parallel training of convolutional neural networks
Cite as: hdl:2117/348972
Document type: Conference paper
Publication date: 2021
Publisher: Association for Computing Machinery (ACM)
Access conditions: Open access
All rights reserved. This work is protected by the applicable intellectual and industrial property rights. Without prejudice to any existing legal exemptions, its reproduction, distribution, public communication, or transformation without the authorization of the rights holder is prohibited.
Projects: EUROLAB4HPC2 - Consolidation of European Research Excellence in Exascale HPC Systems (EC-H2020-800962)
INPhINIT - Innovative doctoral programme for talented early-stage researchers in Spanish host organisations excellent in the areas of Science, Technology, Engineering and Mathematics (STEM) (EC-H2020-713673)
Abstract
Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and to alleviate memory capacity limitations when training large models and/or using high-dimensional inputs. With the steady increase in dataset and model sizes, model/hybrid parallelism is expected to play an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches in terms of performance and scalability. Our model-driven analysis forms the basis of an oracle utility that can help detect the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% compared with empirical results, and as high as 97.57% for data parallelism.
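To illustrate the kind of model-driven analysis the abstract describes, the sketch below compares the per-step communication volume of two parallelism strategies for a single convolutional layer. This is a hypothetical simplification for intuition only, not the paper's actual oracle: `allreduce_bytes` uses the standard ring all-reduce traffic formula for data parallelism (gradient synchronization), and `halo_exchange_bytes` estimates the activation halo exchange of a spatial model-parallel split; all function names and the example layer dimensions are assumptions.

```python
def allreduce_bytes(num_params: int, workers: int, bytes_per_elem: int = 4) -> float:
    """Per-worker ring all-reduce traffic for gradient sync under data
    parallelism: 2 * (p - 1) / p * message size (in bytes)."""
    return 2 * (workers - 1) / workers * num_params * bytes_per_elem


def halo_exchange_bytes(height: int, width: int, channels: int,
                        halo: int, workers: int,
                        bytes_per_elem: int = 4) -> float:
    """Per-worker traffic when the spatial height dimension is split across
    workers and each partition exchanges a halo strip of `halo` rows with
    each of its two neighbours (interior partitions)."""
    rows = 2 * halo  # one halo strip per neighbour
    return rows * width * channels * bytes_per_elem


if __name__ == "__main__":
    # Example: a 3x3 conv with 256 input and 256 output channels
    # operating on a 56x56x256 activation map, on 16 workers.
    params = 3 * 3 * 256 * 256
    dp = allreduce_bytes(params, workers=16)
    mp = halo_exchange_bytes(56, 56, 256, halo=1, workers=16)
    print(f"data parallel (grad all-reduce): {dp / 1e6:.2f} MB/step")
    print(f"spatial model parallel (halos):  {mp / 1e6:.2f} MB/step")
```

For a weight-heavy layer like this one, the gradient all-reduce volume dwarfs the halo traffic, which is one reason model/hybrid parallelism becomes attractive at scale; the paper's oracle formalizes such trade-offs across compute, communication, and memory.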
Citation: Kahira, A. [et al.]. An oracle for guiding large-scale model/hybrid parallel training of convolutional neural networks. A: ACM International Symposium on High-Performance Parallel and Distributed Computing. "HPDC'21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing: June 21-25, 2021, virtual event, Sweden". New York: Association for Computing Machinery (ACM), 2021, p. 161-173. ISBN 978-1-4503-8217-5. DOI 10.1145/3431379.3460644.
ISBN: 978-1-4503-8217-5
Publisher's version: https://dl.acm.org/doi/10.1145/3431379.3460644
Collections
- Doctorat en Arquitectura de Computadors - Conference papers/communications [294]
- Computer Sciences - Conference papers/communications [574]
- CAP - Grup de Computació d'Altes Prestacions - Conference papers/communications [784]
- Departament d'Arquitectura de Computadors - Conference papers/communications [1,955]
Files | Description | Size | Format
---|---|---|---
An Oracle.pdf | | 1.276 MB | PDF