Improving the performance of a deep learning framework on high-performance computing (HPC) systems
Abstract
In this work we aim to improve the performance of the PyTorch 1.13.1 library in High-Performance Computing (HPC) applications running on Central Processing Units (CPUs). PyTorch is an open-source framework intended to ease the burden of programming neural networks for Machine Learning (ML). Since its creation by Facebook, PyTorch has offered parallelism features. However, these settings are normally fixed for an entire run, meaning that they are not adjusted during the training and inference processes of a neural network. In these processes, some sections of a network contain more parallelizable tasks than others. Therefore, if PyTorch's parallelism parameters are chosen once and kept fixed, the program runs below its best efficiency for most of its execution. A solution to this problem is to change PyTorch's parallelism settings dynamically, according to the structure of the neural network at each layer. To showcase this, we selected a type of neural network called Long Short-Term Memory (LSTM), which has varying widths. In other words, we chose a network containing sections where many operations can run in parallel and sections where only a few operations, or none, can run in parallel. The network we used is currently the state of the art for Natural Language Processing (NLP), which makes it a well-suited network to study for HPC applications. Such applications include tasks like machine translation, text generation and next-word prediction. In the present study, we developed a use case in which an LSTM network is used for inference. In this use case, the network ran under three different settings: without any parallelism configuration, with a fixed parallelism configuration, and with dynamically tuned parallelism configurations. The objective of this work is to show that, by approaching parallelism dynamically, many PyTorch applications can see a significant improvement in performance.
As a result, many companies that currently use these technologies in their products, such as Google, Facebook or Tesla, could see their costs reduced by a large margin. Moreover, by improving the inference performance of PyTorch 1.13.1, deep learning models could be deployed on devices that previously could not run them, such as mobile and edge devices. There is not only a substantial economic interest behind our work, but also an environmental one: by improving the performance of deep learning frameworks in training and inference, we can reduce the carbon footprint generated during these processes. This is especially important for HPC applications, where the environmental cost of computing is very high. To showcase these results, we programmed several use cases in Python, in an HPC environment offered by the Barcelona Supercomputing Center (BSC). In the end, we achieved a 10% improvement in performance and created simple guidelines that outperform 78% of PyTorch's configurations.
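The dynamic tuning described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the functions `pick_threads` and `run_layers`, the `(width, fn)` layer description and the `set_threads` callback are hypothetical names introduced here for illustration. The sketch assumes each stage of a network can be annotated with the number of operations it can run in parallel, and it picks a thread count per stage instead of one fixed value for the whole run. In a PyTorch program the callback might wrap `torch.set_num_threads`, which is an assumption about how the setting would be applied.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is described by (parallel_width, fn): how many independent
# operations it exposes, and a callable that executes it.
# This description format is hypothetical, chosen for illustration.
Stage = Tuple[int, Callable[[], None]]


def pick_threads(parallel_width: int, max_threads: int) -> int:
    """Use at most one thread per independent operation, and at least one."""
    return max(1, min(parallel_width, max_threads))


def run_layers(
    stages: Sequence[Stage],
    max_threads: int,
    set_threads: Callable[[int], None],
) -> List[int]:
    """Retune the thread count before each stage, then run the stage.

    Returns the thread count chosen for every stage, in order.
    """
    chosen = []
    for width, fn in stages:
        n = pick_threads(width, max_threads)
        # Assumption: in PyTorch this hook could be torch.set_num_threads.
        set_threads(n)
        fn()
        chosen.append(n)
    return chosen


if __name__ == "__main__":
    applied: List[int] = []  # records what the hook was asked to apply

    def noop() -> None:
        pass

    # A narrow stage, a wide stage, and a moderately wide stage.
    stages = [(1, noop), (64, noop), (4, noop)]
    print(run_layers(stages, max_threads=8, set_threads=applied.append))
```

A fixed configuration corresponds to calling the hook once before the loop. The dynamic variant trades the small cost of retuning at each stage for avoiding idle threads in narrow stages and underused cores in wide ones.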



