Adapting deep neural networks to a low-power environment
Tutor / director / evaluator: González Colás, Antonio María
Document type: Bachelor thesis
Rights access: Open Access
These days, working with deep neural networks goes hand in hand with the use of GPUs. Once a deep neural network has been trained for hours, days, or even weeks on a desktop GPU, it is deployed in the field, where it runs inference computations, which are far less expensive than training. This fact, together with the mobile nature of deep learning applications, makes it very attractive to run inference locally on a mobile device. Mobile applications that perform tasks involving deep neural networks already exist, but they rely on remote servers to run the most expensive computations. This is not ideal: the user's privacy may be compromised, or the algorithm's performance may suffer from latency on a poor network connection. In this project, we explore the possibility of running inference natively on a mobile GPU. One of the main applications of deep learning is object recognition, which encompasses problems such as classification, identification, and detection. We select a very deep neural network called Faster R-CNN, which solves a detection problem, and optimize it to run natively on a mobile platform. This gives a mobile device, such as a smartphone or a tablet, the capability of identifying objects and their locations in images, potentially improving the performance of applications that combine the device camera with deep neural networks. However, mobile devices have limited power, memory, and compute capability, which makes power- and memory-hungry applications such as deep neural networks hard to deploy and requires careful software design. As a result, mobile platforms present both an opportunity and a challenge for machine learning systems.
A preliminary profiling of the network on the Nvidia Jetson TX1 module, a state-of-the-art platform used in modern smartphones and handheld consoles, shows that the convolutional and fully-connected layers take most of the forward-pass execution time, up to 88.16% of the total. The network parameters occupy 548.3 MBytes, 87.2% of which belong to fully-connected layers. Hence, the main performance and energy bottlenecks are in the convolutional and fully-connected layers. To overcome these bottlenecks, we propose two optimizations. First, we use half-precision instead of single-precision floating-point numbers; this halves the memory bandwidth requirements, improving performance and providing energy savings. Second, we implement a neuron-pruning technique that removes up to 80% of the neurons in the fully-connected layers; pruning reduces the memory footprint of the network model and the number of floating-point operations, lowering both energy consumption and execution time. To evaluate these optimizations, we carry out thorough experimentation on an Nvidia Jetson TX1 module. Results show that, combining all the optimizations, we obtain, on average, a speedup of 1.55x, an energy reduction of 31.3%, an improvement in energy-delay of 2.26x, and a memory footprint reduction of 86%.
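The two optimizations described above can be sketched on a toy fully-connected layer (a minimal NumPy illustration, not the thesis implementation; the layer sizes and the L1-norm importance criterion used to select which neurons to prune are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully-connected layer: 4096 inputs -> 1024 neurons, stored in fp32.
# (Sizes are illustrative, not taken from Faster R-CNN.)
W = rng.standard_normal((1024, 4096)).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)
x = rng.standard_normal(4096).astype(np.float32)

# Optimization 1: half-precision storage halves the memory footprint
# (and the bandwidth needed to stream the weights during inference).
W_half = W.astype(np.float16)
assert W_half.nbytes == W.nbytes // 2

# Optimization 2: neuron pruning. Remove 80% of the neurons, keeping
# the rows with the largest L1 norm (one possible importance measure;
# the abstract does not specify the criterion used in the thesis).
importance = np.abs(W).sum(axis=1)
keep = np.argsort(importance)[int(0.8 * W.shape[0]):]  # top 20% of neurons
W_pruned = W_half[keep]
b_pruned = b[keep].astype(np.float16)

# Combined, the pruned half-precision layer needs roughly 10x less
# memory than the original fp32 layer.
print(W.nbytes / W_pruned.nbytes)

# Forward pass of the pruned layer (accumulate in fp32 for stability).
y = W_pruned.astype(np.float32) @ x + b_pruned.astype(np.float32)
print(y.shape)
```

Note that pruning whole neurons (rows of the weight matrix), rather than individual weights, keeps the remaining matrix dense, so the smaller layer still maps onto the same GPU GEMM kernels with no sparse bookkeeping.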