These days, working with deep neural networks goes hand in hand with the use of GPUs. Once a deep neural network has been trained for hours, days, or even weeks on a desktop GPU, it is deployed in the field, where it runs inference computations that are far less expensive than training. This fact, together with the inherently mobile nature of many deep learning applications, makes the possibility of running inference locally on a mobile device very attractive.
Mobile applications that perform tasks involving deep neural networks already exist, but they rely on remote servers to run the most expensive computations. This is not ideal: the user's privacy may be compromised, and performance may suffer from the latency of a poor network connection. In this project, the possibility of running inference natively on a mobile GPU is explored.
One of the main applications of deep learning is object recognition, which encompasses different problems such as classification, identification, and detection. In this project, we select a very deep neural network called Faster R-CNN, which solves a detection problem, and optimize it to run natively on a mobile platform. This gives a mobile device, such as a smartphone or a tablet, the capability to identify objects and their locations in images, potentially improving the performance of applications that currently employ the device camera combined with deep neural networks.
However, mobile devices are limited in power, memory, and compute capability. This makes power- and memory-hungry applications such as deep neural networks hard to deploy, requiring smart software design. As a result, mobile presents both an opportunity and a challenge for machine learning systems.
A preliminary profiling of the network on the Nvidia Jetson TX1 module, a state-of-the-art platform used in modern smartphones and handheld consoles, shows that the convolutional and fully-connected layers take most of the forward-pass execution time, up to 88.16% of the total. The network parameters occupy 548.3 MBytes, 87.2% of which belong to the fully-connected layers. Hence, the main performance and energy bottlenecks are in the convolutional and fully-connected layers.
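To make the measurement concrete, the following is a minimal Python sketch of how such a per-layer time breakdown can be obtained. The profile_forward helper and the toy layers list are hypothetical stand-ins for the real network's forward steps, not the instrumentation actually used in the thesis.

    import time
    import numpy as np

    # Hypothetical per-layer timing loop: `layers` is a list of
    # (name, callable) pairs standing in for the network's forward steps.
    def profile_forward(layers, x, runs=50):
        totals = {name: 0.0 for name, _ in layers}
        for _ in range(runs):
            y = x
            for name, layer in layers:
                t0 = time.perf_counter()
                y = layer(y)
                totals[name] += time.perf_counter() - t0
        grand = sum(totals.values())
        for name, t in totals.items():
            print(f"{name:6s} {t / runs * 1e3:8.2f} ms/pass  {100 * t / grand:5.1f}%")

    # Toy layers only; a real profile would instrument Faster R-CNN itself.
    layers = [
        ("conv", lambda x: np.maximum(x @ np.ones((256, 256), np.float32), 0.0)),
        ("fc",   lambda x: x @ np.ones((256, 10), np.float32)),
    ]
    profile_forward(layers, np.ones((1, 256), np.float32))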
To overcome these bottlenecks, two optimizations are proposed. First, we use half-precision (FP16) floating-point arithmetic instead of single-precision; this halves the required memory bandwidth, improving performance and providing energy savings. Second, we implement a neuron-pruning technique that removes up to 80% of the neurons in the fully-connected layers; pruning reduces the memory footprint of the network model and the number of floating-point operations, reducing both energy consumption and execution time. A minimal sketch of both ideas follows.
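The Python/NumPy sketch below illustrates both optimizations on a toy fully-connected layer. The prune_fc_layer function and its pruning criterion (ranking neurons by the L1 norm of their weights, a common heuristic) are illustrative assumptions, not the exact method used in the thesis.

    import numpy as np

    def prune_fc_layer(W, b, keep_ratio=0.2):
        # Keep the `keep_ratio` fraction of output neurons (rows of W)
        # with the largest L1 weight norm; an illustrative criterion.
        importance = np.abs(W).sum(axis=1)
        n_keep = max(1, int(round(keep_ratio * W.shape[0])))
        kept = np.sort(np.argsort(importance)[-n_keep:])
        return W[kept], b[kept], kept

    rng = np.random.default_rng(0)
    W = rng.standard_normal((4096, 9216)).astype(np.float32)  # toy FC layer
    b = rng.standard_normal(4096).astype(np.float32)

    # Optimization 1: half precision halves the weights' memory footprint.
    W16, b16 = W.astype(np.float16), b.astype(np.float16)

    # Optimization 2: prune 80% of the neurons in the layer.
    W16p, b16p, kept = prune_fc_layer(W16, b16, keep_ratio=0.2)

    x = rng.standard_normal(9216).astype(np.float16)
    y = W16p @ x + b16p                             # pruned FP16 forward pass
    print(W.nbytes // 2**20, W16p.nbytes // 2**20)  # MBytes before vs. after

Note that in the full network, removing output neurons from one layer also removes the corresponding input columns of the next layer's weight matrix, so the memory and floating-point savings compound across consecutive fully-connected layers.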
To evaluate the aforementioned optimizations, thorough experimentation is carried out on an Nvidia Jetson TX1 module. Results show that, combining all the optimizations, we obtain, on average, a speedup of 1.55x, an energy reduction of 31.3%, an energy-delay improvement of 2.26x, and a memory footprint reduction of 86%.
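The energy-delay figure is consistent with the first two results: a 1.55x speedup scales delay by 1/1.55 ≈ 0.645, and a 31.3% energy reduction scales energy by 1 − 0.313 = 0.687, so the energy-delay product scales by 0.645 × 0.687 ≈ 0.443, whose inverse is the reported 2.26x improvement.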