Image analysis using deep-learning over a distributed platform such as spark in the Marenostrum
Document type: Master thesis
Rights access: Open Access
In recent years the digital universe has grown exponentially, and widespread use of the Internet has led to new ways of generating and consuming information. Only 10 years ago, collecting data required deploying very expensive technological infrastructure, whether for a simple market analysis or for specialized scientific research in complex areas such as medicine, physics, and astronomy. As the information age has matured, costs have fallen, and information is now readily available or can be generated from low-cost devices, giving rise to Big Data. This opens the door to new opportunities such as image processing, with both commercial and scientific applications. Examples include current commercial solutions such as image recognition in security systems for vehicles in circulation, and ongoing research into autonomous driving and the detection of diseases from scanner images of patients' organs.

This thesis addresses the problem of processing and classifying images into categories, from any data source, using deep learning. To this end, we implemented a convolutional neural network (ConvNet) architecture in Java. The implementation comprises an ETL module, which loads images in raw format and transforms them into tensors for processing within the ConvNet, and the network itself, which has 12 layers: 6 convolutional layers, 5 max-pooling layers, and 1 fully connected layer. The convolutional layers use ReLU as the activation function, while the last layer, which performs the classification, uses the softmax function. The application runs in two modes, standalone and distributed. In both cases it was developed with Deeplearning4j, a Java-based deep learning framework, used here to build the convolutional network.
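The two activation functions named above can be illustrated in plain Java. This is a minimal sketch and not the thesis code: the class and method names (`Activations`, `relu`, `softmax`, `argmax`) are hypothetical, and it uses plain arrays rather than Deeplearning4j so that it runs without dependencies. ReLU zeroes negative values element-wise, and softmax turns the final layer's raw scores into class probabilities that sum to 1.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration only; not part of the thesis or of Deeplearning4j.
final class Activations {

    /** ReLU applied element-wise: max(0, x). */
    static double[] relu(double[] x) {
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = Math.max(0.0, x[i]);
        }
        return out;
    }

    /** Softmax over raw class scores; subtracting the maximum avoids overflow in exp(). */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : scores) {
            max = Math.max(max, v);
        }
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    /** Index of the largest probability, i.e. the predicted class. */
    static int argmax(double[] p) {
        int best = 0;
        for (int i = 1; i < p.length; i++) {
            if (p[i] > p[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {-1.5, 0.0, 2.0};   // made-up scores for three classes
        System.out.println("ReLU:    " + Arrays.toString(relu(scores)));
        double[] probs = softmax(scores);
        System.out.println("Softmax: " + Arrays.toString(probs));
        System.out.println("Predicted class index: " + argmax(probs));
    }
}
```

In a classifier of this kind, the predicted category is the index with the highest softmax probability, which is what `argmax` returns.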
In addition, ND4J, a scientific computing library for Java that complements Deeplearning4j, was used for n-dimensional array handling, linear algebra, and signal processing functions. For the distributed environment, Apache Spark served as the cluster platform, with RDDs (Resilient Distributed Datasets) as the model for distributing data across the nodes of the Spark cluster. The results of the experiments varied with the parameters supplied, with accuracy ranging from 60% to 80%. Training the model in standalone mode takes considerably longer than in distributed mode (hours versus minutes). As future work, this project could be coupled to any other system that performs object recognition, especially one developed in Java, which would make it a multiplatform solution.
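The abstract does not say how the model is kept in sync across the Spark nodes. One common data-parallel scheme, assumed here purely for illustration, is parameter averaging: each worker trains a copy of the network on its own partition of the data, and the copies' parameters are periodically averaged. The sketch below shows only the averaging step in plain Java. The class and method names (`ParameterAveraging`, `average`) are hypothetical and use neither Spark nor Deeplearning4j.

```java
import java.util.Arrays;

// Hypothetical illustration of the averaging step in data-parallel training;
// whether the thesis used this scheme is an assumption, not stated in the abstract.
final class ParameterAveraging {

    /**
     * Averages parameter vectors element-wise.
     * Each row of workerParams holds the flattened parameters of one worker's model copy.
     */
    static double[] average(double[][] workerParams) {
        if (workerParams.length == 0) {
            throw new IllegalArgumentException("at least one worker is required");
        }
        int n = workerParams[0].length;
        double[] avg = new double[n];
        for (double[] params : workerParams) {
            if (params.length != n) {
                throw new IllegalArgumentException("all workers must have the same number of parameters");
            }
            for (int i = 0; i < n; i++) {
                avg[i] += params[i];
            }
        }
        for (int i = 0; i < n; i++) {
            avg[i] /= workerParams.length;
        }
        return avg;
    }

    public static void main(String[] args) {
        // Made-up parameters from two workers after a round of local training.
        double[][] workers = {
            {1.0, 2.0},
            {3.0, 4.0}
        };
        System.out.println(Arrays.toString(average(workers)));
    }
}
```

Because each worker processes only its share of the images, a scheme like this is one way distributed training time can fall well below standalone training time.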