Enabling vehicular data with distributed machine learning
Rights access: Restricted access - publisher's policy (embargoed until 2016-12-12)
Vehicular Data includes different facts and measurements collected from a set of moving vehicles. Most of us use cars or public transportation for our work commute, daily routines and leisure. But, except for our destination, our estimated time of arrival and what is directly around us, we know very little about the traffic conditions in the city as a whole. Because all roads are connected in a vast network, events in other parts of town can and will directly affect us. The more we know about the traffic inside a city, the better decisions we can make. Vehicular measurements may contain a vast amount of information about the way our cities function. This information can be used for more than improving our commute: it is also indicative of other features of the city, such as the amount of pollution in different regions. All the information and knowledge we can extract can be used to directly improve our lives. We live in a world where data is constantly generated, and we store and process it at an ever-growing rate. Vehicular Data is no exception: it is rapidly growing in size and complexity, with more and more ways to monitor traffic, either from inside cars or from sensors placed on the road. Smartphones and in-car computers are now common, and they can produce a vast amount of data: they can identify a car's location, destination, current speed and even driving habits. Machine learning is the perfect complement for Big Data, as large data sets can be rendered useless without methods to extract knowledge and information from them. Machine learning, currently a popular research topic, offers a large number of algorithms designed for this task of knowledge extraction. Most of these techniques and algorithms can be directly applied to Vehicular Data. In this article we demonstrate how a simple algorithm, k-Nearest Neighbors, can be used to extract valuable information from even a relatively small vehicular data set.
Because of the vast size of most of our cities and the number of cars on their roads at any time of the day, standard machine learning systems cannot process data quickly enough to permit real-time use of the extracted information. A solution to this problem is offered by distributed systems and cloud processing. By parallelizing and distributing machine learning algorithms we can use data to its full potential and with little delay. Here, we show how this can be achieved by distributing the k-Nearest Neighbors machine learning algorithm over MPI. We hope this will motivate research into other combinations of machine learning algorithms and Vehicular Data sets.
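One common way to distribute k-Nearest Neighbors can be sketched as follows. This is only an illustrative sketch, not the implementation from the article: the data is split into shards, each shard finds its own k nearest points, and a root step merges those candidates and takes a majority vote. The shards are plain Python lists standing in for the data held by each MPI process, so no MPI library is needed to run it. The function names, the two features (speed in km/h, vehicles per minute) and the labels "congested" and "free" are all hypothetical.

```python
import heapq
import math
from collections import Counter


def local_top_k(shard, query, k):
    """Return the k nearest (distance, label) pairs found in one shard."""
    scored = ((math.dist(features, query), label) for features, label in shard)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


def distributed_knn(shards, query, k):
    """Classify `query` by merging the local top-k results of every shard."""
    # Each shard stands in for the data owned by one process; in an MPI
    # program this step would run in parallel, one call per rank.
    partials = [local_top_k(shard, query, k) for shard in shards]

    # The global k nearest neighbours are always among the local top-k
    # candidates, so the root only has to merge k items per shard.
    candidates = (pair for partial in partials for pair in partial)
    nearest = heapq.nsmallest(k, candidates, key=lambda pair: pair[0])

    # Majority vote over the labels of the k nearest neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical measurements: ((speed km/h, vehicles per minute), label).
shards = [
    [((12.0, 40.0), "congested"), ((15.0, 38.0), "congested")],
    [((55.0, 10.0), "free"), ((60.0, 8.0), "free")],
    [((18.0, 35.0), "congested"), ((50.0, 12.0), "free")],
]

slow_road = distributed_knn(shards, (14.0, 37.0), k=3)
fast_road = distributed_knn(shards, (58.0, 9.0), k=3)
```

Because each shard returns at most k candidates, the amount of data sent to the root depends on k and the number of shards rather than on the size of the full data set, which is what makes this pattern a plausible fit for message passing.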
This is a copy of the author's final draft version of an article published in the journal Transactions on computational collective intelligence. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-662-49017-4_6
Citation: Chilipirea, C., Petre, A., Dobre, C., Pop, F., Xhafa, F. Enabling vehicular data with distributed machine learning. "Transactions on computational collective intelligence", 2015, vol. 19, p. 89-102.