Large-scale retrospective event detection from tweets through a DBSCAN-like algorithm in Apache Spark
Tutor / director / evaluator: Torres Viñals, Jordi
Covenantee: Barcelona Supercomputing Center
Document type: Master thesis
Rights access: Open Access
Messages posted on Location-Based Social Networks (LBSNs) such as Twitter report everything from daily life stories to the latest local and global news. Monitoring and analyzing this rich, continuous stream of user-generated content can yield valuable information, helping users and organizations learn about occurrences and events. Because the volume of user-generated content is now extremely large, parallel processing becomes essential for complex data analysis. In this context, we propose OctreeDBSCAN 3D, an event discovery technique based on a scalable, distributed implementation of the popular density-based clustering algorithm DBSCAN. First, we evaluate the performance of the proposed algorithm against previous work, achieving speedups of 30x in the improved phases. Second, we evaluate the correctness of the clustering on real collected data, in which some tagged events were detected. In addition, tweeting activity varies widely across geographically large regions and temporally long periods, which limits the usefulness of the global parameters that DBSCAN-like algorithms require. With that in mind, we also present a novel density-aware MapReduce scheme that partitions tweet data according to its spatial and temporal features and tailors local DBSCAN parameters to local tweet densities. The evaluation of this scheme points to the benefits of our proposal over other state-of-the-art techniques in terms of speedup and detection accuracy.
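The idea of tailoring DBSCAN parameters to local tweet density can be illustrated with a small sketch. This is not the OctreeDBSCAN 3D implementation described in the thesis. It is a single-machine approximation in plain Python that assumes a uniform grid instead of an octree, treats each tweet as an (x, y, t) tuple whose time axis has already been scaled to be comparable with the spatial axes, and uses a hypothetical heuristic (`local_eps`) that scales the radius with the cube root of the inverse relative density. The function names, the grid size, and the heuristic are all assumptions made for illustration. Merging clusters that cross cell boundaries, which a distributed DBSCAN would need, is omitted.

```python
from collections import defaultdict
from math import sqrt


def dbscan(points, eps, min_pts):
    """Plain DBSCAN over (x, y, t) tuples. Returns one label per point; -1 is noise."""
    def neighbours(i):
        # Brute-force range query; a point counts as its own neighbour.
        return [j for j, q in enumerate(points)
                if sqrt(sum((a - b) ** 2 for a, b in zip(points[i], q))) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def partition(points, cell_size):
    """Map step (assumed uniform grid): bucket points by spatio-temporal cell."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(c // cell_size) for c in p)].append(p)
    return cells


def local_eps(cell_points, base_eps, reference_count):
    """Hypothetical heuristic: shrink eps in dense cells, enlarge it in sparse ones."""
    ratio = reference_count / max(len(cell_points), 1)
    return base_eps * ratio ** (1 / 3)  # cube root because the data is 3D


def cluster_by_cell(points, cell_size, base_eps, min_pts):
    """Reduce step: run DBSCAN per cell with a density-adjusted eps."""
    cells = partition(points, cell_size)
    reference = sum(len(v) for v in cells.values()) / len(cells)  # mean points per cell
    return {key: (eps, dbscan(pts, eps, min_pts))
            for key, pts in cells.items()
            for eps in [local_eps(pts, base_eps, reference)]}
```

Under these assumptions, calling `cluster_by_cell(points, cell_size, base_eps, min_pts)` returns, for each grid cell, the radius actually used and the per-point labels. A sparse cell receives a larger radius than `base_eps`, so two tweets slightly farther apart than the global radius can still form a cluster there, while a dense cell receives a smaller one.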