Network and e-business centered computing
Document type: Master thesis (pre-Bologna period)
Rights access: Restricted access - confidentiality agreement
Data mining has grown and evolved tremendously in recent decades. The k-Means algorithm is considered one of the most widely used methods for data clustering. However, the amount of data stored in databases keeps growing, and increasingly complex data needs to be analyzed. As a result, the original k-Means algorithm has become inefficient and scales poorly. Research in the field has nonetheless identified improvements that make the algorithm more suitable for clustering large-scale databases with a large number of attributes. These improvements focus mainly on boosting the efficiency and scalability of clustering algorithms. This dissertation surveys the earlier papers recognized as relevant and effective for improving the efficiency and scalability of the k-Means algorithm, especially as applied to large-scale databases. Following this review of the literature and of existing k-Means implementations, a solution to accelerate k-Means is presented. It builds on earlier improvements that reduce the number of distance calculations by means of the triangle inequality. A further part of the research compares my implementation of the triangle-inequality acceleration of k-Means with BIRCH, an algorithm that received the SIGMOD 10-Year Test of Time Award. After the experimental analysis comparing the original k-Means algorithm, the triangle-inequality acceleration, and the BIRCH approach, the conclusions contrast the original expectations with the actual results, which show average speed-ups on the order of 162% over the original algorithm.
Finally, directions for future research are presented, including suggestions for further improving this kind of algorithm by combining the different solutions tested in this dissertation into one that is expected to outperform each of them.
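To illustrate how the triangle inequality can reduce the number of distance calculations in k-Means, the following is a minimal sketch of a single assignment step. It is not the implementation described in the dissertation, which is under restricted access; the function name `assign_with_triangle_inequality`, the `skipped` counter, and the sample data are assumptions made for illustration only. The sketch relies on one well-known consequence of the triangle inequality: if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so the distance from x to c' need not be computed.

```python
import math


def assign_with_triangle_inequality(points, centers):
    """Assign each point to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary.

    Returns (labels, skipped), where skipped counts the point-to-center
    distances that were never computed.
    """
    k = len(centers)
    # Center-to-center distances, computed once per assignment step.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)]
          for i in range(k)]

    labels = []
    skipped = 0
    for x in points:
        best = 0
        best_d = math.dist(x, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then
            # d(x, c_j) >= d(x, c_best): c_j cannot be strictly closer.
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = math.dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


# Hypothetical sample data: two tight groups and three candidate centers.
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.2, 9.9)]
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
labels, skipped = assign_with_triangle_inequality(points, centers)
```

The assignment is the same as with an exhaustive search over all centers; only the number of evaluated distances changes. The saving grows when clusters are well separated, since more centers then satisfy the pruning condition.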