Strategies and algorithms for clustering large datasets: a review

Béjar Alonso, Javier

Visualitza/Obre

Article (479,8Kb)

Veure estadístiques d'ús d'UPCommons

Estadístiques de LA Referencia / Recolecta

Cita com:

Mostra el registre d'ítem complet

Béjar Alonso, Javier

Tipus de documentReport de recerca

Data publicació2013-10-28

Condicions d'accésAccés obert

Attribution-NonCommercial-NoDerivs 3.0 Spain

Llevat que s'hi indiqui el contrari, els continguts d'aquesta obra estan subjectes a la llicència de Creative Commons : Reconeixement-NoComercial-SenseObraDerivada 3.0 Espanya

Abstract

The exploratory nature of data analysis and data mining makes clustering one of the most usual tasks in these kind of projects. More frequently these projects come from many different application areas like biology, text analysis, signal analysis, etc that involve larger and larger datasets in the number of examples and the number of attributes. Classical methods for clustering data like K-means or hierarchical clustering are beginning to reach its maximum capability to cope with this increase of dataset size. The limitation for these algorithms come either from the need of storing all the data in memory or because of their computational time complexity. These problems have opened an area for the search of algorithms able to reduce this data overload. Some solutions come from the side of data preprocessing by transforming the data to a lower dimensionality manifold that represents the structure of the data or by summarizing the dataset by obtaining a smaller subset of examples that represent an equivalent information. A different perspective is to modify the classical clustering algorithms or to derive other ones able to cluster larger datasets. This perspective relies on many different strategies. Techniques such as sampling, on-line processing, summarization, data distribution and efficient datastructures have being applied to the problem of scaling clustering algorithms. This paper presents a review of different strategies and clustering algorithms that apply these techniques. The aim is to cover the different range of methodologies applied for clustering data and how they can be scaled.

CitacióBejar, J. "Strategies and algorithms for clustering large datasets: a review". 2013.

Forma partLSI-13-11-R

URIhttp://hdl.handle.net/2117/23415

URL repositori externhttp://www.lsi.upc.edu/~techreps/files/R13-11.zip

Col·leccions

Veure estadístiques d'ús d'UPCommons

Mostra el registre d'ítem complet

Fitxers	Descripció	Mida	Format	Visualitza
R13-11.pdf	Article	479,8Kb	PDF	Visualitza/Obre

UPCommons. Portal del coneixement obert de la UPC

Strategies and algorithms for clustering large datasets: a review

Visualitza/Obre

Explora