An integration data tool for joinable tables based on Apache Spark

Flores Herrera, Javier de Jesús

dc.contributor	Nadal Francesch, Sergi
dc.contributor	Romero Moral, Óscar
dc.contributor.author	Flores Herrera, Javier de Jesús
dc.contributor.other	Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació
dc.date.accessioned	2021-01-21T11:47:51Z
dc.date.available	2021-01-21T11:47:51Z
dc.date.issued	2020-06-29
dc.identifier.uri	http://hdl.handle.net/2117/335717
dc.description.abstract	Data analysts perform exploratory programming on notebooks for several analytical tasks. One of them is data discovery, which consists of finding attributes that might join. This is time-consuming, and new techniques are needed to suggest joinable attributes and speed up data analysis. Those attributes should produce high-quality joins, which we define as joins between attributes that share a high number of unique values. In this thesis, we aim to find high-quality joinable attributes with a three-step approach: attribute profiling, classification and ranking. We create five categorical labels to represent the quality of the join that two attributes might have, and use a One-vs-the-Rest strategy to build the machine learning models. We aim to integrate data discovery with notebooks and well-known data management tools, so we prototype our techniques on top of mature tools for exploratory and large-scale data processing, namely Jupyter and Apache Spark. We ran four experiments with real datasets to validate our approach. The experiments suggest that the approach generalizes to finding high-quality joins regardless of topic. Our solution can reduce the time needed to find joinable attributes by avoiding manual data exploration across multiple datasets.
dc.language.iso	eng
dc.publisher	Universitat Politècnica de Catalunya
dc.subject	Àrees temàtiques de la UPC::Informàtica
dc.subject.lcsh	Big data
dc.subject.other	data discovery
dc.subject.other	data integration
dc.subject.other	attribute profiling
dc.subject.other	random forest
dc.subject.other	data fusion
dc.subject.other	joinable attributes
dc.subject.other	quality join
dc.title	An integration data tool for joinable tables based on Apache Spark
dc.type	Master thesis
dc.subject.lemac	Dades massives
dc.subject.lemac	Anàlisi de dades
dc.identifier.slug	152734
dc.rights.access	Open Access
dc.date.updated	2020-09-21T06:49:45Z
dc.audience.educationlevel	Màster
dc.audience.mediator	Facultat d'Informàtica de Barcelona
dc.audience.degree	MÀSTER UNIVERSITARI EN INNOVACIÓ I RECERCA EN INFORMÀTICA (Pla 2012)

Files in this item

Name: 152734.pdf
Size: 1.814Mb
Format: PDF

This item appears in the following collections

Master in Innovation and Research in Informatics - MIRI [454]

UPCommons. The UPC open knowledge portal