dc.contributor: Nadal Francesch, Sergi
dc.contributor: Romero Moral, Óscar
dc.contributor.author: Flores Herrera, Javier de Jesús
dc.contributor.other: Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació
dc.date.accessioned: 2021-01-21T11:47:51Z
dc.date.available: 2021-01-21T11:47:51Z
dc.date.issued: 2020-06-29
dc.identifier.uri: http://hdl.handle.net/2117/335717
dc.description.abstract: Data analysts use notebooks for exploratory programming across several analytical tasks. One such task is data discovery, which consists of finding attributes that might join. This is time-consuming, and new techniques are needed to suggest joinable attributes and speed up data analysis. Those attributes should produce high-quality joins, which we define as joins between attributes that share a high number of unique values. In this thesis, we aim to find high-quality joinable attributes through a three-step approach: attribute profiling, classification, and ranking. We define five categorical labels to represent the quality of the join that two attributes might have, and we train machine learning models using a One-vs-the-Rest strategy. We aim to integrate data discovery with notebooks and well-known data management tools, so we prototype our techniques on top of mature tools for exploratory and large-scale data processing, namely Jupyter and Apache Spark. We ran four experiments with real datasets to validate our approach. The experiments suggest that the approach generalizes to finding high-quality joins regardless of topic. Our solution can reduce the time spent finding joinable attributes by avoiding manual data exploration across multiple datasets.
dc.language.iso: eng
dc.publisher: Universitat Politècnica de Catalunya
dc.subject: Àrees temàtiques de la UPC::Informàtica
dc.subject.lcsh: Big data
dc.subject.other: data discovery
dc.subject.other: data integration
dc.subject.other: attribute profiling
dc.subject.other: random forest
dc.subject.other: data fusion
dc.subject.other: joinable attributes
dc.subject.other: quality join
dc.title: An integration data tool for joinable tables based on apache spark
dc.type: Master thesis
dc.subject.lemac: Dades massives
dc.subject.lemac: Anàlisi de dades
dc.identifier.slug: 152734
dc.rights.access: Open Access
dc.date.updated: 2020-09-21T06:49:45Z
dc.audience.educationlevel: Màster
dc.audience.mediator: Facultat d'Informàtica de Barcelona
dc.audience.degree: MÀSTER UNIVERSITARI EN INNOVACIÓ I RECERCA EN INFORMÀTICA (Pla 2012)


All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication, or transformation of this work is prohibited without permission of the copyright holder.