
Universitat Politècnica de Catalunya


UPCommons. Global access to UPC knowledge


An integration data tool for joinable tables based on apache spark

152734.pdf (1,814Mb)
Cite as:
hdl:2117/335717

Author: Flores Herrera, Javier de Jesús
Tutor / director: Nadal Francesch, Sergi; Romero Moral, Óscar
Document type: Master thesis
Date: 2020-06-29
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder
Abstract
Data analysts perform exploratory programming on notebooks for several analytical tasks. One such task is data discovery, which consists of finding attributes that might join. This is time-consuming, and new techniques are needed to suggest joinable attributes and speed up data analysis. Those attributes should produce high-quality joins. We consider a join to be of high quality when its attributes share a high number of unique values. In this thesis, we aim to find high-quality joinable attributes through a three-step approach: attribute profiling, classification, and ranking. We define five categorical labels to represent the quality of the join that two attributes might have, and we use a One-vs-the-Rest strategy to create the machine learning models. We aim to integrate data discovery with notebooks and well-known data management tools, so we prototype our techniques on top of mature tools for exploratory and large-scale data processing, namely Jupyter and Apache Spark. We ran four experiments with real datasets to validate our approach. The experiments suggest that the approach generalizes to finding high-quality joins regardless of topic. Our solution can reduce the time needed to find joinable attributes, without requiring manual data exploration across multiple datasets.
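The abstract's notion of join quality (attributes sharing a high number of unique values) can be illustrated with a minimal Python sketch. This is not the thesis's method, which relies on attribute profiling and trained One-vs-the-Rest classifiers on Apache Spark. The function names, the overlap measure, the threshold values, and the five label names below are all assumptions made for illustration only.

```python
def distinct_overlap(left, right):
    """Share of the smaller column's distinct values that also occur in the other column.

    Assumed measure for illustration; the thesis may define quality differently.
    """
    left_values, right_values = set(left), set(right)
    if not left_values or not right_values:
        return 0.0
    shared = left_values & right_values
    return len(shared) / min(len(left_values), len(right_values))


# Hypothetical thresholds and label names: the abstract only states that
# five categorical labels exist, not what they are or how they are assigned.
LABEL_THRESHOLDS = [
    (0.75, "high"),
    (0.50, "good"),
    (0.25, "moderate"),
    (0.10, "poor"),
]


def join_quality_label(left, right):
    """Map the overlap score of two columns to one of five illustrative labels."""
    score = distinct_overlap(left, right)
    for threshold, label in LABEL_THRESHOLDS:
        if score >= threshold:
            return label
    return "none"


if __name__ == "__main__":
    customer_country = ["ES", "FR", "DE", "IT"]
    country_code = ["ES", "FR", "DE", "PT", "ES"]
    print(distinct_overlap(customer_country, country_code))
    print(join_quality_label(customer_country, country_code))
```

Fixed thresholds stand in here for the classification step only to make the definition concrete. A learned classifier, as described in the abstract, would instead predict the label from profiled attribute features.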
Subjects: Big data, Dades massives (big data), Anàlisi de dades (data analysis)
Degree: MÀSTER UNIVERSITARI EN INNOVACIÓ I RECERCA EN INFORMÀTICA (Pla 2012)
URI: http://hdl.handle.net/2117/335717
Collections
  • Màsters oficials - Master in Innovation and Research in Informatics - MIRI [494]
