Universitat Politècnica de Catalunya

UPCommons. Global access to UPC knowledge

DS-Prox : dataset proximity mining for governing the data lake

Al-serafi, Ayman Mounir Mohamed
Calders, Toon
Abelló Gamazo, Alberto
Romero Moral, Óscar
Document type: Conference report
Defense date: 2017
Publisher: Springer
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder
Abstract
With the arrival of Data Lakes (DL), there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, in which pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps prune unnecessary computations early, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL. The experiments show significant efficiency gains above 25% compared to matching without early pruning, and recall rates reaching higher than 90% under certain scenarios.
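The early-pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the meta-feature names, the sample values, the relative-difference proximity function, and the 0.9 threshold are all assumptions chosen for illustration. It only shows the general shape of the approach, in which a cheap proximity score over dataset-level meta-features decides which pairs are passed on to expensive schema matching.

```python
from itertools import combinations

# Hypothetical meta-features per dataset. The names and values are made up
# for illustration and are not taken from the paper or from OpenML.
datasets = {
    "iris_a": {"n_attributes": 5, "n_instances": 150, "pct_numeric": 0.8},
    "iris_b": {"n_attributes": 5, "n_instances": 148, "pct_numeric": 0.8},
    "census": {"n_attributes": 15, "n_instances": 48842, "pct_numeric": 0.4},
}


def proximity(a, b):
    """Return a score in [0, 1] for non-negative meta-features.

    A score of 1 means the two meta-feature vectors are identical.
    This averaging scheme is an assumption, not the paper's measure.
    """
    sims = []
    for key in a:
        largest = max(abs(a[key]), abs(b[key]))
        # Relative difference, so large and small features weigh equally.
        diff = abs(a[key] - b[key]) / largest if largest else 0.0
        sims.append(1.0 - diff)
    return sum(sims) / len(sims)


def candidate_pairs(meta, threshold):
    """Keep only the pairs whose proximity reaches the threshold."""
    kept = []
    for left, right in combinations(sorted(meta), 2):
        score = proximity(meta[left], meta[right])
        if score >= threshold:
            kept.append((left, right, score))
    return kept


# Only the surviving pairs would go on to schema matching and deduplication.
pairs = candidate_pairs(datasets, threshold=0.9)
```

With these sample values, only the pair ("iris_a", "iris_b") passes the threshold, so two of the three possible pairwise comparisons would be skipped. Raising the threshold prunes more pairs at the risk of missing true matches, which is the efficiency versus recall trade-off the abstract reports.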
Citation: Al-serafi, A., Calders, T., Abello, A., Romero, O. DS-Prox : dataset proximity mining for governing the data lake. In: The International Conference on Similarity Search and Applications. "Similarity Search and Applications: 10th International Conference, SISAP 2017: Munich, Germany, October 4-6, 2017: proceedings". Berlin: Springer, 2017, p. 284-299.
URI: http://hdl.handle.net/2117/117036
DOI: 10.1007/978-3-319-68474-1_20
ISBN: 978-3-319-68474-1
Publisher version: https://link.springer.com/chapter/10.1007/978-3-319-68474-1_20
Collections
  • GESSI - Grup d'Enginyeria del Software i dels Serveis - Ponències/Comunicacions de congressos [197]
  • Departament d'Enginyeria de Serveis i Sistemes d'Informació - Ponències/Comunicacions de congressos [485]

Files: DS-Prox_SISAP_Paper-Camera_Ready.pdf (1,495Mb, PDF)

© UPC. Servei de Biblioteques, Publicacions i Arxius
