Keeping the data lake in form: proximity mining for pre-filtering schema matching

Al-serafi, Ayman Mounir Mohamed; Abelló Gamazo, Alberto; Romero Moral, Óscar; Calders, Toon

doi:10.1145/3388870


Proximity_Mining_Holistic_Matching_Prefiltering-Alserafi_preprint-V20_final_version.pdf (1,889Mb)


Document type: Article

Publication date: 2020-05

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication or transformation without the authorization of the rights holder is prohibited.

Abstract

Data Lakes (DLs) are large repositories of raw datasets from disparate sources. As more datasets are ingested into a DL, there is an increasing need for efficient techniques to profile them and to detect the relationships among their schemata, commonly known as holistic schema matching. Schema matching detects similarity between the information stored in the datasets to support information discovery and retrieval. At the volume of state-of-the-art DLs, this is currently computationally expensive. To handle this challenge, we propose a novel early-pruning approach to improve efficiency, where we collect different types of content metadata and schema metadata about the datasets, and then use this metadata in early-pruning steps to pre-filter the schema matching comparisons. This involves computing proximities between datasets based on their metadata, discovering their relationships based on overall proximities, and proposing similar dataset pairs for schema matching. We improve the effectiveness of this task by introducing a supervised mining approach for detecting similar datasets, which are then proposed for further schema matching. We conduct extensive experiments on a real-world DL, which demonstrate that our approach effectively detects similar datasets for schema matching, with recall rates of more than 85% and efficiency improvements above 70%. We empirically show the computational cost savings in space and time of our approach compared with instance-based schema matching techniques.
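
The pre-filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the metadata features, the profile values, the `overall_proximity` function, and the 0.7 threshold are all hypothetical, and an unweighted mean stands in for the supervised mining approach the abstract describes. It only shows the general shape: compute per-feature proximities from dataset-level metadata, combine them into an overall proximity, and pass only the closest pairs on to schema matching.

```python
from itertools import combinations

# Hypothetical dataset-level metadata profiles (illustrative values only).
profiles = {
    "customers_a": {"name": "customers_a", "n_attributes": 12, "pct_numeric": 0.50},
    "customers_b": {"name": "customers_b", "n_attributes": 11, "pct_numeric": 0.45},
    "sensor_log": {"name": "sensor_log", "n_attributes": 40, "pct_numeric": 0.95},
}


def trigrams(text):
    """Set of character trigrams of a string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def name_proximity(a, b):
    """Jaccard similarity over the character trigrams of two names."""
    ga, gb = trigrams(a), trigrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def numeric_proximity(a, b):
    """1.0 for equal values, decreasing with the relative difference."""
    if a == b:
        return 1.0
    return 1.0 - abs(a - b) / max(abs(a), abs(b))


def overall_proximity(p, q):
    """Unweighted mean of per-feature proximities.

    Assumption: a plain average replaces the supervised model
    mentioned in the abstract.
    """
    scores = [
        name_proximity(p["name"], q["name"]),
        numeric_proximity(p["n_attributes"], q["n_attributes"]),
        numeric_proximity(p["pct_numeric"], q["pct_numeric"]),
    ]
    return sum(scores) / len(scores)


def prefilter(profiles, threshold=0.7):
    """Return the dataset pairs whose overall proximity reaches the threshold."""
    candidates = []
    for a, b in combinations(sorted(profiles), 2):
        score = overall_proximity(profiles[a], profiles[b])
        if score >= threshold:
            candidates.append((a, b, round(score, 3)))
    return candidates


candidates = prefilter(profiles)
print(candidates)
```

With these illustrative profiles, only the `customers_a` / `customers_b` pair reaches the threshold, so two of the three possible pairwise comparisons would be skipped before any attribute-level matching takes place.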

Citation: Al-serafi, A. [et al.]. Keeping the data lake in form: proximity mining for pre-filtering schema matching. "ACM transactions on information systems", May 2020, vol. 38, no. 3, article 26, p. 1-30.

URI: http://hdl.handle.net/2117/189421

DOI: 10.1145/3388870

ISSN: 1046-8188

Publisher version: https://dl.acm.org/doi/abs/10.1145/3388870


UPCommons. Open knowledge portal of the UPC
