Data engineering for data science: two sides of the same coin

Romero Moral, Óscar; Wrembel, Robert

doi:10.1007/978-3-030-59065-9_13

main.pdf (314.5 KB)

Document type: Conference proceedings paper

Publication date: 2020

Publisher: Springer

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.

Abstract

A de facto technological standard of data science is based on notebooks (e.g., Jupyter), which provide an integrated environment to execute data workflows in different languages. However, from a data engineering point of view, this approach is typically inefficient and unsafe, as most of the data science languages process data locally, i.e., in workstations with limited memory, and store data in files. Thus, this approach neglects the benefits brought by over 40 years of R&D in the area of data engineering, i.e., advanced database technologies and data management techniques. In this paper, we advocate for a standardized data engineering approach for data science and we present a layered architecture for a data processing pipeline (DPP). This architecture provides a comprehensive conceptual view of DPPs, which next enables the semi-automation of the logical and physical designs of such DPPs.

Citation: Romero, O.; Wrembel, R. Data engineering for data science: two sides of the same coin. In: International Conference on Big Data Analytics and Knowledge Discovery. "Big Data Analytics and Knowledge Discovery, 22nd International Conference, DaWaK 2020: Bratislava, Slovakia, September 14-17, 2020: proceedings". Springer, 2020, p. 157-166. ISBN 978-3-030-59065-9. DOI 10.1007/978-3-030-59065-9_13.

URI: http://hdl.handle.net/2117/340117

DOI: 10.1007/978-3-030-59065-9_13

ISBN: 978-3-030-59065-9

Publisher's version: https://link.springer.com/chapter/10.1007%2F978-3-030-59065-9_13

UPCommons. Open knowledge portal of the UPC
