Effective data pre-processing for AutoML

Giovanelli, Joseph; Bilalli, Besim; Abelló Gamazo, Alberto


Document type: Conference paper

Publication date: 2021

Publisher: CEUR-WS.org

Access conditions: Open access

Unless otherwise indicated, the contents of this work are subject to the Creative Commons license: Attribution 4.0 International

Abstract

Data pre-processing plays a key role in a data analytics process (e.g., supervised learning). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, and there are no defined rules, on how pre-processing transformations (e.g., normalization, discretization) impact the final results of the analysis. The problem is exacerbated when transformations are combined into pre-processing pipeline prototypes. Data scientists cannot easily foresee the impact of pipeline prototypes and hence require a method to discriminate between them and find the most relevant ones (e.g., those with the highest positive impact) for their study at hand. Once found, these pipelines can be optimized using AutoML in order to generate executable pipelines (i.e., with parametrized operators for each transformation). In this work, we study the impact of transformations in general, and the impact of transformations when combined into pipelines. We develop a generic method for finding effective pipeline prototypes. Evaluated using Scikit-learn, our effective pipeline prototypes, when optimized, provide results that reach 90% of the optimal predictive accuracy in the median, but at a cost that is 24 times smaller.
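The distinction the abstract draws between a pipeline prototype (an ordered sequence of transformations) and an executable pipeline (the same sequence with a parametrized operator chosen for each transformation) can be illustrated with a small sketch. It is not the authors' method or code. The search space, the function names, and the parameter grids below are hypothetical; the operator names are plain strings that merely echo Scikit-learn class names, and nothing here calls Scikit-learn.

```python
from itertools import permutations, product

# Hypothetical search space (an assumption for illustration only):
# transformation -> candidate operators -> grid of parameter values.
SEARCH_SPACE = {
    "normalization": {
        "StandardScaler": {"with_mean": [True, False]},
        "MinMaxScaler": {},
    },
    "discretization": {
        "KBinsDiscretizer": {"n_bins": [3, 5, 7]},
    },
}


def prototypes(transformations):
    """Return every ordering of the transformations (the pipeline prototypes)."""
    return list(permutations(transformations))


def operator_configs(transformation):
    """List every (operator, params) choice available for one transformation."""
    configs = []
    for operator, grid in SEARCH_SPACE[transformation].items():
        names = sorted(grid)
        # An empty grid yields exactly one configuration with no parameters.
        for values in product(*(grid[name] for name in names)):
            configs.append((operator, dict(zip(names, values))))
    return configs


def executable_pipelines(prototype):
    """Expand one prototype into all executable pipelines it can produce."""
    per_step = [operator_configs(t) for t in prototype]
    return [list(steps) for steps in product(*per_step)]


if __name__ == "__main__":
    for proto in prototypes(["normalization", "discretization"]):
        print(" -> ".join(proto), len(executable_pipelines(proto)))
```

In this toy space each of the two prototypes expands into 9 executable pipelines, so fixing a prototype first halves what an optimizer has to explore. How prototypes are actually ranked and optimized is described in the paper itself, not in this sketch.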

Citation: Giovanelli, J.; Bilalli, B.; Abelló, A. Effective data pre-processing for AutoML. In: International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data. "Proceedings of the 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP): co-located with the 24th International Conference on Extending Database Technology and the 24th International Conference on Database Theory (EDBT/ICDT 2021): Nicosia, Cyprus, March 23, 2021". CEUR-WS.org, 2021, p. 1-10. ISSN 1613-0073.

URI: http://hdl.handle.net/2117/344761

ISSN: 1613-0073

Publisher version: http://ceur-ws.org/Vol-2840/paper1.pdf
