Effective data pre-processing for AutoML

Cite as:
hdl:2117/344761
Document type: Conference report
Defense date: 2021
Publisher: CEUR-WS.org
Rights access: Open Access
This work is protected by the corresponding intellectual and industrial property rights.
Except where otherwise noted, its contents are licensed under a Creative Commons license: Attribution 4.0 International
Abstract
Data pre-processing plays a key role in a data analytics process (e.g., supervised learning). It encompasses a broad range of activities that span from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, nor are there defined rules, on how pre-processing transformations (e.g., normalization, discretization) impact the final results of the analysis. The problem is exacerbated when transformations are combined into pre-processing pipeline prototypes. Data scientists cannot easily foresee the impact of pipeline prototypes and hence require a method to discriminate between them and find the most relevant ones (e.g., those with the highest positive impact) for their study at hand. Once found, these pipelines can be optimized using AutoML in order to generate executable pipelines (i.e., with parametrized operators for each transformation). In this work, we study the impact of transformations in general, and the impact of transformations when combined into pipelines. We develop a generic method for finding effective pipeline prototypes. Evaluated using Scikit-learn, our effective pipeline prototypes, when optimized, reach 90% of the optimal predictive accuracy in the median, at a cost that is 24 times smaller.
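The distinction the abstract draws between a pipeline prototype (an ordered sequence of transformations) and an executable pipeline (the same sequence with a parametrized operator chosen for each transformation) can be illustrated with a minimal sketch. This is not the authors' method: the search space below is hypothetical, and the operator names only echo Scikit-learn classes as plain labels. It merely shows how one prototype expands into many executable candidates that an optimizer would have to choose among.

```python
from itertools import product

# A pipeline prototype: an ordered sequence of transformation kinds,
# with no operators or parameters chosen yet.
prototype = ("normalization", "discretization", "feature_selection")

# Hypothetical search space (an assumption for illustration): candidate
# operators and parameter values for each transformation. The names echo
# Scikit-learn classes but are used here only as labels.
search_space = {
    "normalization": [("StandardScaler", {}), ("MinMaxScaler", {})],
    "discretization": [("KBinsDiscretizer", {"n_bins": n}) for n in (3, 5, 10)],
    "feature_selection": [("SelectKBest", {"k": k}) for k in (5, 10)],
}


def executable_pipelines(prototype, search_space):
    """Yield every parametrized instantiation of a prototype."""
    choices = [search_space[step] for step in prototype]
    for combo in product(*choices):
        # Pair each transformation kind with its chosen (operator, params).
        yield list(zip(prototype, combo))


pipelines = list(executable_pipelines(prototype, search_space))
print(len(pipelines))  # 2 * 3 * 2 = 12 executable candidates
```

Even this toy space yields 12 executable pipelines for a single prototype, which suggests why narrowing down the prototypes before optimization can reduce cost.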
Citation: Giovanelli, J.; Bilalli, B.; Abelló, A. Effective data pre-processing for AutoML. In: International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data. "Proceedings of the 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP): co-located with the 24th International Conference on Extending Database Technology and the 24th International Conference on Database Theory (EDBT/ICDT 2021): Nicosia, Cyprus, March 23, 2021". CEUR-WS.org, 2021, p. 1-10. ISSN 1613-0073.
ISSN: 1613-0073
Publisher version: http://ceur-ws.org/Vol-2840/paper1.pdf
Files | Description | Size | Format | View |
---|---|---|---|---|
Giovanelli et al.pdf | | 2,499Mb | | View/Open |