Effective data pre-processing for AutoML
Document type: Conference report
Rights access: Open Access
Data pre-processing plays a key role in a data analytics process (e.g., supervised learning). It encompasses a broad range of activities, from correcting errors to selecting the most relevant features for the analysis phase. There is no clear evidence, nor are there defined rules, on how pre-processing transformations (e.g., normalization, discretization, etc.) impact the final results of the analysis. The problem is exacerbated when transformations are combined into pre-processing pipeline prototypes. Data scientists cannot easily foresee the impact of pipeline prototypes and hence require a method to discriminate between them and find the most relevant ones (e.g., those with the highest positive impact) for their study at hand. Once found, these pipelines can be optimized using AutoML in order to generate executable pipelines (i.e., with parametrized operators for each transformation). In this work, we study the impact of transformations in general, as well as their impact when combined together into pipelines. We develop a generic method for finding effective pipeline prototypes. Evaluated using Scikit-learn, our effective pipeline prototypes, when optimized, provide results that reach 90% of the optimal predictive accuracy in the median, at a cost that is 24 times smaller.
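To make the notions of "pipeline prototype" and "executable pipeline" concrete, the following is a minimal sketch (not the paper's actual method) of how such a prototype can be expressed in Scikit-learn and then optimized: transformations such as normalization and discretization are chained in a fixed order, and a hyper-parameter search parametrizes each operator. The specific transformations, learner, and parameter grid are illustrative assumptions.

```python
# Hypothetical sketch: a pre-processing pipeline prototype whose
# operators are parametrized by a search, yielding an executable pipeline.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prototype: an ordered combination of transformations (normalization,
# then discretization) feeding a supervised learner.
prototype = Pipeline([
    ("normalize", StandardScaler()),
    ("discretize", KBinsDiscretizer(encode="ordinal")),
    ("learner", DecisionTreeClassifier(random_state=0)),
])

# Optimization: search over the operators' parameters, turning the
# prototype into a fully parametrized (executable) pipeline.
search = GridSearchCV(
    prototype,
    param_grid={
        "discretize__n_bins": [3, 5, 7],
        "learner__max_depth": [2, 4, None],
    },
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```

In the paper's setting, the key question is which prototypes (i.e., which ordered combinations of transformations) are worth submitting to this costly optimization step at all.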
Citation: Giovanelli, J.; Bilalli, B.; Abelló, A. Effective data pre-processing for AutoML. In: International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data. "Proceedings of the 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP): co-located with the 24th International Conference on Extending Database Technology and the 24th International Conference on Database Theory (EDBT/ICDT 2021): Nicosia, Cyprus, March 23, 2021". CEUR-WS.org, 2021, p. 1-10. ISSN 1613-0073.