Impact of filter feature selection on classification: an empirical study
Document type: Conference lecture
Rights access: Open Access
The high dimensionality of Big Data poses challenges in data understanding and visualization. Furthermore, it leads to lengthy model building times in data analysis and poor generalization for machine learning models. Consequently, there is a need for feature selection, which identifies the most relevant part of the data to improve the data analysis (e.g., building simpler and more understandable models with reduced training time and improved model performance). This study aims to (i) characterize the factors (i.e., dataset characteristics) that influence the performance of feature selection methods, and (ii) assess the impact of feature selection on the training time and accuracy of binary and multiclass classification problems. To this end, we propose a systematic method to select representative datasets (i.e., considering the distributions of several dataset characteristics) in a given repository. Next, we provide an empirical study of the impact of eight feature selection methods on Naive Bayes (NB), k-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), and Multilayer Perceptron (MLP) classification algorithms using 32 real-world datasets and a relative performance measure. We observed that feature selection is more effective in reducing training time (e.g., up to 60% for LDA classifiers) than in improving classification accuracy (e.g., up to 5%). Furthermore, we observed that feature selection gives a slight accuracy improvement for binary classification (i.e., up to 5%), while it mostly leads to accuracy degradation for multiclass classification. Although none of the studied feature selection methods is best in all cases, for multiclass classification we observed that the correlation-based and minimum redundancy maximum relevance feature selection methods gave the best accuracy. Through statistical testing, we found that LDA and MLP benefit more in accuracy from feature selection than KNN and NB.
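The general idea of filter feature selection and of a relative performance measure can be illustrated with a minimal sketch. It is not the method used in the study: the ranking below is a plain univariate Pearson-correlation filter (simpler than the correlation-based method the abstract mentions), and `relative_change` is a hypothetical definition of a relative measure, since the abstract does not state the formula. The function names and the toy data are assumptions made for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def filter_select(X, y, k):
    """Return the indices of the k features with the highest |correlation| to the label.

    A filter method scores each feature independently of any classifier,
    which is why it can be applied once, before training.
    """
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def relative_change(before, after):
    """Hypothetical relative measure: fractional change with respect to the baseline."""
    return (after - before) / before


# Toy data (assumed): feature 0 copies the label, feature 1 is constant,
# feature 2 is uncorrelated with the label.
X = [[0, 5, 1], [1, 5, 0], [0, 5, 0], [1, 5, 1]]
y = [0, 1, 0, 1]
selected = filter_select(X, y, k=1)
```

With such a measure, an accuracy moving from 0.80 to 0.84 corresponds to a relative change of about 5%, and a training time moving from 10 s to 4 s to a relative change of -60%; these numbers are only arithmetic examples, not results from the study.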
Citation: Njoku, U. [et al.]. Impact of filter feature selection on classification: an empirical study. In: International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data. "Proceedings of the 24th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP): co-located with the 24th International Conference on Extending Database Technology and the 24th International Conference on Database Theory (EDBT/ICDT 2022): United Kingdom, March 29, 2022". CEUR-WS.org, 2022, p. 71-80. ISSN 1613-0073.