An investigation into new kernels for categorical variables
Tutor / director / evaluator: Belanche Muñoz, Luis Antonio
Document type: Master thesis
Rights access: Open Access
Kernel-based methods first appeared in the form of support vector machines. Since the first Support Vector Machine (SVM) formulation in 1995, the number of proposed kernel functions has grown quickly, and these kernels have been applied to a wide range of problems and domains. The most common and direct applications of these methods focus on continuous numeric data, given that training an SVM ultimately involves solving an optimization problem. Some kernel functions have also been oriented to more symbolic data, in problems such as text analysis or handwritten digit recognition. Surprisingly, however, there is a gap in the area of kernel functions devoted to handling datasets with qualitative variables. One of the most common practices to work around this gap consists of recoding the source qualitative information to make it suitable for numeric kernel functions. This thesis presents the development of new kernel functions that can better model symbolic information presented as categorical variables, in a direct way and without the need for data preprocessing methods. The proposal is based on the use of probabilistic information (the probability mass distribution) to compare the different modalities of a variable. Additionally, the idea is formulated through a modular schema that combines a set of components to obtain the kernel functions, which makes it easier to modify and extend single components. The experimental results suggest a slight improvement in accuracy on classification problems with respect to traditional kernel functions. This improvement is clearer on datasets with a known probabilistic structure.
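To make the idea of comparing modalities through their probability mass more concrete, the following is a minimal sketch and not the thesis's actual formulation. It assumes a univariate kernel that scores 0 for different modalities and, for a match, a value that decreases with the estimated probability of the shared modality, so agreeing on a rare modality counts more. The names `fit_pmf`, `univariate_kernel` and `categorical_kernel`, the parameter `alpha`, the specific weighting `(1 - p^alpha)^(1/alpha)`, and the choice to average over variables are all assumptions made for illustration.

```python
from collections import Counter


def fit_pmf(column):
    """Estimate the probability mass of each modality from one column."""
    counts = Counter(column)
    total = len(column)
    return {value: n / total for value, n in counts.items()}


def univariate_kernel(a, b, pmf, alpha=1.0):
    """Compare two modalities of a single categorical variable.

    Assumed form (hypothetical): 0 for a mismatch; for a match,
    (1 - p**alpha) ** (1 / alpha), where p is the estimated probability
    of the shared modality. Rare matches score higher than common ones.
    """
    if a != b:
        return 0.0
    p = pmf.get(a, 0.0)  # unseen modalities are treated as probability 0
    return (1.0 - p ** alpha) ** (1.0 / alpha)


def categorical_kernel(x, y, pmfs, alpha=1.0):
    """Combine the per-variable kernels by averaging (one possible choice)."""
    scores = [univariate_kernel(a, b, pmf, alpha)
              for a, b, pmf in zip(x, y, pmfs)]
    return sum(scores) / len(scores)


# Toy dataset with two categorical variables (colour, size).
data = [
    ("red", "small"),
    ("red", "large"),
    ("red", "small"),
    ("blue", "small"),
]

# One probability mass function per variable, estimated column by column.
pmfs = [fit_pmf(column) for column in zip(*data)]

# Gram matrix over the toy dataset.
gram = [[categorical_kernel(x, y, pmfs) for y in data] for x in data]
```

Because the per-variable pieces are separate functions, the weighting or the way variables are combined can be swapped independently, which is in the spirit of the modular schema described above. A Gram matrix built this way could then be passed to any SVM implementation that accepts precomputed kernels.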