Transformer-based kernels for Natural Language Processing
Tutor / director: Belanche Muñoz, Luis Antonio
Document type: Master thesis
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder
Kernel methods (KM) and Artificial Neural Networks (ANN) are two of the most widely used families of methods in Machine Learning today, with the Support Vector Machine (SVM) and the Deep Learning Neural Network (DLNN) as their best-known members. One of the main strengths of both families is how they can be adapted to different kinds of input data, although they achieve this in different ways: KM rely on different kernel functions, while ANN rely on specific network architectures. In the particular field of Natural Language Processing (NLP), both families have dedicated resources for dealing with text inputs: KM rely on specialised kernels such as the string kernel, whereas ANN rely on Recurrent Neural Network (RNN) architectures and, more recently, on Transformer architectures. Even though KM were the first to enter the NLP field, ANN are nowadays the state of the art in all kinds of NLP tasks. The main reason for this success is that an ANN acts as a nonlinear parametric feature extractor, processing the text in a complex way that lets the model learn an adequate representation of the input. In contrast, kernel functions rely only on fixed similarity measures to capture that information, which is less flexible and powerful. For this reason, we developed a new kernel function, called the Transformer kernel. This kernel function is a parametric model like an ANN, but instead of computing a nonlinear representation of the input, it directly computes a similarity measure that corresponds to an inner product in some implicit feature space. The main idea behind this model is that it can be pre-trained on a generic task using the whole Wikipedia corpus (like the various Transformer models for NLP tasks), and the resulting model can then be used directly as a kernel function with any kernel method (such as the SVM).
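The core idea above can be sketched in a few lines: a parametric encoder maps each text to a vector, and the kernel value is simply the inner product of two such vectors, so it is a valid kernel by construction. In this minimal sketch a toy deterministic pseudo-random embedding stands in for the pre-trained BERT-style encoder; the function names, the embedding dimension, and the seeding scheme are illustrative assumptions, not the thesis implementation.

```python
import random
import zlib

def encode(text, dim=8):
    # Hypothetical stand-in for the pre-trained Transformer encoder:
    # a deterministic pseudo-random vector replaces the real forward pass.
    # Seeding from a CRC of the text makes the embedding reproducible.
    rng = random.Random(zlib.crc32(text.encode("utf-8")))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def transformer_kernel(x, y):
    # The kernel value is an inner product in the feature space defined
    # by the encoder, so symmetry and positive semi-definiteness hold
    # by construction -- exactly what a kernel method like the SVM needs.
    return sum(a * b for a, b in zip(encode(x), encode(y)))

texts = ["a sample sentence", "another sample", "a sample sentence"]
K = [[transformer_kernel(a, b) for b in texts] for a in texts]
```

Because the Gram matrix `K` built this way is a valid kernel matrix, it could be handed to any off-the-shelf kernel method that accepts a precomputed kernel.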
To do so, we build the model by stacking two components: a Transformer Encoder block (based on the BERT architecture) and a newly designed kernel function called the Attention Kernel. Using this kernel function on real text datasets, we have shown that it yields results substantially better than those obtained with the string kernel. We have also seen that, because our model is built as a Neural Network and can be executed on a GPU, the Transformer kernel can compute the kernel matrix of a dataset much faster than the classical string kernel. The results are very promising and suggest that further study of this kernel's behaviour with different pre-training datasets and different final tasks would be very interesting. Moreover, it remains to be explored whether the kernel can be adapted to combine text input with other kinds of data (such as numerical or categorical variables), and whether it can be fine-tuned on a specific task before being used as a kernel function.
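A plausible structural reason for the speed-up claimed above (an assumption about the usual setup, not a detail stated in the abstract) is that each text only needs to pass through the encoder once; after that, the full n-by-n kernel matrix reduces to pairwise inner products, which on a GPU collapses into a single matrix multiplication. A pure-Python sketch with a toy encoder standing in for the real one:

```python
import random
import zlib

def encode(text, dim=8):
    # Hypothetical placeholder for the pre-trained encoder: a deterministic
    # pseudo-random embedding replaces the real BERT-style forward pass.
    rng = random.Random(zlib.crc32(text.encode("utf-8")))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def gram_matrix(texts):
    # Each text is encoded exactly once (n encoder forward passes)...
    E = [encode(t) for t in texts]
    # ...then every kernel entry is an inner product of cached embeddings.
    # On a GPU this double loop becomes one matrix product, E @ E^T,
    # whereas a string kernel must compare every pair of raw strings.
    return [[sum(a * b for a, b in zip(u, v)) for v in E] for u in E]

K = gram_matrix(["one text", "another text", "a third text"])
```

The contrast with the string kernel is that no per-pair work touches the raw text: all pairwise cost is dense linear algebra, which is exactly what GPUs accelerate well.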
Subjects: Kernel functions, Neural networks (Computer science)
Degree: Master's Degree in Innovation and Research in Informatics (2012 syllabus)