Màster universitari en Ciència de Dades (Pla 2021)
http://hdl.handle.net/2117/390017
http://hdl.handle.net/2117/396745
Deep learning on genomics using NLP-oriented algorithms
Moyano Gravalos, Carlos
This work provides a DNABERT-based solution, inspired by NLP, for the phage-bacteria interaction (PBI) problem, as well as a comprehensive literature review of the applied Deep Learning (DL) techniques. DNABERT is a novel approach to DNA-based tasks built upon the popular BERT Transformer model. The report first introduces and explains the core DL concepts that were of paramount importance for developing the new ad-hoc models, and also provides background for understanding the current importance of Phage Therapy (PT) and how the PBI problem relates to it. In brief, PT uses phages instead of antibiotics to treat bacterial infections, and the focus of this work is to build a DL-based classifier capable of working with DNA information in text format from phages and bacteria, and of determining whether a given phage is likely to interact with (attack/neutralize) a given bacterium. The first concept discussed is RNNs and, more concretely, LSTM networks; this type of neural network serves as a starting point for reaching the Transformer model by following the evolution of DL in the context of the NLP field. The attention mechanism and its implementation in the Transformer model are also discussed in detail. The document explains how this mechanism improves performance in various NLP tasks such as machine translation (MT), text summarization, and sentiment analysis and, more importantly, how it can also be used to highlight the most relevant areas of a DNA sequence. This has been the main objective of the project, together with improving the performance results of previously existing neural models for the PBI problem. The motivation behind this research is also presented, along with the important research questions that were explored and guided the development, and an extensive section is devoted to presenting the obtained results.
Overall, this report aims to contribute to the field of Data Science by providing insights into how DL and NLP-oriented algorithms can be successfully applied to Genomics research to defeat major threats to human health such as superbugs. As a result of the developed work, six new Transformer-based DL models have been created, trained, and evaluated, with satisfactory results that improve on those of previous neural models solving the same PBI problem. To encourage future developments, a section at the end of the document points out promising extensions of the work that has been carried out.
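The abstract describes feeding DNA sequences in text form to a BERT-style model. DNABERT represents a sequence as overlapping k-mers, which then play the role of word tokens. A minimal sketch of that tokenization step (k=6 is the commonly used choice; the function name is illustrative, not from the thesis):

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers, the token unit
    that DNABERT-style models operate on instead of natural-language words."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# An 8-base sequence yields 8 - 6 + 1 = 3 overlapping 6-mers.
tokens = kmer_tokenize("ATGCGTAC", k=6)
print(tokens)  # ['ATGCGT', 'TGCGTA', 'GCGTAC']
```

Each k-mer is then mapped to a vocabulary index and fed to the Transformer, whose attention weights over these tokens are what allows highlighting the most relevant areas of the sequence.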
http://hdl.handle.net/2117/396651
Data discovery through profile-based similarity metrics
Maynou Yelamos, Marc
The most essential step in a data integration process is to find the datasets whose combined information provides relevant insights. This task, defined as data discovery, is highly dependent on the definition of the similarity between the candidate attributes to join, which commonly involves assessing the closeness of the semantic concepts that the two attributes represent. Most of the state-of-the-art approaches to this issue rely on syntactic methodologies, that is, procedures in which the instances of the two columns are compared to determine whether they are similar or not. These approaches suffice when the two sets of instances share the same syntactic representation but fail to detect cases in which the same semantic idea is represented by different sets of values. This latter case is ever-increasing in proportion, given the characteristics of big-data environments and the lack of standardization of the data. The aim of this project is to develop a system that solves this issue and facilitates the establishment of relationships between related data that do not share a syntactic relationship. The approach presented in this work combines the extensively studied syntactic approaches to data discovery with a new formulation of semantic similarity: the resemblance of probability distributions. Additionally, this system will be made scalable and able to handle vast quantities of data.
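The core idea above, comparing attributes by the resemblance of their value distributions rather than by matching instances, can be sketched with a standard divergence. The snippet below (an illustration, not the thesis's actual metric) scores two columns via one minus their Jensen-Shannon divergence over empirical value frequencies:

```python
from collections import Counter
from math import log2

def distribution(values):
    """Empirical probability distribution over the observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def js_similarity(col_a, col_b):
    """1 minus the Jensen-Shannon divergence (base-2, so bounded in [0, 1])
    between the two columns' value distributions: 1.0 means identical
    distributions, 0.0 means disjoint supports."""
    p, q = distribution(col_a), distribution(col_b)
    support = set(p) | set(q)
    m = {v: 0.5 * (p.get(v, 0) + q.get(v, 0)) for v in support}
    def kl(d):
        return sum(d[v] * log2(d[v] / m[v]) for v in support if d.get(v, 0) > 0)
    return 1.0 - (0.5 * kl(p) + 0.5 * kl(q))
```

Because only distributions are compared, two columns encoding the same concept with different value sets (e.g. country codes vs. country names) can still be related once their values are mapped to a shared representation.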
http://hdl.handle.net/2117/394532
Optimization through quantum computing
Iglesias Munilla, Andrea
This thesis explores the application of quantum computing techniques to solve Quadratic Unconstrained Binary Optimization problems, with a focus on the Unit Commitment problem. The thesis provides an introduction to quantum computing, including its mathematical foundation and the distinction between classical and quantum systems. It then discusses Variational Quantum Algorithms and explores various quantum computing platforms. A novel formulation of the Unit Commitment problem is then presented, along with its implementation using the Qiskit library. The results obtained from the implementation are summarized, highlighting the process of using quantum computing for solving optimization problems.
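A QUBO asks for a binary vector x minimising x^T Q x. The toy sketch below (not the thesis's formulation) solves a two-unit commitment by exhaustive search: each switched-on unit i costs c_i, and a penalty P(1 - x0)(1 - x1) forces at least one unit on to cover demand; expanding that penalty gives the diagonal and off-diagonal Q coefficients:

```python
from itertools import product

def solve_qubo(Q):
    """Exhaustively minimise x^T Q x over binary vectors x.
    Q is a dict mapping (i, j) index pairs to coefficients."""
    n = 1 + max(max(i, j) for i, j in Q)
    best_x, best_e = None, float("inf")
    for x in product((0, 1), repeat=n):
        e = sum(coeff * x[i] * x[j] for (i, j), coeff in Q.items())
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# P*(1 - x0 - x1 + x0*x1) + c0*x0 + c1*x1, dropping the constant P:
P, c = 10, (3, 5)
Q = {(0, 0): c[0] - P, (1, 1): c[1] - P, (0, 1): P}
x, e = solve_qubo(Q)
print(x, e + P)  # (1, 0) 3 -> only the cheaper unit is committed
```

On quantum hardware the same Q would instead be mapped to an Ising Hamiltonian and handed to a variational algorithm such as QAOA; the brute-force solver here just makes the encoding concrete.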
http://hdl.handle.net/2117/394525
On the composition of neural and kernel layers for machine learning
Martorell Locascio, Alex
Deep Learning architectures in which neural layers alternate with mappings to infinite-dimensional feature spaces have been proposed in recent years, showing improvements over the results obtained when using either technique separately. However, these new algorithms have been presented without delving into the rich mathematical structure that sustains kernel methods. The main focus of this thesis is not only to review these advances in the field of Deep Learning, but to extend and generalize them by defining a broader family of models that operate under the mathematical framework defined by the composition of a neural layer with a kernel mapping, all of which operate in reproducing kernel Hilbert spaces that are then concatenated. Each of these spaces has a specific reproducing kernel that we can characterize. Together, all of this defines a regularization-based learning optimization problem, for which we prove that minimizers exist. This strong mathematical background is complemented by the presentation of a new model, the Kernel Network, which produces successful results on many classification problems.
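The composition "neural layer followed by an infinite-dimensional kernel mapping" can be made computable by approximating the RKHS feature map. The sketch below uses random Fourier features, a standard finite approximation of the RBF-kernel map (Rahimi and Recht); it is a generic illustration of the layer composition, not the construction proved about in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_layer(X, n_features=64, gamma=1.0):
    """Random Fourier features: a finite-dimensional approximation of the
    RBF kernel's feature map phi, so that z(x) . z(y) ~ exp(-gamma*||x-y||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Composition: a (random, untrained) neural layer, then the kernel mapping,
# yielding a finite embedding on which further layers could be stacked.
X = rng.normal(size=(5, 3))   # 5 samples, 3 raw features
A = rng.normal(size=(3, 4))   # weights of the neural layer
Z = rff_layer(np.tanh(X @ A), n_features=64)
print(Z.shape)  # (5, 64)
```

Stacking several such blocks gives exactly the alternating neural/kernel structure discussed above, with each block's image living in (an approximation of) its own RKHS.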
http://hdl.handle.net/2117/394524
Comparing Algorithms for Predictive Data Analytics
Kirov, Goran
This master's degree thesis is composed of theoretical and practical parts. The theoretical part describes the basics of predictive data analytics and machine learning algorithms for classification such as Logistic Regression, Decision Tree, Random Forest, SVM, and KNN. We also describe the evaluation metrics used to measure the performance of these algorithms, such as Recall, Precision, Accuracy, F1 Score, Cohen's Kappa, Hamming Loss, and Jaccard Index. Additionally, we record the time taken for the training and prediction processes to provide insights into algorithm scalability. The key part of the thesis is the practical part, which compares these algorithms with a self-implemented tool that reports results for the different evaluation metrics on seven datasets. First, we describe the implementation of an application for testing in which we measure the evaluation metric scores. We tested these algorithms on all seven datasets using Python libraries such as scikit-learn. Finally, we analyze the results obtained and provide final conclusions.
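Several of the metrics named above reduce to counts from the confusion matrix. As a minimal illustration (independent of the thesis's tool, which uses scikit-learn), the binary case can be computed from first principles:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and Cohen's kappa for binary labels,
    computed directly from confusion-matrix counts."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # Cohen's kappa corrects observed agreement by chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (acc - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "kappa": kappa}

m = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
print(m["accuracy"])  # ~0.667: 4 of the 6 predictions are correct
```

In practice scikit-learn's `sklearn.metrics` module provides all of these (plus Hamming Loss and Jaccard Index) for the multi-class case.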
http://hdl.handle.net/2117/394523
Kernel methods with mixed data types and their applications
Arqué Martínez, Arnau
Support Vector Machines (SVMs) represent a category of supervised machine learning algorithms that find extensive application in both classification and regression tasks. In these algorithms, kernel functions are responsible for measuring the similarity between input samples to generate models and perform predictions. In order for SVMs to tackle data analysis tasks involving mixed data, the implementation of a valid kernel function for this purpose is required. However, in the current literature, we hardly find any kernel function specifically designed to measure similarity between mixed data. In addition, there is a complete lack of significant examples where these kernels have been practically implemented. Another notable characteristic of SVMs is their remarkable efficacy in addressing high-dimensional problems. However, they can become inefficient when dealing with large volumes of data. In this project, we propose the formulation of a kernel function capable of accurately capturing the similarity between samples of mixed data. We also present an SVM algorithm based on Bagging techniques that enables efficient analysis of large volumes of data. Additionally, we implement both proposals in an updated version of the successful SVM library LIBSVM. Moreover, we evaluate their effectiveness, robustness and efficiency, obtaining promising results.
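A common starting point for a mixed-data similarity of the kind discussed above is a Gower-style score: numeric features contribute a range-normalized distance, categorical features an exact-match indicator, and the per-feature similarities are averaged into [0, 1]. The sketch below is a hypothetical illustration of that idea, not the kernel proposed in the thesis:

```python
def mixed_similarity(x, y, num_idx, ranges):
    """Gower-style similarity between two mixed-type samples:
    numeric feature i (i in num_idx) contributes 1 - |x_i - y_i| / ranges[i],
    categorical features contribute 1 on an exact match and 0 otherwise."""
    sims = []
    for i, (a, b) in enumerate(zip(x, y)):
        if i in num_idx:
            sims.append(1.0 - abs(a - b) / ranges[i])
        else:
            sims.append(1.0 if a == b else 0.0)
    return sum(sims) / len(sims)

# Feature 0 is numeric with range 100; features 1 and 2 are categorical.
k = mixed_similarity((25, "red", "yes"), (35, "red", "no"),
                     num_idx={0}, ranges={0: 100})
print(k)  # (0.9 + 1 + 0) / 3 = 0.6333...
```

To serve as an SVM kernel, such a similarity must additionally be positive semi-definite on the data, which is part of what makes designing a valid mixed-data kernel, as the thesis does, non-trivial.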
http://hdl.handle.net/2117/394369
Precision and power grip detection in egocentric hand-object Interaction using machine learning
Huapaya Sierra, Rodrigo Arian
This project was carried out in Yverdon-les-Bains, Switzerland, as a collaboration between the University of Applied Sciences and Arts Western Switzerland (HEIG-VD / HES-SO) and the Centre Hospitalier Universitaire Vaudois (CHUV) in Lausanne. It focuses on the detection of grasp types from an egocentric point of view. The objective is to accurately determine the kind of grasp (power, precision, and none) performed by a user based on images captured from their perspective. The successful implementation of this grasp detection system would greatly benefit the evaluation of patients undergoing upper limb rehabilitation. Various computer vision frameworks were utilized to detect hands, interacting objects, and depth information in the images. These extracted features were then fed into deep learning models for grasp prediction. Both custom recorded datasets and open-source datasets, such as EpicKitchen and the Yale dataset, were employed for training and evaluation. In conclusion, this project achieved satisfactory results in the detection of grasp types from an egocentric viewpoint, with a 0.76 F1-macro score on the final test set. The utilization of diverse videos, including custom recordings and publicly available datasets, facilitated comprehensive training and evaluation. A robust pipeline was developed through iterative refinement, enabling the extraction of crucial features from each frame to predict grasp types accurately. Furthermore, data mixtures were proposed to enlarge the dataset and improve the generalization performance of the models, which played a crucial role in the project's final stages.