Machine learning methods for the analysis of liquid chromatography-mass spectrometry datasets in metabolomics
ColaboratorLlorach Asunción, Rafael; Perera Lluna, Alexandre; Universitat Politècnica de Catalunya. Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial
Document typeDoctoral thesis
PublisherUniversitat Politècnica de Catalunya
Rights accessOpen Access
Tesi per compendi de publicacionsLiquid Chromatography-Mass Spectrometry (LC/MS) instruments are widely used in Metabolomics. To analyse their output, it is necessary to use computational tools and algorithms to extract meaningful biological information. The main goal of this thesis is to provide with new computational methods and tools to process and analyse LC/MS datasets in a metabolomic context. A total of 4 tools and methods were developed in the context of this thesis. First, it was developed a new method to correct possible non-linear drift effects in the retention time of the LC/MS data in Metabolomics, and it was coded as an R package called HCor. This method takes advantage of the retention time drift correlation found in typical LC/MS data, in which there are chromatographic regions in which their retention time drift is consistently different than other regions. Our method makes the hypothesis that this correlation structure is monotonous in the retention time and fits a non-linear model to remove the unwanted drift from the dataset. This method was found to perform especially well on datasets suffering from large drift effects when compared to other state-of-the art algorithms. Second, it was implemented and developed a new method to solve known issues of peak intensity drifts in metabolomics datasets. This method is based on a two-step approach in which are corrected possible intensity drift effects by modelling the drift and then the data is normalised using the median of the resulting dataset. The drift was modelled using a Common Principal Components Analysis decomposition on the Quality Control classes and taking one, two or three Common Principal Components to model the drift space. This method was compared to four other drift correction and normalisation methods. The two-step method was shown to perform a better intensity drift removal than all the other methods. All the tested methods including the two-step method were coded as an R package called intCor and it is publicly available. Third, a new processing step in the LC/MS data analysis workflow was proposed. In general, when LC/MS instruments are used in a metabolomic context, a metabolite may give a set of peaks as an output. However, the general approach is to consider each peak as a variable in the machine learning algorithms and statistical tests despite the important correlation structure found between those peaks coming from the same source metabolite. It was developed an strategy called peak aggregation techniques, that allow to extract a measure for each metabolite considering the intensity values of the peaks coming from this metabolite across the samples in study. If the peak aggregation techniques are applied on each metabolite, the result is a transformed dataset in which the variables are no longer the peaks but the metabolites. 4 different peak aggregation techniques were defined and, running a repeated random sub-sampling cross-validation stage, it was shown that the predictive power of the data was improved when the peak aggregation techniques were used regardless of the technique used. Fourth, a computational tool to perform end-to-end analysis called MAIT was developed and coded under the R environment. The MAIT package is highly modular and programmable which ease replacing existing modules for user-created modules and allow the users to perform their personalised LC/MS data analysis workflows. By default, MAIT takes the raw output files from an LC/MS instrument as an input and, by applying a set of functions, gives a metabolite identification table as a result. It also gives a set of figures and tables to allow for a detailed analysis of the metabolomic data. MAIT even accepts external peak data as an input. Therefore, the user can insert peak table obtained by any other available tool and MAIT can still perform all its other capabilities on this dataset like a classification or mining the Human Metabolome Dataset which is included in the package.
- Tesis - TDX-UPC