Decoding the exposome: understanding its influence on the molecular profile of lung cancer patients
Abstract
Accurately linking environmental and lifestyle exposures to molecular alterations could improve lung cancer stratification and help clarify what shapes this disease. However, standard tabular models struggle to represent the complex, multi-scale relationships among exposures, demographics, and familial risks. This thesis introduces LungCancerGNN, a heterogeneous temporal graph approach that encodes information at different levels of granularity and propagates it with message-passing Graph Neural Networks (GNNs). We evaluate graph construction choices, relational operators (GCN, GATv2, Transformer-style), and training/calibration strategies (class reweighting, resampling, focal loss, and per-class threshold search) using stratified 5-fold cross-validation with an inner calibration split. Compared with non-graph baseline strategies (e.g., logistic regression, XGBoost, MLP), the best GNN (single-phase message passing with GATv2 layers and temporal exposure encoding) improved the weighted F1 score from 0.51 to 0.56 and accuracy from 37% to 69%, on average. In addition, explainability analyses (i.e., attention + Integrated Gradients) helped identify the most important features: radon, active and secondhand tobacco exposure, and occupational exposure, with moderate concordance between attention and Integrated Gradients. We also discuss the limitations of the work, ethical considerations, and future directions, including explicit Wild-Type modelling, incorporation of objective environmental measurements, and prospective clinical validation.
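To illustrate one of the training strategies named in the abstract, the following is a minimal sketch of a multi-class focal loss in plain Python. It is not taken from the thesis: the function names (`softmax`, `focal_loss`, `mean_focal_loss`), the default `gamma=2.0`, and the optional `class_weights` argument are assumptions chosen for illustration, and the thesis may use a different formulation or library.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, class_weights=None):
    """Focal loss for one sample: -w_t * (1 - p_t)^gamma * log(p_t).

    gamma=0 recovers (weighted) cross-entropy; larger gamma shrinks the
    contribution of samples the model already classifies confidently.
    class_weights is a hypothetical per-class weight list (assumption).
    """
    p_t = softmax(logits)[target]
    weight = 1.0 if class_weights is None else class_weights[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(batch_logits, targets, gamma=2.0, class_weights=None):
    """Average the per-sample focal loss over a batch."""
    losses = [
        focal_loss(logits, target, gamma, class_weights)
        for logits, target in zip(batch_logits, targets)
    ]
    return sum(losses) / len(losses)
```

With `gamma=0` the function reduces to ordinary cross-entropy, which makes it easy to see how the focusing term changes the relative weight of easy and hard samples in an imbalanced setting.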

