Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival

Vilardell, Mireia; Buxó, Maria; Clèries, Ramon; Martínez Martínez, José Miguel; Ameijide, Alberto

doi:10.1016/j.artmed.2020.101875

Visualitza/Obre

Vilardell M et al 2020.pdf (7,374Mb) (Accés restringit) Sol·licita una còpia a l'autor

Veure estadístiques d'ús d'UPCommons

Estadístiques de LA Referencia / Recolecta

Cita com:

Mostra el registre d'ítem complet

Vilardell, Mireia

Buxó, Maria

Clèries, Ramon

Martínez Martínez, José Miguel

Ameijide, Alberto

Tipus de documentArticle

Data publicació2020-07-01

Condicions d'accésAccés restringit per política de l'editorial

Attribution-NonCommercial-NoDerivs 3.0 Spain

Llevat que s'hi indiqui el contrari, els continguts d'aquesta obra estan subjectes a la llicència de Creative Commons : Reconeixement-NoComercial-SenseObraDerivada 3.0 Espanya

Abstract

Background Two common issues may arise in certain population-based breast cancer (BC) survival studies: I) missing values in a survivals’ predictive variable, such as “Stage” at diagnosis, and II) small sample size due to “imbalance class problem” in certain subsets of patients, demanding data modeling/simulation methods. Methods We present a procedure, ModGraProDep, based on graphical modeling (GM) of a dataset to overcome these two issues. The performance of the models derived from ModGraProDep is compared with a set of frequently used classification and machine learning algorithms (Missing Data Problem) and with oversampling algorithms (Synthetic Data Simulation). For the Missing Data Problem we assessed two scenarios: missing completely at random (MCAR) and missing not at random (MNAR). Two validated BC datasets provided by the cancer registries of Girona and Tarragona (northeastern Spain) were used. Results In both MCAR and MNAR scenarios all models showed poorer prediction performance compared to three GM models: the saturated one (GM.SAT) and two with penalty factors on the partial likelihood (GM.K1 and GM.TEST). However, GM.SAT predictions could lead to non-reliable conclusions in BC survival analysis. Simulation of a “synthetic” dataset derived from GM.SAT could be the worst strategy, but the use of the remaining GMs models could be better than oversampling. Conclusion Our results suggest the use of the GM-procedure presented for one-variable imputation/prediction of missing data and for simulating “synthetic” BC survival datasets. The “synthetic” datasets derived from GMs could be also used in clinical applications of cancer survival data such as predictive risk analysis.

CitacióVilardell, M. [et al.]. Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival. "Artificial intelligence in medicine", 1 Juliol 2020, vol. 107, p. 1018757/1-101875/11.

URIhttp://hdl.handle.net/2117/340631

DOI10.1016/j.artmed.2020.101875

ISSN0933-3657

Versió de l'editorhttps://www.sciencedirect.com/science/article/pii/S0933365719311595

Col·leccions

Veure estadístiques d'ús d'UPCommons

Mostra el registre d'ítem complet

Fitxers	Descripció	Mida	Format	Visualitza
Vilardell M et al 2020.pdf		7,374Mb	PDF	Accés restringit

UPCommons. Portal del coneixement obert de la UPC

Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival

Visualitza/Obre

Explora