Understanding complex predictive models with ghost variables

Delicado Useros, Pedro Francisco; Peña Sanchez de Rivera, Daniel

doi:10.1007/s11749-022-00826-x

dc.contributor.author	Delicado Useros, Pedro Francisco
dc.contributor.author	Peña Sanchez de Rivera, Daniel
dc.contributor.other	Universitat Politècnica de Catalunya. Departament d'Estadística i Investigació Operativa
dc.date.accessioned	2023-02-15T13:48:58Z
dc.date.available	2023-02-15T13:48:58Z
dc.date.issued	2022-08-24
dc.identifier.citation	Delicado, P.; Peña, D. Understanding complex predictive models with ghost variables. "Test", 24 August 2022, vol. 32; no. 1; p. 107–145
dc.identifier.issn	1863-8260
dc.identifier.uri	http://hdl.handle.net/2117/383386
dc.description	The version of record of this article, first published in Test, is available online at Publisher’s website: http://dx.doi.org/10.1007/s11749-022-00826-x
dc.description.abstract	Framed in the literature on Interpretable Machine Learning, we propose a new procedure to assign a measure of relevance to each explanatory variable in a complex predictive model. We assume that we have a training set to fit the model and a test set to check its out-of-sample performance. We propose to measure the individual relevance of each variable by comparing the predictions of the model in the test set with those obtained when the variable of interest is substituted (in the test set) by its ghost variable, defined as the prediction of this variable from the remaining explanatory variables. In linear models it is shown that, on the one hand, the proposed measure gives similar results to leave-one-covariate-out (loco) at a lower computational cost and outperforms random permutations, and on the other hand, it is strongly related to the usual F-statistic measuring the significance of a variable. In nonlinear predictive models (such as neural networks or random forests) the proposed measure shows the relevance of the variables in an efficient way, as shown by a simulation study comparing ghost variables with alternative methods (including loco and random permutations, and also knockoff variables and estimated conditional distributions). Finally, we study the joint relevance of the variables by defining the relevance matrix as the covariance matrix of the vectors of effects on predictions when using every ghost variable. Our proposal is illustrated with simulated examples and the analysis of a large real data set.
dc.language.iso	eng
dc.publisher	Springer
dc.rights	Attribution 4.0 International
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	UPC subject areas::Mathematics and statistics::Mathematical analysis
dc.subject.lcsh	Mathematical statistics
dc.subject.other	Explainable artificial intelligence
dc.subject.other	Estimated conditional distributions
dc.subject.other	Interpretable machine learning
dc.subject.other	Knockoffs
dc.subject.other	Leave-one-covariate-out
dc.subject.other	Out-of-sample prediction
dc.subject.other	Partial correlation matrix
dc.subject.other	Random permutations
dc.title	Understanding complex predictive models with ghost variables
dc.type	Article
dc.subject.lemac	Mathematical statistics
dc.contributor.group	Universitat Politècnica de Catalunya. ADBD - Anàlisi de Dades Complexes per a les Decisions Empresarials
dc.identifier.doi	10.1007/s11749-022-00826-x
dc.description.peerreviewed	Peer Reviewed
dc.subject.ams	AMS Classification::62 Statistics::62G Nonparametric inference
dc.subject.ams	AMS Classification::68 Computer science::68T Artificial intelligence
dc.relation.publisherversion	https://link.springer.com/article/10.1007/s11749-022-00826-x
dc.rights.access	Open Access
local.identifier.drac	34221675
dc.description.version	Postprint (author's final draft)
dc.relation.projectid	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/MTM2017-88142-P/ES/ESTRECHANDO LA BRECHA ENTRE LA ESTADISTICA Y LA CIENCIA DE DATOS/
dc.relation.projectid	info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-116294GB-I00/ES/ESTADISTICA AVANZADA Y CIENCIA DE DATOS: INTERPRETANDO MODELOS CAJA-NEGRA Y ANALIZANDO CONJUNTOS DE DATOS GRANDES Y COMPLEJOS/
local.citation.author	Delicado, P.; Peña, D.
local.citation.publicationName	Test
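
The procedure summarized in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a linear model for both the predictor and the ghost variables, fits each ghost variable on the test set, and reports the unnormalized mean squared change in predictions (the paper may scale this quantity differently). The helper names `fit_linear` and `ghost_relevance`, the simulated data, and the use of `np.cov` for the relevance matrix are all assumptions made for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ coef

def ghost_relevance(predict, X_test):
    """Per-variable relevance and the matrix of effects on predictions."""
    n, p = X_test.shape
    base = predict(X_test)                      # predictions on the original test set
    effects = np.empty((n, p))
    for j in range(p):
        others = np.delete(X_test, j, axis=1)
        # Ghost variable: prediction of variable j from the other variables
        # (assumption: linear ghost model, fitted on the test set).
        ghost = fit_linear(others, X_test[:, j])(others)
        X_ghost = X_test.copy()
        X_ghost[:, j] = ghost
        effects[:, j] = base - predict(X_ghost)  # change in predictions
    relevance = (effects ** 2).mean(axis=0)      # unnormalized (assumption)
    return relevance, effects

# Simulated data (hypothetical): x1 and x2 are correlated, x3 is irrelevant.
rng = np.random.default_rng(0)
n = 600
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 1.0 * x2 + 0.5 * rng.normal(size=n)

X_train, y_train = X[:300], y[:300]
X_test = X[300:]

predict = fit_linear(X_train, y_train)           # model fitted on the training set
relevance, effects = ghost_relevance(predict, X_test)

# Relevance matrix: covariance matrix of the vectors of effects.
rel_matrix = np.cov(effects, rowvar=False)

print("relevance by ghost variables:", np.round(relevance, 3))
print("relevance matrix:\n", np.round(rel_matrix, 3))
```

Because `ghost_relevance` only needs a prediction function, the same sketch would apply to a nonlinear model (for example a random forest) by passing its prediction function instead of the linear one; the ghost model could likewise be replaced by any regression method.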

Files in this item

Name: Relevance_matrix_TEST_authors_ ...
Size: 841.5 KB
Format: PDF
Description: Authors' version of the paper

Name: Relevance_matrix_TEST_Suppls.pdf
Size: 401.6 KB
Format: PDF
Description: Supplements

This item appears in the following collections

Journal articles [124]
Journal articles [719]

UPCommons. Open knowledge portal of the UPC