Building a Spanish/Catalan health records corpus with very sparse protected information labelled

Medina Herrera, Salvador; Turmo Borras, Jorge

dc.contributor.author	Medina Herrera, Salvador
dc.contributor.author	Turmo Borras, Jorge
dc.contributor.other	Universitat Politècnica de Catalunya. Departament de Ciències de la Computació
dc.date.accessioned	2018-11-20T10:58:12Z
dc.date.available	2018-11-20T10:58:12Z
dc.date.issued	2018
dc.identifier.citation	Medina, S., Turmo, J. Building a Spanish/Catalan health records corpus with very sparse protected information labelled. In: International Conference on Language Resources and Evaluation. "LREC 2018: Workshop MultilingualBIO: Multilingual Biomedical Text Processing: proceedings". 2018, p. 1-7.
dc.identifier.isbn	979-10-95546-03-0
dc.identifier.uri	http://hdl.handle.net/2117/124710
dc.description.abstract	Electronic Health Records (EHRs) are an important resource for the research and study of diseases, treatments and symptoms. However, due to data protection laws, information that could compromise privacy must be anonymized before the records are used. Identifying these pieces of information is therefore mandatory. This identification is usually performed by linguistic models built from EHR corpora in which Protected Health Information (PHI) has been previously annotated. Nevertheless, two main drawbacks can occur. First, the annotated corpora required to build the models for a particular language may not exist. Second, unannotated corpora might exist for that language but contain very few words related to PHI mentions (i.e., a very sparse population). In this situation, manually annotating EHRs becomes extremely hard and costly, as PHI occurs in very few EHRs. This paper proposes an iterative method for building a corpus with labelled PHI from a large unlabelled corpus with a very sparse population of target PHI. The method makes use of manually defined rules specified in the form of Augmented Transition Networks, and tries to minimize the search for EHRs containing PHI, thus minimizing the cost of manually annotating very sparse EHR corpora. We use the method with primary care EHRs written in Spanish and Catalan, although it is language-independent and could be applied to EHRs written in other languages. Direct and indirect evaluations performed on the resulting labelled corpus show the appropriateness of our method.
dc.format.extent	7 p.
dc.language.iso	eng
dc.rights	Attribution-NonCommercial 3.0 Spain
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/es/
dc.subject	Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial
dc.subject.lcsh	Medical records--Data processing
dc.subject.other	Anonymization
dc.subject.other	Health Records
dc.subject.other	Sparse
dc.title	Building a Spanish/Catalan health records corpus with very sparse protected information labelled
dc.type	Conference lecture
dc.subject.lemac	Històries clíniques--Informàtica
dc.contributor.group	Universitat Politècnica de Catalunya. GPLN - Grup de Processament del Llenguatge Natural
dc.description.peerreviewed	Peer Reviewed
dc.relation.publisherversion	http://www.elra.info/en/
dc.rights.access	Open Access
local.identifier.drac	23524989
dc.description.version	Postprint (published version)
local.citation.author	Medina, S.; Turmo, J.
local.citation.contributor	International Conference on Language Resources and Evaluation
local.citation.publicationName	LREC 2018: Workshop MultilingualBIO: Multilingual Biomedical Text Processing: proceedings
local.citation.startingPage	1
local.citation.endingPage	7

Files in this item

Name: 6_W3(1).pdf
Size: 227.8 KB
Format: PDF
Description: Main article

This item appears in the following collections

Conference papers/presentations [192]
Conference papers/presentations [1,274]

UPCommons. UPC's open knowledge portal