dc.contributor.author: Medina Herrera, Salvador
dc.contributor.author: Turmo Borras, Jorge
dc.contributor.other: Universitat Politècnica de Catalunya. Departament de Ciències de la Computació
dc.date.accessioned: 2018-11-20T10:58:12Z
dc.date.available: 2018-11-20T10:58:12Z
dc.date.issued: 2018
dc.identifier.citation: Medina, S., Turmo, J. Building a Spanish/Catalan health records corpus with very sparse protected information labelled. A: International Conference on Language Resources and Evaluation. "LREC 2018: Workshop MultilingualBIO: Multilingual Biomedical Text Processing: proceedings". 2018, p. 1-7.
dc.identifier.isbn: 979-10-95546-03-0
dc.identifier.uri: http://hdl.handle.net/2117/124710
dc.description.abstract: Electronic Health Records (EHR) are an important resource for the research and study of diseases, treatments and symptoms. However, due to data protection laws, information that could potentially compromise privacy must be anonymized before the records are used. Thus, identifying these pieces of information is mandatory. This identification is usually performed by linguistic models built from EHR corpora in which Protected Health Information (PHI) has been previously annotated. Nevertheless, two main drawbacks can occur. First, the annotated corpora required to build the models for a particular language may not exist. Second, unannotated corpora might exist for that language but contain very few words related to PHI mentions (i.e., a very sparse population). In this situation, manually annotating EHRs becomes extremely hard and costly, as PHI occurs in very few EHRs. This paper proposes an iterative method for building a corpus with labelled PHI from a large unlabelled corpus with a very sparse population of target PHI. The method makes use of manually defined rules specified in the form of Augmented Transition Networks, and tries to minimize the search for EHRs containing PHI, thus minimizing the cost of manually annotating very sparse EHR corpora. We use the method with primary care EHRs written in Spanish and Catalan, although it is language-independent and could be applied to EHRs written in other languages. Direct and indirect evaluations performed on the resulting labelled corpus show the appropriateness of our method.
dc.format.extent: 7 p.
dc.language.iso: eng
dc.rights: Attribution-NonCommercial 3.0 Spain
dc.rights.uri: http://creativecommons.org/licenses/by-nc/3.0/es/
dc.subject: Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial
dc.subject.lcsh: Medical records--Data processing
dc.subject.other: Anonymization
dc.subject.other: Health Records
dc.subject.other: Sparse
dc.title: Building a Spanish/Catalan health records corpus with very sparse protected information labelled
dc.type: Conference lecture
dc.subject.lemac: Històries clíniques--Informàtica
dc.contributor.group: Universitat Politècnica de Catalunya. GPLN - Grup de Processament del Llenguatge Natural
dc.description.peerreviewed: Peer Reviewed
dc.relation.publisherversion: http://www.elra.info/en/
dc.rights.access: Open Access
local.identifier.drac: 23524989
dc.description.version: Postprint (published version)
local.citation.author: Medina, S.; Turmo, J.
local.citation.contributor: International Conference on Language Resources and Evaluation
local.citation.publicationName: LREC 2018: Workshop MultilingualBIO: Multilingual Biomedical Text Processing: proceedings
local.citation.startingPage: 1
local.citation.endingPage: 7