Pre-trained biomedical language models for clinical NLP in Spanish
10.18653/v1/2022.bionlp-1.19
Includes usage data since 2022
Cite as: hdl:2117/374590
Document type: Conference paper
Publication date: 2022-05
Publisher: Association for Computational Linguistics
Access conditions: Open access
Unless otherwise indicated, the contents of this work are subject to a Creative Commons license: Attribution 4.0 International.
Abstract
This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora totalling 1.1B tokens and an EHR corpus of 95M tokens. We compared them against general-domain and other domain-specific models for Spanish on three clinical NER tasks. Our models outperformed all competitors across the NER tasks, making them more suitable for clinical NLP applications. Furthermore, our findings indicate that, when enough data is available, pre-training from scratch outperforms continual pre-training on clinical tasks, raising an open research question about which approach is optimal in general. Our models and fine-tuning scripts are publicly available on HuggingFace and GitHub.
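For readers who want to try the released models, the sketch below shows one way to load a checkpoint from the Hugging Face Hub with the `transformers` library and probe it with the fill-mask task that these RoBERTa-style masked language models support out of the box. The model identifier and the example sentence are illustrative assumptions, not taken from the paper; the exact repository names are listed on the authors' HuggingFace page.

```python
# Minimal sketch: loading one of the biomedical Spanish models and probing it.
# MODEL_ID is assumed from the public PlanTL release and may differ from the
# checkpoints described in the paper; verify against the authors' HF page.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

MODEL_ID = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# The released models are masked LMs, so fill-mask is the simplest probe;
# for the paper's clinical NER setting one would instead fine-tune via
# AutoModelForTokenClassification on the labeled NER corpora.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("El paciente presenta dolor <mask> agudo."):
    print(prediction["token_str"], round(prediction["score"], 3))
```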
Citation: Pio Carrino, C. [et al.]. Pre-trained biomedical language models for clinical NLP in Spanish. In: Workshop on Biomedical Language Processing. "Proceedings of the 21st Workshop on Biomedical Language Processing: Dublin, Ireland, 26 May 2022". Association for Computational Linguistics, 2022, pp. 193-199. DOI 10.18653/v1/2022.bionlp-1.19.
Publisher's version: https://aclanthology.org/2022.bionlp-1.19
Files | Size | Format
---|---|---
2022.bionlp-1.19.pdf | 253.0 KB | PDF