Linguistic knowledge-based vocabularies for Neural Machine Translation

Casas Manzanares, Noé; Ruiz Costa-Jussà, Marta; Rodríguez Fonollosa, José Adrián; Alonso, Juan; Fanlo, Ramon

doi:10.1017/S1351324920000364

Document type: Article

Publication date: 2020

Publisher: Cambridge University Press

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.

Project: TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO (MINECO-TEC2015-69266-P)
Project: AUTONOMOUS LIFELONG LEARNING INTELLIGENT SYSTEMS (AEI-PCIN-2017-079)

Abstract

Neural networks applied to Machine Translation need a finite vocabulary to express textual information as a sequence of discrete tokens. The currently dominant subword vocabularies exploit statistically discovered common parts of words to achieve the flexibility of character-based vocabularies without delegating the whole learning of word formation to the neural network. However, they trade this for the inability to apply word-level token associations, which limits their use in semantically rich areas, prevents some transfer learning approaches (e.g., cross-lingual pretrained embeddings), and reduces their interpretability. In this work, we propose new hybrid, linguistically grounded vocabulary definition strategies that keep both the advantages of subword vocabularies and the word-level associations, enabling neural networks to profit from the derived benefits. We test the proposed approaches in both morphologically rich and morphologically poor languages, showing that, for the former, the translation quality of out-of-domain texts improves with respect to a strong subword baseline.
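The trade-off described in the abstract can be illustrated with a small sketch. It is not the method proposed in the article: the greedy longest-match segmenter, the toy subword inventory `SUBWORDS`, and the toy word-level table `WORD_VECTORS` are all hypothetical and chosen only to show why word-level resources stop applying once words are split into statistically discovered pieces.

```python
# Illustrative only: a toy subword segmenter, NOT the article's approach.
# SUBWORDS and WORD_VECTORS are hypothetical, hand-picked examples.

SUBWORDS = {"un", "translat", "able", "s"}

# Stand-in for a word-level resource such as pretrained word embeddings.
WORD_VECTORS = {
    "untranslatable": [0.1, 0.7],
    "translate": [0.3, 0.2],
}


def segment(word, subwords):
    """Greedy longest-match split of a word into known subwords.

    Unknown stretches fall back to single characters, so every word
    can be encoded with a finite vocabulary.
    """
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first; a single character always matches.
        for end in range(len(word), start, -1):
            if word[start:end] in subwords or end == start + 1:
                pieces.append(word[start:end])
                break
        start = end
    return pieces


if __name__ == "__main__":
    word = "untranslatable"
    pieces = segment(word, SUBWORDS)
    print(pieces)
    # The whole word has a word-level entry, but none of its pieces do,
    # so the word-level association is lost after segmentation.
    print(word in WORD_VECTORS)
    print([p for p in pieces if p in WORD_VECTORS])
```

In this toy setting the full word has an entry in the word-level table while none of its subword pieces do, which is the kind of lost association that the hybrid vocabularies described in the abstract aim to preserve.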

Description

This article has been published in a revised form in Natural Language Engineering https://doi.org/10.1017/S1351324920000364. This version is free to view and download for private research and study only. Not for re-distribution, re-sale or use in derivative works. © Cambridge University Press

Citation: Casas, N. [et al.]. Linguistic knowledge-based vocabularies for Neural Machine Translation. "Natural language engineering", 2020, p. 1-22.

URI: http://hdl.handle.net/2117/330835

DOI: 10.1017/S1351324920000364

ISSN: 1469-8110

Publisher version: https://www.cambridge.org/core/journals/natural-language-engineering/article/linguistic-knowledgebased-vocabularies-for-neural-machine-translation/C1FAB80C1D6ADCD252EB627BA3B4082B

Files	Description	Size	Format	View
2020nle_linguistic_vocabs.pdf		884.5 KB	PDF	View/Open

UPCommons. Open knowledge portal of the UPC
