A factory of comparable corpora from Wikipedia

Barrón-Cedeño, Alberto; España Bonet, Cristina; Boldoba Trapote, Josu; Márquez Villodre, Luís

Document type: Text in conference proceedings

Publication date: 2015

Publisher: Association for Computational Linguistics

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication, or transformation is prohibited without the authorization of the rights holder.

Abstract

Multiple approaches for harvesting comparable data from the Web have been developed to date. Nevertheless, building a high-quality comparable corpus on a specific topic is not straightforward. We present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. To demonstrate the value of the model, we automatically extract parallel sentences from the comparable collections and use them to train statistical machine translation engines for specific domains. Our experiments on the English–Spanish pair in the domains of Computer Science, Science, and Sports show that our in-domain translator performs significantly better than a generic one when translating in-domain Wikipedia articles. Moreover, we show that these corpora can help when translating out-of-domain texts.

Citation: Barron-Cedeño, A., España-Bonet, C., Boldoba, J., Márquez, L. A factory of comparable corpora from Wikipedia. In: Workshop on Building and Using Comparable Corpora. "Proceedings of the Eighth Workshop on Building and Using Comparable Corpora". Beijing: Association for Computational Linguistics, 2015, p. 3-13.

URI: http://hdl.handle.net/2117/76611

ISBN: 978-1-941643-60-0

Publisher's version: http://aclweb.org/anthology/W/W15/W15-3402.pdf

Files	Description	Size	Format
BUCC15Barronetal.pdf		244.4 KB	PDF
