dc.contributor.author | Barrón-Cedeño, Alberto |
dc.contributor.author | España Bonet, Cristina |
dc.contributor.author | Boldoba Trapote, Josu |
dc.contributor.author | Márquez Villodre, Luís |
dc.contributor.other | Universitat Politècnica de Catalunya. Departament de Ciències de la Computació |
dc.date.accessioned | 2015-09-04T08:34:56Z |
dc.date.available | 2015-09-04T08:34:56Z |
dc.date.issued | 2015 |
dc.identifier.citation | Barrón-Cedeño, A., España-Bonet, C., Boldoba, J., Márquez, L. A factory of comparable corpora from Wikipedia. A: Workshop on Building and Using Comparable Corpora. "Proceedings of the Eighth Workshop on Building and Using Comparable Corpora". Beijing: Association for Computational Linguistics, 2015, p. 3-13. |
dc.identifier.isbn | 978-1-941643-60-0 |
dc.identifier.uri | http://hdl.handle.net/2117/76611 |
dc.description.abstract | Multiple approaches to gathering comparable data from the Web have been developed to date. Nevertheless, producing a high-quality comparable corpus on a specific topic is not straightforward.
We present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. To demonstrate the value of the model, we automatically extract parallel sentences from the comparable collections and use them to train statistical machine translation engines for specific domains. Our experiments on the English–Spanish pair in the domains of Computer Science, Science, and Sports show that our in-domain translator performs significantly better than a generic one when translating in-domain Wikipedia articles.
Moreover, we show that these corpora can help when translating out-of-domain texts. |
dc.format.extent | 11 p. |
dc.language.iso | eng |
dc.publisher | Association for Computational Linguistics |
dc.subject | Àrees temàtiques de la UPC::Informàtica |
dc.subject.lcsh | Computational linguistics |
dc.subject.lcsh | Wikipedia |
dc.subject.other | comparable corpora |
dc.subject.other | Wikipedia |
dc.subject.other | multilingual |
dc.subject.other | parallel corpora |
dc.subject.other | translation |
dc.title | A factory of comparable corpora from Wikipedia |
dc.type | Conference report |
dc.subject.lemac | Lingüística computacional -- Metodologia |
dc.contributor.group | Universitat Politècnica de Catalunya. GPLN - Grup de Processament del Llenguatge Natural |
dc.description.peerreviewed | Peer Reviewed |
dc.relation.publisherversion | http://aclweb.org/anthology/W/W15/W15-3402.pdf |
dc.rights.access | Open Access |
local.identifier.drac | 16835606 |
dc.description.version | Postprint (published version) |
local.citation.author | Barrón-Cedeño, A.; España-Bonet, C.; Boldoba, J.; Márquez, L. |
local.citation.contributor | Workshop on Building and Using Comparable Corpora |
local.citation.pubplace | Beijing |
local.citation.publicationName | Proceedings of the Eighth Workshop on Building and Using Comparable Corpora |
local.citation.startingPage | 3 |
local.citation.endingPage | 13 |