Universitat Politècnica de Catalunya

UPCommons. Global access to UPC knowledge

A factory of comparable corpora from Wikipedia

BUCC15Barronetal.pdf (244,4Kb)
Cite as: hdl:2117/76611

Barrón-Cedeño, Alberto
España Bonet, Cristina
Boldoba Trapote, Josu
Márquez Villodre, Luís
Document type: Conference report
Defense date: 2015
Publisher: Association for Computational Linguistics
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder
Abstract
Multiple approaches to gathering comparable data from the Web have been developed to date. Nevertheless, producing a high-quality comparable corpus on a specific topic is not straightforward. We present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. To demonstrate the value of the model, we automatically extract parallel sentences from the comparable collections and use them to train statistical machine translation engines for specific domains. Our experiments on the English–Spanish pair in the domains of Computer Science, Science, and Sports show that our in-domain translator performs significantly better than a generic one when translating in-domain Wikipedia articles. Moreover, we show that these corpora can help when translating out-of-domain texts.
Citation: Barron-Cedeño, A., España-Bonet, C., Boldoba, J., Márquez, L. A factory of comparable corpora from Wikipedia. A: Workshop on Building and Using Comparable Corpora. "Proceedings of the Eighth Workshop on Building and Using Comparable Corpora". Beijing: Association for Computational Linguistics, 2015, p. 3-13.
URI: http://hdl.handle.net/2117/76611
ISBN: 978-1-941643-60-0
Publisher version: http://aclweb.org/anthology/W/W15/W15-3402.pdf
Collections
  • GPLN - Grup de Processament del Llenguatge Natural - Ponències/Comunicacions de congressos [192]
  • Departament de Ciències de la Computació - Ponències/Comunicacions de congressos [1.231]

© UPC. Servei de Biblioteques, Publicacions i Arxius
