Word-sense disambiguated multilingual Wikipedia corpus

Reese, Samuel; Boleda Torrent, Gemma; Cuadros Oller, Montserrat; Padró, Lluís; Rigau Claramunt, German

dc.contributor.author	Reese, Samuel
dc.contributor.author	Boleda Torrent, Gemma
dc.contributor.author	Cuadros Oller, Montserrat
dc.contributor.author	Padró, Lluís
dc.contributor.author	Rigau Claramunt, German
dc.contributor.other	Universitat Politècnica de Catalunya. Departament de Llenguatges i Sistemes Informàtics
dc.date.accessioned	2010-06-07T11:47:22Z
dc.date.available	2010-06-07T11:47:22Z
dc.date.created	2010-05
dc.date.issued	2010-05
dc.identifier.citation	Reese, S. [et al.]. Word-sense disambiguated multilingual Wikipedia corpus. In: International Conference on Language Resources and Evaluation. "7th International Conference on Language Resources and Evaluation". La Valetta: 2010.
dc.identifier.uri	http://hdl.handle.net/2117/7551
dc.description.abstract	This article presents a new, freely available trilingual corpus (Catalan, Spanish, English) that contains large portions of Wikipedia and has been automatically enriched with linguistic information. To our knowledge, this is the largest such corpus that is freely available to the community: in its present version, it contains over 750 million words. The corpora have been annotated with lemma and part-of-speech information using the open source library FreeLing. They have also been sense-annotated with the state-of-the-art Word Sense Disambiguation algorithm UKB. As UKB assigns WordNet senses, and WordNet has been aligned across languages via the InterLingual Index, this sort of annotation opens the way to massive explorations in lexical semantics that were not possible before. We present a first attempt at creating a trilingual lexical resource from the sense-tagged Wikipedia corpora, namely, WikiNet. Moreover, we present two by-products of the project that are of use for the NLP community: an open source Java-based parser for Wikipedia pages developed for the construction of the corpus, and the integration of the WSD algorithm UKB in FreeLing.
dc.format.extent	1 p.
dc.language.iso	eng
dc.subject.lcsh	Natural language processing (Computer science)
dc.subject.lcsh	Wikipedia
dc.title	Word-sense disambiguated multilingual Wikipedia corpus
dc.type	Conference report
dc.subject.lemac	Processament de la parla
dc.subject.lemac	Wikipedia
dc.contributor.group	Universitat Politècnica de Catalunya. GPLN - Grup de Processament del Llenguatge Natural
dc.description.peerreviewed	Peer Reviewed
dc.relation.publisherversion	http://www.lrec-conf.org/proceedings/lrec2010/pdf/222_Paper.pdf
dc.rights.access	Open Access
local.identifier.drac	2544129
dc.description.version	Postprint (published version)
local.citation.author	Reese, S.; Boleda, G.; Cuadros, M.; Padró, L.; Rigau, G.
local.citation.contributor	International Conference on Language Resources and Evaluation
local.citation.pubplace	La Valetta
local.citation.publicationName	7th International Conference on Language Resources and Evaluation

Files in this item

Name: 222_Paper.pdf
Size: 378.6 KB
Format: PDF

This item appears in the following collections

Conference papers/presentations [192]
Conference papers/presentations [1,274]

UPCommons. UPC open knowledge portal