dc.contributor.author: Sánchez-Marco, Cristina
dc.contributor.author: Boleda Torrent, Gemma
dc.contributor.author: Fontana, Josep Maria
dc.contributor.author: Domingo, Judith
dc.contributor.other: Universitat Politècnica de Catalunya. Departament de Llenguatges i Sistemes Informàtics
dc.date.accessioned: 2010-11-23T09:50:14Z
dc.date.available: 2010-11-23T09:50:14Z
dc.date.created: 2010
dc.date.issued: 2010
dc.identifier.citation: Sánchez-Marco, C. [et al.]. Annotation and representation of a diachronic corpus of Spanish. In: International Conference on Language Resources and Evaluation. "International Conference on Language Resources and Evaluation (LREC 2010)". 2010.
dc.identifier.isbn: 2-9517408-6-7
dc.identifier.uri: http://hdl.handle.net/2117/10373
dc.description.abstract: In this article we describe two different strategies for the automatic tagging of a Spanish diachronic corpus, both involving the adaptation of existing NLP tools developed for modern Spanish. In the initial approach we follow a state-of-the-art strategy, which consists of standardizing the spelling and the lexicon. This approach boosts POS-tagging accuracy to 90%, which represents a raw improvement of over 20% with respect to the results obtained without any pre-processing. To enable users who are not NLP experts to use this new resource, the corpus has been integrated into IAC (Corpora Interface Access). We discuss the shortcomings of the initial approach and propose a new one, which consists not in adapting the source texts to the tagger, but rather in modifying the tagger to treat the old variants directly. This second strategy addresses some important shortcomings of the previous approach and is likely to be useful not only in the creation of diachronic linguistic resources but also for the treatment of dialectal or non-standard variants of synchronic languages.
dc.format.extent: 1 p.
dc.language.iso: eng
dc.subject: UPC subject areas::Telecommunications engineering::Signal processing::Speech and acoustic signal processing
dc.subject: UPC subject areas::Computer science::Artificial intelligence::Natural language
dc.subject.lcsh: Computational linguistics -- Research
dc.subject.lcsh: Natural language processing (Computer science)
dc.subject.lcsh: Spanish language -- Research
dc.title: Annotation and representation of a diachronic corpus of Spanish
dc.type: Conference report
dc.subject.lemac: Computational linguistics
dc.subject.lemac: Corpora (Linguistics)
dc.subject.lemac: Spanish language -- Lexicography
dc.contributor.group: Universitat Politècnica de Catalunya. GPLN - Grup de Processament del Llenguatge Natural
dc.description.peerreviewed: Peer Reviewed
dc.rights.access: Open Access
local.identifier.drac: 3259538
dc.description.version: Postprint (published version)
local.citation.author: Sánchez-Marco, C.; Boleda, G.; Fontana, J.M.; Domingo, J.
local.citation.contributor: International Conference on Language Resources and Evaluation
local.citation.publicationName: International Conference on Language Resources and Evaluation (LREC 2010)

