Developing competitive HMM PoS taggers using small training corpora

Padró Cirera, Montserrat; Padró, Lluís


LSI-04-36-R.pdf (418.8 KB)


Document type: Research report

Publication date: 2004-06

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.

Abstract

This paper presents a study aimed at finding the best strategy for developing a fast and accurate HMM tagger when only a limited amount of training material is available. This is a crucial factor when dealing with languages for which annotated material is not easily available. First, we run experiments in English, using the WSJ corpus as a testbed to establish the differences caused by using a large or a small training set. Then, we port the results to develop an accurate Spanish PoS tagger using a limited amount of training data. Different configurations of an HMM tagger are studied: trigram and 4-gram models are tested, as well as different smoothing techniques. The performance of each configuration is measured as a function of training corpus size in order to determine the most appropriate setting for developing HMM PoS taggers for languages with few annotated corpora available.

Citation: Padro, M., Padro, L. "Developing competitive HMM PoS taggers using small training corpora". 2004.

Part of: LSI-04-36-R

URI: http://hdl.handle.net/2117/97920


Files	Description	Size	Format
LSI-04-36-R.pdf		418.8 KB	PDF

UPCommons. Open knowledge portal of the UPC
