Developing competitive HMM PoS taggers using small training corpora
Cite as:
hdl:2117/97920
Document type: Research report
Publication date: 2004-06
Access conditions: Open access
All rights reserved. This work is protected by the corresponding intellectual and
industrial property rights. Without prejudice to existing legal exemptions, its
reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.
Abstract
This paper presents a study that aims to find the best strategy for
developing a fast and accurate HMM tagger when only a limited amount of
training material is available. This is a crucial factor when dealing
with languages for which annotated material is not easily available.
First, we run experiments on English, using the WSJ corpus as a
test bench to establish the differences caused by using a large or
a small training set. Then, we port the results to develop an accurate
Spanish PoS tagger using a limited amount of training data.
Different configurations of an HMM tagger are studied: trigram and
4-gram models are tested, as well as different smoothing techniques.
The performance of each configuration is measured as a function of
the size of the training corpus, in order to determine the most
appropriate setting for developing HMM PoS taggers for languages
with little annotated corpus available.
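As an illustration of what smoothing a trigram tag model involves, here is a minimal sketch of linear interpolation over unigram, bigram and trigram tag estimates. The abstract does not name the smoothing techniques that were studied, so linear interpolation is an assumption chosen for illustration, and the names `train_counts` and `interpolated_prob`, the toy corpus, and the fixed weights are all hypothetical.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start padding symbol


def train_counts(tagged_sentences):
    """Count tag n-grams and their contexts from sentences of (word, tag) pairs."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()  # counts of one-tag and two-tag histories
    for sent in tagged_sentences:
        tags = [START, START] + [tag for _, tag in sent]
        for i in range(2, len(tags)):
            t1, t2, t3 = tags[i - 2], tags[i - 1], tags[i]
            uni[t3] += 1
            bi[(t2, t3)] += 1
            tri[(t1, t2, t3)] += 1
            ctx1[t2] += 1
            ctx2[(t1, t2)] += 1
    return uni, bi, tri, ctx1, ctx2


def interpolated_prob(t1, t2, t3, counts, lambdas=(0.1, 0.3, 0.6)):
    """Estimate P(t3 | t1, t2) as a weighted mix of 1-, 2- and 3-gram estimates.

    The weights are illustrative constants that sum to 1; in practice they
    would be tuned on held-out data.
    """
    uni, bi, tri, ctx1, ctx2 = counts
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    p_uni = uni[t3] / total if total else 0.0
    p_bi = bi[(t2, t3)] / ctx1[t2] if ctx1[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / ctx2[(t1, t2)] if ctx2[(t1, t2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Toy training data, invented for this example only.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN")],
]
counts = train_counts(corpus)
seen = interpolated_prob(START, "DT", "NN", counts)
unseen = interpolated_prob("NN", "VB", "DT", counts)
print(seen, unseen)
```

With a small training set many trigrams are never observed, so the lower-order terms keep the estimate above zero for transitions such as the `unseen` one here.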
Citation: Padro, M., Padro, L. "Developing competitive HMM PoS taggers using small training corpora". 2004.
Part of: LSI-04-36-R
Files | Description | Size | Format |
---|---|---|---|
LSI-04-36-R.pdf | | 418.8 KB | PDF |