Automatic normalization of short texts by combining statistical and rule-based techniques

Ruiz Costa-Jussà, Marta; Banchs, Rafael E.

doi:10.1007/s10579-012-9187-y

lreshorttext_posprint.pdf (266,6Kb)

Document type: Article

Publication date: 2013-03-01

Access conditions: Open access

Attribution-NonCommercial-NoDerivs 3.0 Spain

Except where otherwise noted, the contents of this work are licensed under a Creative Commons license: Attribution-NonCommercial-NoDerivs 3.0 Spain

Project: T4ME NET - Technologies for the Multilingual European Information Society (EC-FP7-249119)

Abstract

Short texts are typically composed of a small number of words, most of which are abbreviations, typos and other kinds of noise. This makes the noise-to-signal ratio relatively high for this specific category of text. A high proportion of noise in the data is undesirable for analysis procedures as well as machine learning applications. Text normalization techniques are used to reduce the noise and improve the quality of text for processing and analysis purposes. In this work, we propose a combination of statistical and rule-based techniques to normalize short texts. More specifically, we focus our attention on SMS messages. We base our normalization approach on a statistical machine translation system which translates from noisy data to clean data. This system is trained on a small manually annotated set. Then, we study several automatic methods to extract more general rules from the normalizations generated with the statistical machine translation system. We illustrate the proposed methodology by conducting some experiments with an SMS Haitian-Créole data collection. In order to evaluate the performance of our methodology we use several Haitian-Créole dictionaries, the well-known perplexity criterion and the achieved reduction of vocabulary.
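
The three evaluation measures named in the abstract (dictionary lookup, perplexity and vocabulary reduction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the add-one-smoothed unigram model used for perplexity and the toy English sentences are all assumptions made for illustration, and the article itself may use a different language model and different data.

```python
import math
from collections import Counter


def vocabulary_reduction(noisy_tokens, normalized_tokens):
    """Relative drop in the number of distinct word types after normalization."""
    before = len(set(noisy_tokens))
    after = len(set(normalized_tokens))
    return (before - after) / before


def dictionary_coverage(tokens, dictionary):
    """Fraction of running tokens that appear in a reference word list."""
    return sum(1 for t in tokens if t in dictionary) / len(tokens)


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed unigram model.

    Assumption: a unigram model is used only to keep the sketch short.
    """
    counts = Counter(train_tokens)
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    total = len(train_tokens)
    log_prob = 0.0
    for t in test_tokens:
        p = (counts[t] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))


# Toy data, invented for illustration only (not from the article's corpus).
noisy = "pls send watr 2 the camp plz send water to camp".split()
clean = "please send water to the camp please send water to camp".split()
reference = "please send water and food to the camp please send help".split()
dictionary = {"please", "send", "water", "to", "the", "camp"}

print("vocabulary reduction:", vocabulary_reduction(noisy, clean))
print("coverage before:", dictionary_coverage(noisy, dictionary))
print("coverage after:", dictionary_coverage(clean, dictionary))
print("perplexity before:", unigram_perplexity(reference, noisy))
print("perplexity after:", unigram_perplexity(reference, clean))
```

In this toy setting a successful normalization shows up as higher dictionary coverage, lower perplexity against clean reference text and a smaller vocabulary.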

Description

The final publication is available at link.springer.com http://link.springer.com/article/10.1007%2Fs10579-012-9187-y#page-1

Citation: Ruiz, M., Banchs, R. Automatic normalization of short texts by combining statistical and rule-based techniques. "Language resources and evaluation", 1 March 2013, vol. 47, p. 179-193.

URI: http://hdl.handle.net/2117/102182

DOI: 10.1007/s10579-012-9187-y

ISSN: 1574-020X

Publisher version: http://link.springer.com/article/10.1007%2Fs10579-012-9187-y#page-1
