TweetNorm: a benchmark for lexical normalization of Spanish tweets

Alegria, Iñaki; Aranberri, Nora; Comas Umbert, Pere Ramon; Fresno, Víctor; Gamallo, Pablo; Padró, Lluís; San Vicente Roncal, Iñaki; Turmo Borras, Jorge; Zubiaga, Arkaitz

doi:10.1007/s10579-015-9315-6



Document type: Article

Publication date: 2015-12-01

Access conditions: Open access

Attribution-NonCommercial-NoDerivs 3.0 Spain

Except where otherwise noted, the contents of this work are licensed under a Creative Commons license: Attribution-NonCommercial-NoDerivs 3.0 Spain

Abstract

The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial to facilitate subsequent textual processing and, consequently, to help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of the systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that the submitted systems handled best, and point to the main shortcomings to be addressed in future work.

Citation: Alegria, I., Aranberri, N., Comas, P.R., Fresno, V., Gamallo, P., Padro, L., San Vicente, I., Turmo, J., Zubiaga, A. TweetNorm: a benchmark for lexical normalization of Spanish tweets. "Language resources and evaluation", 1 December 2015, vol. 49, no. 4, p. 883-905.

URI: http://hdl.handle.net/2117/80964

DOI: 10.1007/s10579-015-9315-6

ISSN: 1574-020X
