A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles
Document type: Conference report
Rights access: Open Access
European Commission's project: WIQ-EI - Web Information Quality Evaluation Initiative (EC-FP7-269180)
Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and Cross-Language Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language links. However, the extent to which these articles are similar is highly variable, and this may affect the use of Wikipedia as a comparable resource. In this paper we compare various language-independent methods for measuring cross-lingual similarity: character n-grams, cognateness, word count ratio, and an approach based on outlinks. These approaches are compared against a baseline utilising MT resources. Measures are also compared to human judgements of similarity using a manually created resource containing 700 pairs of Wikipedia articles (in 7 language pairs). Results indicate that a combination of language-independent models (character n-grams, outlinks and word count ratio) is highly effective for identifying cross-lingual similarity and performs comparably to language-dependent models (translation and monolingual analysis).
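To make two of the language-independent measures named in the abstract more concrete, the following is a minimal Python sketch of a character n-gram cosine similarity and a word count ratio. It is an illustration only, not the implementation used in the paper: the n-gram size of 3, the lowercasing and whitespace normalisation, the min/max form of the ratio, and all function names are assumptions made here.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (n=3 is an assumed default)."""
    # Assumed normalisation: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to yield any n-gram
    return dot / (norm_a * norm_b)


def word_count_ratio(text_a: str, text_b: str) -> float:
    """Shorter length divided by longer length, in whitespace-separated tokens."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    if max(len_a, len_b) == 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


if __name__ == "__main__":
    # Hypothetical sentence pair, not taken from the evaluation resource.
    en = "The university was founded in 1971."
    es = "La universidad fue fundada en 1971."
    print(cosine(char_ngrams(en), char_ngrams(es)))
    print(word_count_ratio(en, es))
```

Both functions return a value in [0, 1] and need no translation resources, which is the property the abstract attributes to the language-independent models. How such scores would be combined with an outlink-based measure is not specified in this record and is left out of the sketch.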
Citation: Barron-Cedeño, A. [et al.]. A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles. A: European Conference on Information Retrieval. "Advances in information retrieval: 36th European Conference on IR Research, ECIR 2014, Amsterdam, The Netherlands, April 13-16, 2014: proceedings". Amsterdam: Springer, 2014, p. 424-429.