Human language through the lens of large word co-occurrence networks
Tutor / director: Ferrer Cancho, Ramon
Document type: UPC Master thesis
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder
We review the article by Ferrer-i-Cancho & Solé (2001) on the word co-occurrence networks of the British National Corpus, improving and extending it in several directions. First, we replace the statistically flawed criterion for filtering out non-significant co-occurrences with a new method that combines Fisher's exact test with a Holm-Bonferroni correction to control for multiple comparisons. Second, we introduce new measures, such as closeness centrality to capture vertex-vertex distance, and we find that the significant co-occurrences reported by our method mostly involve words with larger centrality. We also study the degree mixing of the network using the mixing coefficient defined by Newman to establish the disassortative nature of the words in the corpus. Third, we study common phenomena in word co-occurrence networks, such as the small-world property and the emergence of Zipf's law. We find that networks whose non-significant co-occurrences have been filtered out lose the small-world property, while unfiltered networks exhibit it; in both cases, the degree distribution follows a double Zipf's law, with a different exponent in each regime. Fourth, we investigate how randomizing the corpus and varying the significance level affect the properties of the network. We find that the original filtering technique fails to detect the lack of significance of word co-occurrences in a randomized corpus. We also discuss the significance level used for filtering, since it affects the properties of the resulting network. To conclude, we try to relate the results of our analysis to human language phenomena, such as the existence of two different registers in human language that make successful communication possible, as reported by Ferrer-i-Cancho.
Subjects: Natural language processing (Computer science), Zipf's law
Degree: MÀSTER UNIVERSITARI EN INNOVACIÓ I RECERCA EN INFORMÀTICA (Pla 2012)
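
The abstract names the filtering method (Fisher's exact test followed by a Holm-Bonferroni correction) but does not give its details. The following is a minimal sketch of how such a filter could look, not the thesis's implementation. It assumes a one-sided (right-tail) test on a 2x2 contingency table per word pair; the layout of the table, the sidedness of the test, the default `alpha`, and the function names `fisher_right_tail` and `holm_bonferroni` are all illustrative assumptions.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, not taken from the thesis):
      a = windows containing both words, b = first word only,
      c = second word only,             d = neither word.
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, row1)
    k_max = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, k_max + 1)
    )
    return tail / total


def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans, aligned with p_values, that marks the
    hypotheses rejected while controlling the family-wise error rate.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on; stop at the first failure.
        if p_values[i] > alpha / (m - rank):
            break
        reject[i] = True
    return reject


if __name__ == "__main__":
    # Toy contingency tables for three hypothetical word pairs.
    tables = [(12, 3, 4, 981), (3, 1, 1, 3), (1, 20, 30, 949)]
    p_values = [fisher_right_tail(*t) for t in tables]
    keep = holm_bonferroni(p_values, alpha=0.05)
    for table, p, k in zip(tables, p_values, keep):
        print(table, p, "significant" if k else "filtered out")
```

In a sketch like this, only the pairs marked as rejected would be kept as edges of the filtered network. Changing `alpha` changes which edges survive, which is consistent with the abstract's remark that the significance level affects the properties of the resulting network.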