Human language through the lens of large word co-occurrence networks
dc.contributor | Ferrer Cancho, Ramon |
dc.contributor.author | Mosquera Dopico, Sergio |
dc.contributor.other | Universitat Politècnica de Catalunya. Departament de Ciències de la Computació |
dc.date.accessioned | 2021-02-23T09:30:08Z |
dc.date.available | 2021-02-23T09:30:08Z |
dc.date.issued | 2020-10-29 |
dc.identifier.uri | http://hdl.handle.net/2117/340341 |
dc.description.abstract | We review the article by Ferrer-i-Cancho & Solé (2001) on the word co-occurrence networks of the British National Corpus, improving and extending it in several directions. First, we replace the statistically flawed criterion for filtering out non-significant co-occurrences with a new method that combines Fisher's Exact Test with a Holm-Bonferroni correction to control for multiple comparisons. Second, we introduce new measures, such as closeness centrality to measure vertex-vertex distance, and we find that the significant co-occurrences reported by our method mostly involve words with larger centrality. We also study the degree mixing of the network using the mixing coefficient defined by Newman, to determine the disassortative nature of the words in the corpus. Third, we study common phenomena in word co-occurrence networks, such as the small-world condition and the emergence of Zipf's Law. We find that networks whose non-significant co-occurrences are filtered out lose the small-world condition, while non-filtered networks exhibit it; in both cases the degree distribution follows a double Zipf's Law, with a different exponent in each regime. Fourth, we investigate how randomizing the corpus and changing the significance level affect the properties of the network. We find that the original filtering technique fails to detect the lack of significance of word co-occurrences in a randomized corpus. We also discuss the significance level used for the filtering, since it affects the properties exhibited by the resulting filtered network. To conclude, we try to tie the results of our analysis to human language phenomena, such as the existence of two different registers in human language to achieve successful communication, as reported by Ferrer-i-Cancho. |
dc.language.iso | eng |
dc.publisher | Universitat Politècnica de Catalunya |
dc.subject | UPC subject areas::Computer science |
dc.subject.lcsh | Natural language processing (Computer science) |
dc.subject.other | Word Co-occurrence Networks |
dc.subject.other | Fisher's Exact Test |
dc.subject.other | Holm-Bonferroni Correction |
dc.subject.other | Small-World |
dc.subject.other | Zipf's Law |
dc.title | Human language through the lens of large word co-occurrence networks |
dc.type | UPC Master thesis |
dc.subject.lemac | Natural language processing (Computer science) |
dc.subject.lemac | Zipf's law |
dc.identifier.slug | 153051 |
dc.rights.access | Open Access |
dc.date.updated | 2020-11-05T05:00:49Z |
dc.audience.educationlevel | Master's |
dc.audience.mediator | Facultat d'Informàtica de Barcelona |
dc.audience.degree | Master's Degree in Innovation and Research in Informatics (2012 curriculum) |
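
The filtering step described in the abstract (Fisher's Exact Test followed by a Holm-Bonferroni correction) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the layout of the 2x2 contingency table, the one-sided alternative, and the example counts are all assumptions made here for illustration, since the record does not specify them.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Assumed (hypothetical) layout for a word pair (w1, w2):
      a = contexts containing both words, b = w1 without w2,
      c = w2 without w1,                  d = neither word.
    Returns P(X >= a) under the hypergeometric null of independence.
    """
    row1 = a + b          # contexts containing w1
    col1 = a + c          # contexts containing w2
    n = a + b + c + d     # all contexts
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans, True where the null hypothesis is rejected
    (here: where the co-occurrence would be kept as significant).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest p-value with alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is also retained
    return reject


# Hypothetical counts for three candidate word pairs (not from the corpus).
tables = [(30, 70, 50, 9850), (3, 97, 150, 9750), (12, 88, 40, 9860)]
p_values = [fisher_right_tail(*t) for t in tables]
keep_edge = holm_reject(p_values, alpha=0.05)
for t, p, keep in zip(tables, p_values, keep_edge):
    print(t, p, keep)
```

Holm's procedure is a step-down method: the smallest p-value is held to the strictest threshold, alpha / m, and the threshold relaxes as hypotheses are rejected. This makes it uniformly at least as powerful as a plain Bonferroni correction while still controlling the family-wise error rate.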