
dc.contributor: Ferrer Cancho, Ramon
dc.contributor.author: Mosquera Dopico, Sergio
dc.contributor.other: Universitat Politècnica de Catalunya. Departament de Ciències de la Computació
dc.date.accessioned: 2021-02-23T09:30:08Z
dc.date.available: 2021-02-23T09:30:08Z
dc.date.issued: 2020-10-29
dc.identifier.uri: http://hdl.handle.net/2117/340341
dc.description.abstract: We review the article by Ferrer-i-Cancho & Solé (2001) on the word co-occurrence networks of the British National Corpus, improving and extending it in several directions. First, we replace the statistically flawed criterion for filtering out non-significant co-occurrences with a new method that combines Fisher's Exact Test with a Holm-Bonferroni correction to control for multiple comparisons. Second, we introduce new measures, such as closeness centrality, to measure vertex-vertex distance, and we find that the significant co-occurrences reported by our method mostly involve words with larger centrality. We also study the degree mixing of the network using the mixing coefficient defined by Newman to determine the disassortative nature of the words in the corpus. Third, we study phenomena common to word co-occurrence networks, such as the small-world property and the emergence of Zipf's Law. We find that networks whose non-significant co-occurrences are filtered out lose the small-world property, whereas unfiltered networks exhibit it; in both cases, we find a double Zipf's Law, with different exponents for the regimes of the degree distribution. Fourth, we investigate how randomizing the corpus and varying the significance level affect the properties of the network. We find that the original filtering technique fails to detect the lack of significance of word co-occurrences in a randomized corpus. We also discuss the significance level used for filtering, since it affects the properties of the resulting filtered network. To conclude, we try to relate the results of our analysis to human language phenomena, such as the existence of two different registries in human language to achieve successful communication, as reported by Ferrer-i-Cancho.
dc.language.iso: eng
dc.publisher: Universitat Politècnica de Catalunya
dc.subject: Àrees temàtiques de la UPC::Informàtica
dc.subject.lcsh: Natural language processing (Computer science)
dc.subject.other: Xarxes de coocurrencias de paraules
dc.subject.other: Test exacte de Fisher
dc.subject.other: Correció de Holm-Bonferroni
dc.subject.other: Small-World
dc.subject.other: Llei de Zipf
dc.subject.other: Word Co-occurrence Networks
dc.subject.other: Fisher's Exact Test
dc.subject.other: Zipf's Law
dc.title: Human language through the lens of large word co-occurrence networks
dc.type: UPC Master thesis
dc.subject.lemac: Tractament del llenguatge natural (Informàtica)
dc.subject.lemac: Zipf's, Llei de
dc.identifier.slug: 153051
dc.rights.access: Open Access
dc.date.updated: 2020-11-05T05:00:49Z
dc.audience.educationlevel: Màster
dc.audience.mediator: Facultat d'Informàtica de Barcelona
dc.audience.degree: MÀSTER UNIVERSITARI EN INNOVACIÓ I RECERCA EN INFORMÀTICA (Pla 2012)
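
The filtering step named in the abstract (Fisher's Exact Test followed by a Holm-Bonferroni correction) can be illustrated with a minimal sketch. It is not the thesis's implementation: the 2x2 table layout, the one-sided alternative, the function names `fisher_right_tail` and `holm_bonferroni`, and the default `alpha` are all assumptions made for illustration only.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, not taken from the thesis):
      a = contexts containing both words, b = first word only,
      c = second word only,              d = neither word.
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # contexts containing the first word
    col1 = a + c          # contexts containing the second word
    n = a + b + c + d     # all contexts
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def holm_bonferroni(pvalues, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans aligned with `pvalues`; True marks a
    hypothesis rejected while controlling the family-wise error rate.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared to alpha / (m - k + 1).
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first hypothesis that cannot be rejected
    return reject


if __name__ == "__main__":
    # Made-up counts for three candidate word pairs, for illustration only.
    tables = [(30, 70, 20, 880), (5, 95, 45, 855), (12, 88, 40, 860)]
    pvals = [fisher_right_tail(*t) for t in tables]
    print(list(zip(pvals, holm_bonferroni(pvals))))
```

Under these assumptions, a co-occurrence would be kept as a network edge only if its entry in the list returned by `holm_bonferroni` is `True`. The exact arithmetic with `math.comb` is convenient for small counts; for corpus-scale counts a log-space or library implementation would likely be preferable.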



All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work is prohibited without the permission of the copyright holder.