Human language through the lens of large word co-occurrence networks

Mosquera Dopico, Sergio

dc.contributor	Ferrer Cancho, Ramon
dc.contributor.author	Mosquera Dopico, Sergio
dc.contributor.other	Universitat Politècnica de Catalunya. Departament de Ciències de la Computació
dc.date.accessioned	2021-02-23T09:30:08Z
dc.date.available	2021-02-23T09:30:08Z
dc.date.issued	2020-10-29
dc.identifier.uri	http://hdl.handle.net/2117/340341
dc.description.abstract	We revisit the article by Ferrer-i-Cancho & Solé (2001) on the word co-occurrence networks of the British National Corpus, improving and extending it in several directions. First, we replace the statistically flawed criterion for filtering out non-significant co-occurrences with a new method that combines Fisher's exact test with a Holm-Bonferroni correction to control for multiple comparisons. Second, we introduce new measures, such as closeness centrality, to measure vertex-vertex distance, and we find that the significant co-occurrences reported by our method mostly involve words with larger centrality. We also study the degree mixing of the network using the mixing coefficient defined by Newman to characterize the disassortative nature of the words in the corpus. Third, we study phenomena common in word co-occurrence networks, such as the small-world property and the emergence of Zipf's law. We find that networks from which non-significant co-occurrences have been filtered out lose the small-world property, whereas unfiltered networks exhibit it; in both cases, however, the degree distribution follows a double Zipf's law, with a different exponent in each regime. Fourth, we investigate how randomizing the corpus and changing the significance level affect the properties of the network. We find that the original filtering technique fails to detect the lack of significance of word co-occurrences in a randomized corpus. We also discuss the significance level used for filtering, which affects the properties of the resulting network. To conclude, we relate the results of our analysis to phenomena of human language, such as the existence of two different registers that allow successful communication, as reported by Ferrer-i-Cancho.
dc.language.iso	eng
dc.publisher	Universitat Politècnica de Catalunya
dc.subject	UPC subject areas::Computer science
dc.subject.lcsh	Natural language processing (Computer science)
dc.subject.other	Word Co-occurrence Networks
dc.subject.other	Fisher's Exact Test
dc.subject.other	Holm-Bonferroni Correction
dc.subject.other	Small-World
dc.subject.other	Zipf's Law
dc.title	Human language through the lens of large word co-occurrence networks
dc.type	UPC Master thesis
dc.subject.lemac	Natural language processing (Computer science)
dc.subject.lemac	Zipf's law
dc.identifier.slug	153051
dc.rights.access	Open Access
dc.date.updated	2020-11-05T05:00:49Z
dc.audience.educationlevel	Master's
dc.audience.mediator	Facultat d'Informàtica de Barcelona
dc.audience.degree	Master's Degree in Innovation and Research in Informatics (2012 curriculum)

Files in this item

Name: 153051.pdf
Size: 10.45 MB
Format: PDF

This item appears in the following collections

Master in Innovation and Research in Informatics - MIRI [454]

UPCommons. UPC's open knowledge portal