GeBioToolkit: automatic extraction of gender-balanced multilingual corpus of Wikipedia biographies

Li Lin, Pau

File: 148699.pdf (606.4 KB)

Supervisor / director: Padró, Lluís; Ruiz Costa-Jussà, Marta

Document type: Official Master's Final Project

Date: 2020-01-31

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, its reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.

Abstract

We present GeBioToolkit, an automatic tool for extracting multilingual parallel corpora at sentence level, with document and gender information, from Wikipedia biographies. Despite the gender inequalities present on Wikipedia, the toolkit has been designed to extract a gender-balanced corpus. The toolkit can be customized to work with any number of languages, and it can also be adapted to domains unrelated to gender or biographies, such as the medical, financial or historical domains. In this work we present two corpora extracted with GeBioToolkit, GeBioCorpus v1 and v2. GeBioCorpus v1 is composed of 10,000 sentences in English, Spanish and Catalan, extracted directly with GeBioToolkit. GeBioCorpus v2 is composed of 2,000 sentences in English, Spanish and Catalan, which have been post-edited by native speakers to provide a high-quality dataset for machine translation evaluation. The two datasets are disjoint, which allows the user to train a model on GeBioCorpus v1 and to use GeBioCorpus v2 for testing and for analysing possible inequalities within Wikipedia.
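The two properties the abstract emphasises, gender balance and disjoint train/test corpora, can be illustrated with a minimal Python sketch. This is not GeBioToolkit's implementation: the record layout (keys `en`, `es`, `ca`, `gender`), the function names `balance_by_gender` and `are_disjoint`, the downsampling strategy and the toy sentences are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def balance_by_gender(records, seed=0):
    """Downsample so every gender label keeps the same number of records.

    Hypothetical helper: downsampling to the smallest group is one simple
    way to balance a corpus, not necessarily the toolkit's method.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["gender"]].append(rec)
    if not groups:
        return []
    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for gender in sorted(groups):
        balanced.extend(rng.sample(groups[gender], target))
    return balanced


def are_disjoint(corpus_a, corpus_b, lang="en"):
    """Return True if the corpora share no sentence in the given language."""
    sentences_a = {rec[lang] for rec in corpus_a}
    return all(rec[lang] not in sentences_a for rec in corpus_b)


# Toy, made-up parallel records (English, Spanish, Catalan).
records = [
    {"en": "She was a physicist.", "es": "Fue física.",
     "ca": "Va ser física.", "gender": "female"},
    {"en": "He was a painter.", "es": "Fue pintor.",
     "ca": "Va ser pintor.", "gender": "male"},
    {"en": "He wrote novels.", "es": "Escribió novelas.",
     "ca": "Va escriure novel·les.", "gender": "male"},
]

balanced = balance_by_gender(records)
train, test = balanced[:1], balanced[1:]
print(len(balanced), are_disjoint(train, test))
```

In this sketch, balance is enforced after extraction by discarding surplus records from the larger group; a real pipeline might instead balance while selecting articles, which the abstract does not specify.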

Subjects: Machine learning

Degree: Master's Degree in Artificial Intelligence (2017 curriculum)

URI: http://hdl.handle.net/2117/339749

Collections: Official master's degrees - Master in Artificial Intelligence - MAI [278]

UPCommons. Open knowledge portal of the UPC
