Universitat Politècnica de Catalunya

UPCommons. Global access to UPC knowledge


A pipeline for large raw text preprocessing and model training of language models at scale

Cite as: hdl:2117/343268

Author: Armengol Estapé, Jordi
Tutor / director: Ruiz Costa-Jussà, Marta; Melero Nogues, Maite
Covenantee: Universitat de Barcelona; Universitat Rovira i Virgili
Document type: Master thesis
Date: 2021-01-25
Rights access: Open Access
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder.
Abstract
The advent of Transformer-based (i.e., based on self-attention architectures) language models has revolutionized the entire field of Natural Language Processing (NLP). Once these models are pre-trained on large, unlabelled corpora, transfer learning can be applied to virtually all downstream tasks. The paradigmatic example is the BERT model. Recent works have proposed alternative pre-training algorithms and neural architectures for improving the efficiency or the performance of the models. Besides, the ecosystem of frameworks for using these models has flourished. Nevertheless, less attention has been paid to the practical issues of preparing new corpora for pre-training language models and training them effectively from scratch in High-Performance Computing (HPC) clusters. Preprocessing new corpora is critical for languages and domains that do not have enough published resources. In contrast, the practical details of training language models from scratch are less known than those for fine-tuning existing models. Also, if the quality of the data is enhanced, language and domain-specific language models have already been shown to outperform their multilingual and general-domain counterparts, at least in some cases. This project consists of developing a preprocessing and training pipeline for generating language models at scale, especially targeting under-resourced languages and domains. The preprocessing pipeline's crucial role consists of cleaning raw text and formatting it as needed while preserving document-level coherency (if possible) to learn long-range dependencies. Most of the existing data gathering and cleaning methods for NLP have focused more on quantity than quality. Since our approach aims to be compatible with low-resource languages and domains, the filtering should be as fine-grained as possible (or risk losing useful data). Unlike other works, we put special emphasis on the generation of resources for training these models. Regarding training, learning large models from scratch presents several challenges, even if leveraging existing libraries. Apart from adapting to the specifics of an HPC cluster and a careful choice of hyperparameters, ideally, the training procedure should be relatively low-resource-friendly. We show our system's application for generating new corpora in real-world use cases and how these data can be effectively used for training models from scratch.
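The pipeline itself is described in the attached PDF; the following is only a minimal, hypothetical Python sketch of the two ideas the abstract emphasises: cleaning and filtering at a fine-grained (paragraph) level, so that little useful data is lost, while keeping each document together as a unit so that document-level coherency and long-range dependencies are preserved. The Document class, the keep_paragraph heuristic and its thresholds are illustrative assumptions, not code or parameter values from the thesis.

    import re
    from dataclasses import dataclass
    from typing import Iterable, Iterator, List

    @dataclass
    class Document:
        """Raw document kept as one unit so long-range context survives cleaning."""
        paragraphs: List[str]

    def clean_paragraph(text: str) -> str:
        """Normalise whitespace and drop non-printable characters."""
        text = re.sub(r"\s+", " ", text).strip()
        return "".join(ch for ch in text if ch.isprintable())

    def keep_paragraph(text: str, min_chars: int = 30, max_nonalpha_ratio: float = 0.4) -> bool:
        """Fine-grained filter: discard only short or mostly non-alphabetic
        paragraphs instead of whole documents (illustrative thresholds)."""
        if len(text) < min_chars:
            return False
        nonalpha = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
        return nonalpha / len(text) <= max_nonalpha_ratio

    def preprocess(docs: Iterable[Document]) -> Iterator[str]:
        """Yield one cleaned text per surviving document, preserving document
        boundaries for the language-model training corpus."""
        for doc in docs:
            cleaned = (clean_paragraph(p) for p in doc.paragraphs)
            kept = [p for p in cleaned if keep_paragraph(p)]
            if kept:
                yield "\n".join(kept)

    if __name__ == "__main__":
        sample = [Document(paragraphs=[
            "  This paragraph has  enough clean text to be kept in the corpus.  ",
            "@@@ ###",
            "ok",
        ])]
        for text in preprocess(sample):
            print(text)

In a real run such a step would stream over millions of raw documents and emit one record per document to the files consumed by the training stage; the sketch only shows where paragraph-level filtering sits relative to document-level output.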
Subjects: Natural language processing (Computer science), Deep learning, Tractament del llenguatge natural (Informàtica), Aprenentatge profund
Degree: MÀSTER UNIVERSITARI EN INTEL·LIGÈNCIA ARTIFICIAL (Pla 2017)
URI: http://hdl.handle.net/2117/343268
Collections
  • Màsters oficials - Master in Artificial Intelligence - MAI [341]

Files
  • 155957.pdf (PDF, 2,179 MB)


© UPC. Servei de Biblioteques, Publicacions i Arxius

info.biblioteques@upc.edu

  • Metadata under CC0