Quality-driven synthetic text generation for multilingual speech translation with audio large language models
Abstract
This thesis explores quality-driven synthetic data generation as a scalable solution for multilingual speech-to-text translation (S2TT), focusing on Iberian languages with limited natural training data. Using large language models (LLMs) and rigorous reference-free quality filtering with BLASER 2.0, an end-to-end pipeline was implemented to generate millions of high-quality synthetic translations. The approach yields substantial improvements in translation quality and semantic similarity for low-resource languages such as Asturian and Occitan, while scaling efficiently to diverse linguistic domains. Experimental results show that models trained on filtered synthetic data achieve competitive and often state-of-the-art performance on S2TT tasks, and that they narrow the gap between direct and Chain-of-Thought cascade architectures. This work provides foundational evidence that scalable, quality-centric synthetic data pipelines are powerful enablers of inclusive, robust multilingual speech technologies, especially where manual annotation remains costly or infeasible.
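
The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the `Candidate` type, the `filter_by_quality` helper, the `toy_score` function, and the threshold value are all hypothetical names and values introduced here for illustration. In a real pipeline the scoring callable would presumably wrap a reference-free quality estimator such as BLASER 2.0; the placeholder below is only a length-ratio heuristic so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    """One synthetic training pair (hypothetical structure)."""
    source: str       # source-language transcript
    translation: str  # LLM-generated synthetic translation


def filter_by_quality(
    candidates: Iterable[Candidate],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[Candidate]:
    """Keep only the candidates whose reference-free score reaches the threshold."""
    return [c for c in candidates if score_fn(c.source, c.translation) >= threshold]


def toy_score(source: str, translation: str) -> float:
    """Placeholder scorer: ratio of word counts, in [0, 1].

    This is NOT BLASER 2.0. It stands in for any reference-free
    quality estimator that maps (source, translation) to a score.
    """
    n_src, n_tgt = len(source.split()), len(translation.split())
    if n_src == 0 or n_tgt == 0:
        return 0.0
    return min(n_src, n_tgt) / max(n_src, n_tgt)


if __name__ == "__main__":
    pool = [
        Candidate("hello world", "hola món"),
        Candidate("good morning everyone", ""),  # empty output is rejected
    ]
    # The threshold 0.5 is an arbitrary illustrative value.
    kept = filter_by_quality(pool, toy_score, threshold=0.5)
    print(len(kept), "of", len(pool), "candidates kept")
```

Passing the scorer as a callable keeps the filter independent of any particular quality model, so a different estimator or threshold can be swapped in without touching the filtering logic.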