VEU - Grup de Tractament de la Parla

http://hdl.handle.net/2117/3746
Fri, 19 Apr 2024 21:45:17 GMT

Evaluating gender bias in speech translation
http://hdl.handle.net/2117/387472
Ruiz Costa-Jussà, Marta; Basta, Christine Raouf Saad; Gallego Olsina, Gerard Ion
The scientific community is increasingly aware of the necessity to embrace pluralism and consistently represent major and minor social groups. Currently, there are no standard evaluation techniques for different types of biases. Accordingly, there is an urgent need to provide evaluation sets and protocols to measure existing biases in our automatic systems. Evaluating the biases should be an essential step towards mitigating them in the systems. This paper introduces WinoST, a new freely available challenge set for evaluating gender bias in speech translation. WinoST is the speech version of WinoMT, an MT challenge set, and both follow an evaluation protocol to measure gender accuracy. Using a state-of-the-art end-to-end speech translation system, we report the gender bias evaluation on four language pairs and we show that gender accuracy in speech translation is more than 23% lower than in MT.
Tue, 16 May 2023 09:30:02 GMT

SHAS: approaching optimal segmentation for end-to-end speech translation
http://hdl.handle.net/2117/387121
Tsiamas, Ioannis; Gallego Olsina, Gerard Ion; Fonollosa, José A. R.; Ruiz Costa-Jussà, Marta
Speech translation models are unable to directly process long audio recordings, like TED talks, which have to be split into shorter segments. Speech translation datasets provide manual segmentations of the audio, which are not available in real-world scenarios, and existing segmentation methods usually significantly reduce translation quality at inference time. To bridge the gap between the manual segmentation of training and the automatic one at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. First, we train a classifier to identify the included frames in a segmentation, using speech representations from a pre-trained wav2vec 2.0. The optimal splitting points are then found by a probabilistic Divide-and-Conquer algorithm that progressively splits at the frame of lowest probability until all segments are below a pre-specified length. Experiments on MuST-C and mTEDx show that the translation of the segments produced by our method approaches the quality of the manual segmentation on 5 language pairs. Namely, SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods. Our method is additionally generalizable to different domains and achieves high zero-shot performance in unseen languages.
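The divide-and-conquer step described in the SHAS abstract can be illustrated with a minimal sketch. It assumes the per-frame probabilities have already been produced by some classifier; the function name `split_by_lowest_probability` and the parameter `max_len` are hypothetical, and the published method may differ in details (for example, trimming low-probability frames at segment edges), so this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def split_by_lowest_probability(
    probs: Sequence[float], max_len: int
) -> List[Tuple[int, int]]:
    """Split [0, len(probs)) into half-open segments of at most max_len frames.

    Any segment that is too long is cut at its lowest-probability frame
    (hypothetical simplification: that frame starts the right-hand segment).
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")

    segments: List[Tuple[int, int]] = []
    pending = [(0, len(probs))]
    while pending:
        start, end = pending.pop()
        if end - start <= max_len:
            segments.append((start, end))
            continue
        # Candidate cut points leave at least one frame on each side.
        cut = min(range(start + 1, end), key=lambda i: probs[i])
        pending.append((start, cut))
        pending.append((cut, end))
    return sorted(segments)


frame_probs = [0.9, 0.8, 0.1, 0.9, 0.95, 0.2, 0.9, 0.9]
print(split_by_lowest_probability(frame_probs, max_len=3))
```

With these illustrative probabilities, the first cut falls at frame 2 and the second at frame 5, so no segment exceeds three frames.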
Article under review at Interspeech 2022
Thu, 04 May 2023 21:53:30 GMT

Language modelling for speaker diarization in telephonic interviews
http://hdl.handle.net/2117/374077
India Massana, Miquel Àngel; Hernando Pericás, Francisco Javier; Rodríguez Fonollosa, José Adrián
The aim of this paper is to investigate the benefit of combining both language and acoustic modelling for speaker diarization.
Although conventional systems only use acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic information. In this study we analyze how an appropriate fusion of both kinds of features can obtain good results in these cases. The proposed system is based on an iterative algorithm where an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a Call-Center database, which is composed of telephone interview recordings. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of a word-level DER as compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks.
Thu, 06 Oct 2022 10:54:06 GMT

Attention weights in transformer NMT fail aligning words between sequences but largely explain model predictions
http://hdl.handle.net/2117/369772
Ferrando Monsonís, Javier; Ruiz Costa-Jussà, Marta
This work proposes an extensive analysis of the Transformer architecture in the Neural Machine Translation (NMT) setting. Focusing on the encoder-decoder attention mechanism, we prove that attention weights systematically make alignment errors by relying mainly on uninformative tokens from the source sequence. However, we observe that NMT models assign attention to these tokens to regulate the contribution in the prediction of the two contexts, the source and the prefix of the target sequence. We provide evidence about the influence of wrong alignments on the model behavior, demonstrating that the encoder-decoder attention mechanism is well suited as an interpretability method for NMT. Finally, based on our analysis, we propose methods that largely reduce the word alignment error rate compared to standard induced alignments from attention weights.
Thu, 07 Jul 2022 10:23:16 GMT

A genetic programming approach for economic forecasting with survey expectations
http://hdl.handle.net/2117/369742
Claveria González, Oscar; Monte Moreno, Enrique; Torra Porras, Salvador
We apply a soft computing method to generate country-specific economic sentiment indicators that provide estimates of year-on-year GDP growth rates for 19 European economies. First, genetic programming is used to evolve business and consumer economic expectations to derive sentiment indicators for each country. To assess the performance of the proposed indicators, we first design a nowcasting experiment in which we recursively generate estimates of GDP at the end of each quarter, using the latest business and consumer survey data available. Second, we design a forecasting exercise in which we iteratively re-compute the sentiment indicators in each out-of-sample period. When evaluating the accuracy of the predictions obtained for different forecast horizons, we find that the evolved sentiment indicators outperform the time-series models used as a benchmark. These results show the potential of the proposed approach for prediction purposes.
Thu, 07 Jul 2022 06:16:31 GMT

Systematic detection of anomalous ionospheric perturbations above LEOs from GNSS POD Data including possible tsunami signatures
http://hdl.handle.net/2117/369717
Yang, Heng; Hernández Pajares, Manuel; Jarmolowski, Wojciech; Wielgosz, Pawel; Vadas, Sharon L.; Colombo, Oscar L.; Monte Moreno, Enrique; García Rigo, Alberto; Graffigna, Victoria; Krypiak-Gregorczyk, Anna; Milanowska, Beata; Bofill Soliguer, Pablo; Olivares Pulido, Germán; Liu, Qi; Haagmans, Roger
In this article, we show the capability of global navigation satellite system (GNSS) precise orbit determination (POD) low Earth orbit (LEO) data to detect anomalous ionospheric disturbances in the spectral range of the signals associated with earthquakes and tsunamis, applied to two of these events in Papua New Guinea (PNG) and the Solomon Islands during 2016. This is achieved thanks to the new PIES approach (POD-GNSS LEO Detrended Ionospheric Electron Content Significant Deviations). The significance of such ionospheric signals above the Swarm LEOs is confirmed with different types of independent data: in situ electron density measurements provided by the Langmuir Probe (LP) onboard Swarm LEOs, DORIS, and ground-based GNSS colocated measurements, as described in this article. In this way, we conclude the possible detection of the tsunami-related ionospheric gravity wave in the PNG 2016 event, consistent with the most recent theory, which shows that a tsunami (which is localized in space and time) excites a spectrum of gravity waves, some of which have faster horizontal phase speeds than the tsunami.
We believe that this work also shows the feasibility of a future potential monitoring system of ionospheric disturbances, to be made possible by hundreds of CubeSats with POD GNSS receivers among other appropriate sensors, and supported for real-time or near real-time confirmation and characterization by thousands of worldwide existing ground GNSS receivers.
© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Wed, 06 Jul 2022 11:07:53 GMT

On the locality of attention in direct speech translation
http://hdl.handle.net/2117/369036
Alastruey Lasheras, Belén; Ferrando Monsonís, Javier; Gallego Olsina, Gerard Ion; Ruiz Costa-Jussà, Marta
Transformers have achieved state-of-the-art results across multiple NLP tasks. However, the complexity of the self-attention mechanism scales quadratically with the sequence length, creating an obstacle for tasks involving long sequences, as in the speech domain. In this paper, we discuss the usefulness of self-attention for Direct Speech Translation. First, we analyze the layer-wise token contributions in the self-attention of the encoder, unveiling local diagonal patterns. To prove that some attention weights are avoidable, we propose to substitute the standard self-attention with an efficient local one, setting the amount of context used based on the results of the analysis. With this approach, our model matches the baseline performance, and improves the efficiency by skipping the computation of those weights that standard attention discards.
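As a rough illustration of what restricting self-attention to a local context means, the sketch below normalizes each row of a score matrix only over positions within a fixed distance of the query. The function name `local_attention_weights` and the `window` parameter are hypothetical, and the abstract does not specify which efficient local attention variant the authors use, so this shows the general idea only.

```python
import math
from typing import List


def local_attention_weights(
    scores: List[List[float]], window: int
) -> List[List[float]]:
    """Row-wise softmax restricted to positions j with |i - j| <= window.

    Positions outside the window are never visited, so the work per row is
    proportional to the window size rather than to the sequence length.
    """
    n = len(scores)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        row_max = max(scores[i][lo:hi])  # subtract the max for stability
        exps = [math.exp(scores[i][j] - row_max) for j in range(lo, hi)]
        total = sum(exps)
        for offset, value in enumerate(exps):
            weights[i][lo + offset] = value / total
    return weights


uniform_scores = [[0.0] * 4 for _ in range(4)]
for row in local_attention_weights(uniform_scores, window=1):
    print(row)
```

With uniform scores and `window=1`, the first row spreads its weight over positions 0 and 1 only, and every row still sums to one.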
Thu, 23 Jun 2022 07:12:55 GMT

Multilingual machine translation: Deep analysis of language-specific encoder-decoders
http://hdl.handle.net/2117/368571
Escolano Peinado, Carlos; Ruiz Costa-Jussà, Marta; Rodríguez Fonollosa, José Adrián
State-of-the-art multilingual machine translation relies on a shared encoder-decoder. In this paper, we propose an alternative approach based on language-specific encoder-decoders, which can be easily extended to new languages by learning their corresponding modules. To establish a common interlingua representation, we simultaneously train N initial languages. Our experiments show that the proposed approach improves over the shared encoder-decoder for the initial languages and when adding new languages, without the need to retrain the remaining modules.
All in all, our work closes the gap between shared and language-specific encoder-decoders, advancing toward modular multilingual machine translation systems that can be flexibly extended in lifelong learning settings.
Thu, 16 Jun 2022 10:05:12 GMT

High frequent in-domain word segmentation and forward translation for the WMT21 Biomedical task
http://hdl.handle.net/2117/366780
Rafieian, Bardia; Ruiz Costa-Jussà, Marta
This paper reports the optimization of using the out-of-domain data in the Biomedical translation task. We first optimized our parallel training dataset using the BabelNet in-domain terminology words. Afterward, to increase the training set, we studied the effects of the out-of-domain data on biomedical translation tasks, and we created a mixture of in-domain and out-of-domain training sets and added more in-domain data using forward translation in the English-Spanish task.
Finally, with a simple BPE optimization method, we increased the number of in-domain subwords in our mixed training set and trained the Transformer model on the generated data. Results show improvements using our proposed method.
© 2021 Association for Computational Linguistics
Wed, 04 May 2022 09:37:04 GMT

Enhancing sequence-to-sequence modeling for RDF triples to natural text
http://hdl.handle.net/2117/366257
Domingo Roig, Oriol; Bergés Lladó, David; Cantenys Sabà, Roser; Creus Castanyer, Roger; Rodríguez Fonollosa, José Adrián
This work establishes key guidelines on how, which, and when Machine Translation (MT) techniques are worth applying to the RDF-to-Text task. Not only do we apply and compare the most prominent MT architecture, the Transformer, but we also analyze state-of-the-art techniques such as Byte Pair Encoding or Back Translation to demonstrate an improvement in generalization. In addition, we empirically show how to tailor these techniques to enhance models relying on learned embeddings rather than using pretrained ones.
Automatic metrics suggest that Back Translation can significantly improve model performance, by up to 7 BLEU points, opening a window for surpassing state-of-the-art results with appropriate architectures.
Fri, 22 Apr 2022 11:19:00 GMT