Direct expressive voice training based on semantic selection
DOI: 10.21437/Interspeech.2016-979
Cite as: hdl:2117/100351
Document type: Conference proceedings paper
Publication date: 2016
Publisher: International Speech Communication Association (ISCA)
Access conditions: Restricted access due to publisher policy
All rights reserved. This work is protected by the corresponding intellectual and industrial
property rights. Without prejudice to existing legal exemptions, its reproduction, distribution,
public communication or transformation is prohibited without the authorization of the rights holder.
Abstract
This work aims at creating expressive voices from audiobooks using semantic selection. First, for each
utterance of the audiobook an acoustic feature vector is extracted, including iVectors built on MFCC and on F0 basis.
Then, the transcription is projected into a semantic vector space. A seed utterance is projected to the semantic vector space and the N nearest neighbors are selected. The selection is then filtered by selecting only acoustically similar
data. The proposed technique can be used to train emotional voices by using emotional keywords or phrases as
seeds, obtaining training data semantically similar to the seed. It can also be used to read larger texts in an expressive
manner, creating specific voices for each sentence. The latter application is compared to a DNN predictor, which
predicts acoustic features from semantic features. The selected data is used to adapt statistical speech synthesis
models. The performance of the technique is analyzed objectively and in a perceptual experiment. In the first part of
the experiment, subjects clearly show preference for particular expressive voices to synthesize semantically expressive
utterances. In the second part, the proposed method is shown to achieve similar or better performance than the
DNN-based prediction. Copyright © 2016 ISCA.
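The two-stage selection described in the abstract (N nearest neighbors of a seed in the semantic space, then an acoustic filter) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the similarity measure or the filtering rule, so the cosine similarity, the centroid-distance filter, the function name `select_training_utterances`, the parameter `keep_fraction`, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return (matrix @ vector) / (norms + 1e-12)


def select_training_utterances(semantic, acoustic, seed, n_neighbors,
                               keep_fraction=0.5):
    """Return sorted indices of utterances chosen for voice adaptation.

    semantic: (num_utterances, d_sem) semantic vectors of the transcriptions
    acoustic: (num_utterances, d_ac) acoustic feature vectors
    seed:     (d_sem,) semantic vector of the seed utterance
    """
    # Stage 1: the N utterances semantically nearest to the seed.
    # Cosine similarity is an assumed choice of metric.
    similarities = cosine_similarity(semantic, seed)
    nearest = np.argsort(-similarities)[:n_neighbors]

    # Stage 2 (assumed rule): keep the fraction of those neighbors that lies
    # closest to their own acoustic centroid, dropping acoustic outliers.
    centroid = acoustic[nearest].mean(axis=0)
    distances = np.linalg.norm(acoustic[nearest] - centroid, axis=1)
    n_keep = max(1, int(round(keep_fraction * len(nearest))))
    kept = nearest[np.argsort(distances)[:n_keep]]
    return np.sort(kept)


# Toy data: five utterances with 2-D semantic and acoustic vectors.
semantic = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]])
acoustic = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1], [9.0, 9.0]])
seed = np.array([1.0, 0.0])

selected = select_training_utterances(semantic, acoustic, seed,
                                      n_neighbors=3, keep_fraction=0.67)
print(selected)
```

In this toy run, utterances 0, 1 and 2 are the semantic neighbors of the seed, and utterance 2 is then dropped as the acoustic outlier. In the setting the abstract describes, the acoustic vectors would include the iVectors built on MFCC and F0, and the surviving utterances would be used to adapt the statistical synthesis models.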
Citation: Jauk, I., Bonafonte, A. Direct expressive voice training based on semantic selection. In: Annual Conference of the International Speech Communication Association. "INTERSPEECH 2016: September 8-12, 2016, San Francisco, USA". San Francisco, CA: International Speech Communication Association (ISCA), 2016, p. 3181-3185.
ISBN: 1990-9770
Publisher version: http://www.isca-speech.org/archive/Interspeech_2016/pdfs/0979.PDF
Files | Description | Size | Format | View |
---|---|---|---|---|
0979.PDF | | 255.7 KB | | Restricted access |