Direct expressive voice training based on semantic selection

Jauk, Igor; Bonafonte Cávez, Antonio

doi:10.21437/Interspeech.2016-979


Document type: Text in conference proceedings

Publication date: 2016

Publisher: International Speech Communication Association (ISCA)

Access conditions: Restricted access by publisher policy

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication, or transformation without the authorization of the rights holder is prohibited.

Abstract

This work aims at creating expressive voices from audiobooks using semantic selection. First, an acoustic feature vector is extracted for each utterance of the audiobook, including iVectors built on MFCC and on F0. Then, the transcription of each utterance is projected into a semantic vector space. A seed utterance is projected into the same semantic vector space and its N nearest neighbors are selected. The selection is then filtered to keep only acoustically similar data. The proposed technique can be used to train emotional voices by using emotional keywords or phrases as seeds, obtaining training data semantically similar to the seed. It can also be used to read larger texts in an expressive manner, creating a specific voice for each sentence. The latter application is compared to a DNN predictor, which predicts acoustic features from semantic features. The selected data is used to adapt statistical speech synthesis models. The performance of the technique is analyzed objectively and in a perceptual experiment. In the first part of the experiment, subjects show a clear preference for particular expressive voices when synthesizing semantically expressive utterances. In the second part, the proposed method is shown to achieve similar or better performance than the DNN-based prediction. Copyright © 2016 ISCA.
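The two-stage selection described in the abstract (semantic nearest neighbors, then an acoustic filter) might be sketched as follows. This is only an illustration, not the authors' implementation: the abstract does not state the distance measure or the filtering rule, so cosine similarity, the centroid-based filter, the function name select_training_utterances and the keep_fraction parameter are all assumptions made here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_training_utterances(seed_semantic, semantic_vectors, acoustic_vectors,
                               n_neighbors, keep_fraction=0.5):
    """Return sorted indices of utterances chosen for adaptation.

    Hypothetical sketch: cosine similarity and the centroid filter are
    assumptions, not details taken from the paper.
    """
    # Stage 1: keep the N utterances whose semantic vectors are closest to the seed.
    ranked = sorted(
        range(len(semantic_vectors)),
        key=lambda i: cosine_similarity(seed_semantic, semantic_vectors[i]),
        reverse=True,
    )
    neighbors = ranked[:n_neighbors]
    if not neighbors:
        return []

    # Stage 2 (assumed rule): compute the acoustic centroid of the neighbors and
    # keep only the fraction of them that is most similar to that centroid.
    dim = len(acoustic_vectors[neighbors[0]])
    centroid = [
        sum(acoustic_vectors[i][d] for i in neighbors) / len(neighbors)
        for d in range(dim)
    ]
    by_acoustic = sorted(
        neighbors,
        key=lambda i: cosine_similarity(centroid, acoustic_vectors[i]),
        reverse=True,
    )
    n_keep = max(1, math.ceil(keep_fraction * len(neighbors)))
    return sorted(by_acoustic[:n_keep])
```

In this sketch the seed would be the semantic vector of an emotional keyword or phrase, or of the sentence to be read, and the returned indices identify the utterances that would serve as adaptation data for the synthesis models.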

Citation: Jauk, I., Bonafonte, A. Direct expressive voice training based on semantic selection. In: Annual Conference of the International Speech Communication Association. "INTERSPEECH 2016: September 8-12, 2016, San Francisco, USA". San Francisco, CA: International Speech Communication Association (ISCA), 2016, p. 3181-3185.

URI: http://hdl.handle.net/2117/100351

DOI: 10.21437/Interspeech.2016-979

ISBN: 1990-9770

Publisher's version: http://www.isca-speech.org/archive/Interspeech_2016/pdfs/0979.PDF


Files	Description	Size	Format	View
0979.PDF		255.7 KB	PDF	Restricted access
