Acoustic feature prediction from semantic features for expressive speech using deep neural networks
Document typeConference report
PublisherInstitute of Electrical and Electronics Engineers (IEEE)
Rights accessRestricted access - publisher's policy
The goal of the study is to predict acoustic features of expressive speech from semantic vector space representations. Though a lot of successful work was invested in expressiveness analysis and prediction, the results often involve manual labeling, or indirect prediction evaluation such as speech synthesis. The proposed analysis aims at direct acoustic feature prediction and comparison to original acoustic features from an audiobook. The audiobook is mapped in a semantic vector space. A set of acoustic features is extracted from the same utterances, involving iVectors trained on MFCC and F0 basis. Two regression models are trained with the semantic coordinates, DNNs and a baseline CART. Later, semantic and acoustic context features are combined for the prediction. The prediction is achieved successfully using the DNNs. A closer analysis shows that the prediction works best for larger utterances or utterances with specific contexts, and worst for general short utterances and proper names.
CitationJauk, I., Bonafonte, A., Pascual, S. Acoustic feature prediction from semantic features for expressive speech using deep neural networks. A: European Signal Processing Conference. "2016 24th European Signal Processing Conference (EUSIPCO): took place 28 August-2 September 2016 in Budapest, Hungary". Institute of Electrical and Electronics Engineers (IEEE), 2016, p. 2320-2324.