Multi-output RNN-LSTM for multiple speaker speech synthesis and adaptation
Document type: Conference report
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Rights access: Restricted access - publisher's policy
Deep learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis with multiple speakers. Some hidden layers are shared by all the speakers, while each speaker has its own specific output layer. Objective and perceptual experiments show that this scheme produces much better results than a single-speaker model. Moreover, we also tackle the problem of speaker adaptation by adding a new output branch to the model and training it successfully, without modifying the base optimized model. This fine-tuning method achieves better results than training the new speaker from scratch with its own model.
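The shared-trunk, per-speaker-head idea described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: a single dense layer stands in for the stacked LSTM layers, and all names, dimensions, and initializations are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper):
# linguistic input features -> shared hidden size -> acoustic output features
FEAT_IN, HIDDEN, FEAT_OUT = 40, 64, 25

# Shared parameters, trained jointly on all speakers' data.
# In the paper these would be recurrent LSTM layers; a dense
# layer is used here only to keep the sketch short.
W_shared = rng.standard_normal((FEAT_IN, HIDDEN)) * 0.1

# One output head per speaker (speaker-specific output layers).
heads = {
    "speaker_A": rng.standard_normal((HIDDEN, FEAT_OUT)) * 0.1,
    "speaker_B": rng.standard_normal((HIDDEN, FEAT_OUT)) * 0.1,
}

def forward(x, speaker):
    """Shared trunk followed by the selected speaker's output branch."""
    h = np.tanh(x @ W_shared)   # shared representation
    return h @ heads[speaker]   # speaker-specific output layer

# Adaptation as described in the abstract: add a new output branch
# for the new speaker and train only it, leaving W_shared frozen.
heads["speaker_new"] = rng.standard_normal((HIDDEN, FEAT_OUT)) * 0.1

x = rng.standard_normal((5, FEAT_IN))   # a mini-batch of 5 frames
y = forward(x, "speaker_new")
print(y.shape)                          # (5, 25)
```

During adaptation, only `heads["speaker_new"]` would receive gradient updates; the shared trunk stays fixed, which is why the base optimized model does not need to be modified.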
Citation: Pascual, S., Bonafonte, A. Multi-output RNN-LSTM for multiple speaker speech synthesis and adaptation. In: "2016 24th European Signal Processing Conference (EUSIPCO)", 28 August-2 September 2016, Budapest, Hungary. Institute of Electrical and Electronics Engineers (IEEE), 2016, p. 2325-2329.