Self multi-head attention for speaker recognition
Document type: Conference lecture
Publisher: International Speech Communication Association (ISCA)
Rights access: Open Access
Most state-of-the-art Deep Learning (DL) approaches for speaker recognition work at a short-utterance level. Given the speech signal, these algorithms extract a sequence of speaker embeddings from short segments, which are then averaged to obtain an utterance-level speaker representation. In this work we propose the use of an attention mechanism to obtain a discriminative speaker embedding given non-fixed-length speech utterances. Our system is based on a Convolutional Neural Network (CNN) that encodes short-term speaker features from the spectrogram and a self multi-head attention model that maps these representations into a long-term speaker embedding. The attention model that we propose produces multiple alignments from different subsegments of the CNN encoded states over the sequence. Hence this mechanism works as a pooling layer which selects the most discriminative features over the sequence to obtain an utterance-level representation. We have tested this approach on the verification task of the VoxCeleb1 dataset. The results show that self multi-head attention outperforms both temporal and statistical pooling methods, with an 18% relative improvement in EER. The obtained results also show a 58% relative improvement in EER compared to i-vector+PLDA.
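The pooling mechanism described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the encoder output is split into `n_heads` sub-dimensions, each head scores every frame with its own (here randomly initialized, in practice trainable) vector, a softmax over time produces per-head alignments, and the per-head weighted averages are concatenated into one utterance-level embedding. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_pool(hidden, u, n_heads):
    """Pool a variable-length sequence of encoder states into one embedding.

    hidden: (T, D) CNN-encoded frame-level features.
    u:      (D,)   attention parameters, split into one scoring
                   vector of size D // n_heads per head (assumed shape).
    Returns a (D,) utterance-level embedding.
    """
    T, D = hidden.shape
    d = D // n_heads
    heads = hidden.reshape(T, n_heads, d)   # split features across heads
    u_h = u.reshape(n_heads, d)
    # One alignment score per frame and per head, scaled by sqrt(d).
    scores = np.einsum('thd,hd->th', heads, u_h) / np.sqrt(d)
    align = softmax(scores, axis=0)         # softmax over the time axis
    # Per-head weighted average over time, then concatenate the heads.
    pooled = np.einsum('th,thd->hd', align, heads)
    return pooled.reshape(D)

# Toy usage: a 50-frame sequence of 64-dim encoder states, 4 heads.
rng = np.random.default_rng(0)
emb = multi_head_attention_pool(rng.standard_normal((50, 64)),
                                rng.standard_normal(64), n_heads=4)
assert emb.shape == (64,)
```

Note that with uniform alignments this pooling reduces exactly to the temporal (mean) pooling baseline mentioned in the abstract; the learned scoring vectors are what let each head emphasize different, more discriminative frames.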
Citation: India, M.; Safari, P.; Hernando, J. Self multi-head attention for speaker recognition. A: Annual Conference of the International Speech Communication Association. "Interspeech 2019: the 20th Annual Conference of the International Speech Communication Association: 15-19 September 2019: Graz, Austria". Baixas: International Speech Communication Association (ISCA), 2019, p. 4305-4309.