Deep learning backend for single and multisession i-vector speaker recognition

Ghahabi Esfahani, Omid; Hernando Pericás, Francisco Javier

doi:10.1109/TASLP.2017.2661705


Publisher's version. Open access at IEEE (1.122 MB)


Document type: Article

Publication date: 2017-04-01

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication, or transformation is prohibited without the authorization of the rights holder.

Abstract

The lack of labeled background data creates a large performance gap between the cosine and Probabilistic Linear Discriminant Analysis (PLDA) scoring baselines for i-vectors in speaker recognition. Unsupervised clustering techniques can estimate the labels, but they cannot accurately predict the true labels, and they assume that the background data contain several samples from each speaker, which may not hold in practice. In this paper, the authors use Deep Learning (DL) to fill this performance gap given unlabeled background data. To this end, they propose an impostor selection algorithm and a universal model adaptation process in a hybrid system based on deep belief networks and deep neural networks to discriminatively model each target speaker. To gain more insight into the behavior of DL techniques, experiments are carried out for both single- and multisession speaker enrollment tasks. Experiments on the National Institute of Standards and Technology (NIST) 2014 i-vector challenge show that the proposed DL-based system fills 46% of this performance gap in terms of the minimum of the decision cost function. Furthermore, combining the scores of the proposed DL-based system and PLDA with estimated labels covers 79% of this gap.
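As a rough illustration of two quantities the abstract relies on, the sketch below computes a cosine score between two i-vectors and the fraction of a cost gap that a system closes. It is not the authors' implementation. The function names, the toy vectors, and the cost values are all hypothetical, and the gap is assumed to be measured as the drop in cost from the cosine baseline, divided by the distance between the cosine baseline and a PLDA reference.

```python
import math


def cosine_score(w_target, w_test):
    """Cosine similarity between two i-vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_target * norm_test)


def gap_filled(cost_cosine, cost_plda, cost_system):
    """Fraction of the cosine-to-PLDA cost gap closed by a system.

    Lower cost is better. A result of 0.0 means no improvement over the
    cosine baseline; 1.0 means the system matches the PLDA reference.
    This definition is an assumption made for illustration.
    """
    return (cost_cosine - cost_system) / (cost_cosine - cost_plda)


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real i-vectors have hundreds of dimensions.
    print(cosine_score([1.0, 0.0], [1.0, 0.0]))
    print(cosine_score([1.0, 0.0], [0.0, 1.0]))
    # Made-up cost values, not results from the paper.
    print(gap_filled(cost_cosine=0.4, cost_plda=0.3, cost_system=0.35))
```

Under this assumed definition, the 46% reported in the abstract would correspond to `gap_filled(...)` returning about 0.46 for the proposed system's minimum decision cost.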

Citation: Ghahabi, O., Hernando, J. Deep learning backend for single and multisession i-vector speaker recognition. "IEEE-ACM Transactions on Audio Speech and Language Processing", 1 April 2017, vol. 25, no. 4, p. 807-817.

URI: http://hdl.handle.net/2117/104282

DOI: 10.1109/TASLP.2017.2661705

ISSN: 2329-9290

Publisher's version: http://ieeexplore.ieee.org/document/7847321/?reload=true


Files	Description	Size	Format
07847321.pdf	Publisher's version. Open access at IEEE	1.122 MB	PDF

UPCommons. Open knowledge portal of the UPC
