Unsupervised training of siamese networks for speaker verification

Khan, Umair; Hernando Pericás, Francisco Javier

doi:10.21437/Interspeech.2020-1882



Document type: Text in conference proceedings

Publication date: 2020

Publisher: International Speech Communication Association (ISCA)

Access conditions: Open access

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, its reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.

Project: TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO (MINECO-TEC2015-69266-P)

Abstract

Speaker-labeled background data is an essential requirement for most state-of-the-art approaches in speaker recognition, e.g., x-vectors and i-vector/PLDA. In practice, however, it is difficult to access large amounts of labeled data. In this work, we propose siamese networks for speaker verification that are trained without speaker labels. We propose two siamese networks with two and three branches, respectively, where each branch is a CNN encoder. Since the goal is to avoid speaker labels, we generate the training pairs in an unsupervised manner. The client samples are selected within one database according to the highest cosine scores with the anchor in i-vector space. The impostor samples are selected in the same way but from another database. Our double-branch siamese network performs binary classification using cross-entropy loss during training. In the testing phase, we obtain speaker verification scores directly from its output layer. Our triple-branch siamese network, in contrast, is trained to learn speaker embeddings using triplet loss. During testing, we extract speaker embeddings from its output layer and score them using cosine scoring. The evaluation is performed on the VoxCeleb-1 database and shows that, with the proposed unsupervised systems, alone or in fusion, the results get closer to those of the supervised baseline.
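The unsupervised selection of training samples described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `top_k` parameter, the margin value, and the cosine-based form of the triplet loss are all assumptions made for illustration, and the i-vectors are assumed to be already extracted as rows of a matrix.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def cosine_scores(a, b):
    """Cosine score between every row of a and every row of b."""
    return l2_normalize(a) @ l2_normalize(b).T

def select_clients_and_impostors(in_db, other_db, top_k=1):
    """For each anchor i-vector (row of in_db), return indices of:
    - clients: the top_k highest-scoring rows of the same database
    - impostors: the top_k highest-scoring rows of another database
    No speaker labels are used. top_k is a hypothetical parameter."""
    same = cosine_scores(in_db, in_db)
    np.fill_diagonal(same, -np.inf)  # an anchor cannot be its own client
    other = cosine_scores(in_db, other_db)
    clients = np.argsort(-same, axis=1)[:, :top_k]
    impostors = np.argsort(-other, axis=1)[:, :top_k]
    return clients, impostors

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Assumed cosine-based triplet loss: push the anchor/client score above
    the anchor/impostor score by at least `margin` (value is illustrative)."""
    a = l2_normalize(anchor)
    pos = np.sum(a * l2_normalize(positive), axis=1)
    neg = np.sum(a * l2_normalize(negative), axis=1)
    return np.maximum(0.0, neg - pos + margin).mean()

# Toy i-vectors (2-dimensional only for readability).
in_db = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
other_db = np.array([[1.0, 1.0], [-1.0, 0.0]])
clients, impostors = select_clients_and_impostors(in_db, other_db)
```

In this sketch the selected indices would feed the siamese branches as (anchor, client, impostor) triplets for the triple-branch network, or as (anchor, client) and (anchor, impostor) pairs with binary targets for the double-branch network.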

Citation: Khan, U.; Hernando, J. Unsupervised training of siamese networks for speaker verification. In: Annual Conference of the International Speech Communication Association. "Interspeech 2020: the 20th Annual Conference of the International Speech Communication Association: 25-29 October 2020: Shanghai, China". Baixas: International Speech Communication Association (ISCA), 2020, p. 3002-3006. ISBN 1990-9772. DOI 10.21437/Interspeech.2020-1882.

URI: http://hdl.handle.net/2117/332092

DOI: 10.21437/Interspeech.2020-1882

ISBN: 1990-9772

Publisher version: http://dx.doi.org/10.21437/Interspeech.2020-1882
