
dc.contributor.author: India Massana, Miquel Àngel
dc.contributor.author: Sagastiberri, Itziar
dc.contributor.author: Palau Puigdevall, Ponç
dc.contributor.author: Sayrol Clols, Elisa
dc.contributor.author: Morros Rubió, Josep Ramon
dc.contributor.author: Hernando Pericás, Francisco Javier
dc.contributor.other: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.identifier.citation: India, M. [et al.]. UPC multimodal speaker diarization system for the 2018 Albayzin challenge. In: International Conference on Advances in Speech and Language Technologies for Iberian Languages. "IberSPEECH 2018: program and proceedings: 21-23 November 2018: Barcelona, Spain". Baixas: International Speech Communication Association (ISCA), 2018, p. 199-203.
dc.description.abstract: This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The approach processes the speech and image signals separately. In the speech domain, speaker diarization is performed using identity embeddings produced by a triplet-loss DNN that takes i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both the positive and the negative distances. A sliding window is then used to compare speech segments with the enrollment speaker targets, using the cosine distance between embeddings. To detect identities in the face modality, a face detector followed by a face tracker is applied to the videos. For each cropped face, a feature vector is obtained using a deep neural network based on the ResNet-34 architecture and trained with a metric-learning triplet loss (available in the dlib library). For each track, the face feature vector is obtained by averaging the features of all frames in that track. This feature vector is then compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database.
dc.format.extent: 5 p.
dc.publisher: International Speech Communication Association (ISCA)
dc.subject: UPC subject areas::Telecommunications engineering::Signal processing::Speech and acoustic signal processing
dc.subject.lcsh: Automatic speech recognition
dc.subject.other: Speaker diarization
dc.subject.other: Face diarization
dc.subject.other: Multimodal system
dc.title: UPC multimodal speaker diarization system for the 2018 Albayzin challenge
dc.type: Conference report
dc.subject.lemac: Reconeixement automàtic de la parla (Automatic speech recognition)
dc.contributor.group: Universitat Politècnica de Catalunya. GPI - Grup de Processament d'Imatge i Vídeo
dc.contributor.group: Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
dc.description.peerreviewed: Peer Reviewed
dc.rights.access: Open Access
dc.description.version: Postprint (published version)
local.citation.author: India, M.; Sagastiberri, I.; Palau, P.; Sayrol, E.; Morros, J.R.; Hernando, J.
local.citation.contributor: International Conference on Advances in Speech and Language Technologies for Iberian Languages
local.citation.publicationName: IberSPEECH 2018: program and proceedings: 21-23 November 2018: Barcelona, Spain
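
The abstract describes averaging per-frame face features over a track and comparing the result with enrollment features. The following is a minimal sketch of that matching step, not the authors' implementation. The function names, the toy 2-D vectors, the identity labels and the 0.5 threshold are all hypothetical, and the use of cosine similarity for the face comparison is an assumption: the abstract names cosine distance only for the speech embeddings and does not state the metric used for faces.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_embedding(frame_features):
    """Element-wise mean of the per-frame feature vectors of one track."""
    if not frame_features:
        raise ValueError("a track needs at least one frame")
    n = len(frame_features)
    return [sum(values) / n for values in zip(*frame_features)]


def identify(embedding, enrollment, threshold=0.5):
    """Return (name, score) for the best-matching enrolled identity.

    `enrollment` maps a hypothetical identity label to its reference vector.
    The name is None when the best score falls below `threshold`
    (an assumed, untuned value).
    """
    best_name, best_score = None, -1.0
    for name, reference in enrollment.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy usage: two frames of one face track and two enrolled identities.
track = [[1.0, 0.0], [0.8, 0.2]]
enrollment = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
name, score = identify(average_embedding(track), enrollment)
```

The same `identify` call could in principle be applied to each speech window's embedding as the sliding window advances, which is how the abstract describes the comparison with enrollment speaker targets; window length and hop size are not given in the record and are left out here.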


All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder.