UPC multimodal speaker diarization system for the 2018 Albayzin challenge
Document type: Conference report
Publisher: International Speech Communication Association (ISCA)
Rights access: Open Access
This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The approach processes the speech and image signals separately. In the speech domain, speaker diarization is performed using identity embeddings created by a triplet loss DNN that takes i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both positive and negative distances. A sliding window is then used to compare speech segments with enrollment speaker targets using the cosine distance between the embeddings. To detect identities from the face modality, a face detector followed by a face tracker is applied to the videos. For each cropped face, a feature vector is obtained using a Deep Neural Network based on the ResNet-34 architecture, trained using a metric learning triplet loss (available in the dlib library). For each track, the face feature vector is obtained by averaging the features obtained for each of the frames of that track. This feature vector is then compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database.
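The comparison steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional embeddings, the nearest-target decision rule, and the `max_distance` rejection threshold are all assumptions made for illustration. Only the use of cosine distance against enrollment targets and the averaging of per-frame face features over a track come from the text above.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_track(frame_features):
    """Average the per-frame feature vectors of one face track."""
    n = len(frame_features)
    return [sum(column) / n for column in zip(*frame_features)]


def closest_identity(embedding, enrollment, max_distance=0.5):
    """Return (name, distance) of the nearest enrollment target.

    The name is None when even the nearest target is farther than
    max_distance (a hypothetical rejection threshold).
    """
    best_name, best_dist = None, float("inf")
    for name, target in enrollment.items():
        dist = cosine_distance(embedding, target)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist > max_distance:
        return None, best_dist
    return best_name, best_dist


def label_windows(window_embeddings, enrollment, max_distance=0.5):
    """Assign an identity (or None) to each sliding-window embedding."""
    return [
        closest_identity(emb, enrollment, max_distance)[0]
        for emb in window_embeddings
    ]


# Toy enrollment targets; real embeddings would come from the trained networks.
enrollment = {"anna": [1.0, 0.0], "marc": [0.0, 1.0]}
speech_labels = label_windows(
    [[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]], enrollment
)
face_label, _ = closest_identity(
    average_track([[0.9, 0.1], [1.1, -0.1]]), enrollment
)
```

In this sketch the same matching function serves both modalities: speech windows are scored one by one, while a face track is first reduced to a single averaged vector and then scored once.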
Citation: India, M. [et al.]. UPC multimodal speaker diarization system for the 2018 Albayzin challenge. In: International Conference on Advances in Speech and Language Technologies for Iberian Languages. "IberSPEECH 2018: program and proceedings: 21-23 November 2018: Barcelona, Spain". Baixas: International Speech Communication Association (ISCA), 2018, p. 199-203.