Show simple item record

dc.contributor.author: India Massana, Miquel Àngel
dc.contributor.author: Hernando Pericás, Francisco Javier
dc.contributor.author: Rodríguez Fonollosa, José Adrián
dc.contributor.other: Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions
dc.contributor.other: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.date.accessioned: 2022-10-06T10:54:06Z
dc.date.available: 2022-10-06T10:54:06Z
dc.date.issued: 2023-03
dc.identifier.citation: India, M.; Hernando, J.; Fonollosa, J.A.R. Language modelling for speaker diarization in telephonic interviews. "Computer speech and language", March 2023, vol. 78, article 101441, p. 1-12.
dc.identifier.issn: 0885-2308
dc.identifier.uri: http://hdl.handle.net/2117/374077
dc.description.abstract: The aim of this paper is to investigate the benefit of combining language and acoustic modelling for speaker diarization. Although conventional systems use only acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic features. In this study we analyze how an appropriate fusion of both kinds of features can obtain good results in these cases. The proposed system is based on an iterative algorithm in which an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a call-center database composed of telephone interview audios. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks.
dc.description.sponsorship: This work was partially supported by the Spanish Project DeepVoice (TEC2015-69266-P) and by the project PID2019-107579RB-I00/AEI/10.13039/501100011033.
dc.format.extent: 12 p.
dc.language.iso: eng
dc.publisher: Elsevier
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: Àrees temàtiques de la UPC::Enginyeria de la telecomunicació::Processament del senyal::Processament de la parla i del senyal acústic
dc.subject.lcsh: Speech processing systems
dc.subject.lcsh: Neural networks (Computer science)
dc.subject.other: Speaker diarization
dc.subject.other: Language modelling
dc.subject.other: Acoustic modelling
dc.subject.other: LSTM neural networks
dc.title: Language modelling for speaker diarization in telephonic interviews
dc.type: Article
dc.subject.lemac: Processament de la parla
dc.subject.lemac: Xarxes neuronals (Informàtica)
dc.contributor.group: Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
dc.identifier.doi: 10.1016/j.csl.2022.101441
dc.description.peerreviewed: Peer Reviewed
dc.relation.publisherversion: https://www.sciencedirect.com/science/article/pii/S0885230822000651
dc.rights.access: Open Access
local.identifier.drac: 34244864
dc.description.version: Postprint (published version)
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO//TEC2015-69266-P/ES/TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO/
dc.relation.projectid: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-107579RB-I00/ES/ARQUITECTURAS AVANZADAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ, AUDIO Y LENGUAJE/
local.citation.author: India, M.; Hernando, J.; Fonollosa, José A. R.
local.citation.publicationName: Computer speech and language
local.citation.volume: 78
local.citation.number: article 101441
local.citation.startingPage: 1
local.citation.endingPage: 12
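The abstract describes an iterative algorithm that fuses a linguistic speaker classifier with a GMM acoustic score re-estimated from the labels of the previous iteration. The following is a minimal, self-contained sketch of that loop only: the LSTM over character-level word embeddings is replaced by an uninformative placeholder score, the GMM is reduced to one diagonal Gaussian per speaker, and all data, names, and parameters are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 2 speakers, 40 "turns", each turn a 2-D acoustic feature vector
# (a stand-in for real acoustic features; the two clusters are well separated).
feats = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
                   rng.normal(3.0, 0.5, (20, 2))])

# Start from an imperfect diarization, as an acoustic baseline would provide:
# the true labels with 30% of the turns flipped.
true = np.repeat([0, 1], 20)
flip = rng.random(40) < 0.3
labels = np.where(flip, 1 - true, true)

def gmm_scores(feats, labels):
    """Per-speaker Gaussian log-likelihood of each turn, with one diagonal
    Gaussian per speaker re-estimated from the current label assignment."""
    scores = np.zeros((len(feats), 2))
    for spk in (0, 1):
        sel = feats[labels == spk]
        if len(sel) == 0:                      # degenerate: speaker vanished
            scores[:, spk] = -np.inf
            continue
        mu, var = sel.mean(0), sel.var(0) + 1e-6
        scores[:, spk] = -0.5 * (((feats - mu) ** 2 / var) + np.log(var)).sum(1)
    return scores

# Iterative refinement: rescore with the re-estimated GMM, fuse with the
# linguistic score, relabel, repeat until the assignment stops changing.
for it in range(10):
    acoustic = gmm_scores(feats, labels)
    # In the paper this is where the LSTM fed with character-level word
    # embeddings would contribute; here it is a zero (uninformative) score.
    linguistic = np.zeros_like(acoustic)
    new_labels = (acoustic + linguistic).argmax(1)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels
```

On this separable toy data the loop converges in a few iterations and recovers the two clusters; in the paper's setting, the linguistic score is what rescues cases where the acoustic evidence alone is unreliable.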


Files in this item


This item appears in the following collection(s)
