Detection and handling of overlapping speech for speaker diarization

Zelenák, Martin

doi:10.5821/dissertation-2117-94515

dc.contributor	Hernando Pericás, Francisco Javier
dc.contributor.author	Zelenák, Martin
dc.contributor.other	Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.date.accessioned	2012-02-15T09:05:01Z
dc.date.available	2012-02-15T09:05:01Z
dc.date.issued	2012-01-31
dc.identifier.citation	Zelenák, M. Detection and handling of overlapping speech for speaker diarization. Doctoral thesis, UPC, Departament de Teoria del Senyal i Comunicacions, 2012. DOI 10.5821/dissertation-2117-94515.
dc.identifier.uri	http://hdl.handle.net/2117/94515
dc.description.abstract	For the last several years, speaker diarization has attracted substantial research attention as one of the spoken language technologies applied to improve, or enrich, recording transcriptions. Compared to other domains, recordings of meetings exhibit increased complexity due to the spontaneity of speech, reverberation effects, and the presence of overlapping speech. Overlapping speech refers to situations in which two or more speakers are speaking simultaneously. In meeting data, a substantial portion of the errors of conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in the training data can lead to corrupted single-speaker models and thus to a worse segmentation. This thesis concerns the detection of overlapping speech segments and its further application to the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or a multi-layer perceptron. In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps. First, a ranking according to the minimum-redundancy maximum-relevance (mRMR) criterion is obtained; then, a standard hill-climbing wrapper approach is applied to determine the optimal number of features. The novel spatial and prosodic parameters are used in combination with spectral-based features suggested previously in the literature.
In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site. In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments are also discarded from model training. The proposed overlap labeling technique is integrated into Viterbi decoding, a part of the diarization algorithm. During system development, it was found favorable to optimize overlap exclusion and overlap labeling independently with respect to the overlap detection system. We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show an improvement in diarization error rate (DER) on the RT '09 meeting recordings as well. Adding beamforming and a TDOA feature stream to the baseline diarization system, which was aimed at improving the clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the overlap exclusion behavior reveals large differences in improvement between individual meeting recordings as well as between various settings of the overlap detection operating point. However, high performance variability across recordings is also typical of the baseline diarization system without any overlap handling.
dc.format.extent	154 p.
dc.language.iso	eng
dc.publisher	Universitat Politècnica de Catalunya
dc.rights	WARNING. Access to the contents of this doctoral thesis and its use must respect the rights of the author. It may be used for consultation or personal study, as well as in research and teaching activities or materials, under the terms established in Art. 32 of the Revised Text of the Intellectual Property Law (RDL 1/1996). For other uses, the prior and express authorization of the author is required. In any case, when using its contents, the author's full name and the title of the doctoral thesis must be clearly indicated. Reproduction or other forms of exploitation for profit are not authorized, nor is public communication from a site outside the TDX service. Presentation of its content in a window or frame outside TDX (framing) is also not authorized. This reservation of rights covers both the contents of the thesis and its abstracts and indexes.
dc.source	TDX (Tesis Doctorals en Xarxa)
dc.subject	UPC subject areas::Telecommunications engineering
dc.subject.other	Overlapping speech detection
dc.subject.other	Speaker overlap
dc.subject.other	Speaker diarization
dc.subject.other	Spatial features
dc.subject.other	Cross-correlation
dc.subject.other	Prosody
dc.title	Detection and handling of overlapping speech for speaker diarization
dc.type	Doctoral thesis
dc.subject.lemac	Speech processing
dc.identifier.doi	10.5821/dissertation-2117-94515
dc.identifier.dl	B. 8273-2012
dc.rights.access	Open Access
dc.description.version	Postprint (published version)
dc.identifier.tdx	http://hdl.handle.net/10803/72431
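
Illustrative note on the abstract: the overlap handling it describes (a second speaker label for segments with detected overlap, and exclusion of those segments from model training) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the thesis integrates labeling into Viterbi decoding, whereas this sketch simply ranks hypothetical per-segment speaker scores. The names `label_segments`, `training_segments`, `scores`, and `overlap` are invented for illustration.

```python
from typing import Dict, List, Tuple


def label_segments(scores: List[Dict[str, float]],
                   overlap: List[bool]) -> List[Tuple[str, ...]]:
    """Assign one speaker per segment, or two where overlap was detected.

    `scores` holds, per segment, a hypothetical score for each speaker
    (higher is better). `overlap` holds the overlap detector's decision
    for the same segments. Both are assumptions of this sketch.
    """
    labels = []
    for seg_scores, is_overlap in zip(scores, overlap):
        # Rank speakers from best to worst score for this segment.
        ranked = sorted(seg_scores, key=seg_scores.get, reverse=True)
        # Overlap labeling: keep the second-best speaker as well.
        n_labels = 2 if is_overlap and len(ranked) > 1 else 1
        labels.append(tuple(ranked[:n_labels]))
    return labels


def training_segments(overlap: List[bool]) -> List[int]:
    """Overlap exclusion: indices of segments kept for model training."""
    return [i for i, is_overlap in enumerate(overlap) if not is_overlap]


# Toy data, invented for illustration only.
scores = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.6, "B": 0.5, "C": 0.1},
    {"B": 0.8, "A": 0.3},
]
overlap = [False, True, False]

labels = label_segments(scores, overlap)
kept = training_segments(overlap)
```

In this toy case the middle segment receives two labels and is left out of training, while the other two segments keep a single label and remain available for training. The abstract notes that exclusion and labeling were best optimized independently, so a fuller sketch would take a separate overlap decision for each step.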

Files in this item

Name: TMZ1de1.pdf
Size: 1.226 MB
Format: PDF

This item appears in the following collections

Departament de Teoria del Senyal i Comunicacions [346]
All theses [5,446]

UPCommons. UPC's open knowledge portal
