Detection and handling of overlapping speech for speaker diarization
10.5821/dissertation-2117-94515
Cite as:
hdl:2117/94515
Chair / Department / Institute
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
Document type: Thesis
Defense date: 2012-01-31
Publisher: Universitat Politècnica de Catalunya
Access conditions: Open access
All rights reserved. This work is protected by the corresponding intellectual and industrial
property rights. Without prejudice to existing legal exemptions, its reproduction, distribution,
public communication or transformation without the authorization of the rights holder is prohibited.
Abstract
For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken
language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings,
compared to other domains, exhibit an increased complexity due to the spontaneity of speech, reverberation effects, and also
due to the presence of overlapping speech.
Overlapping speech refers to situations in which two or more speakers speak simultaneously. In meeting data, a
substantial portion of the errors made by conventional speaker diarization systems can be ascribed to speaker overlaps, since usually
only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can
corrupt single-speaker models and thus lead to a worse segmentation.
This thesis concerns the detection of overlapping speech segments and their further application for the improvement of speaker
diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on
distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component
analysis, linear discriminant analysis, or by a multi-layer perceptron.
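As a rough illustration of the kind of processing described above, the following sketch computes two cross-correlation features for one microphone pair and then fuses features from several pairs with PCA. The abstract does not name the three spatial parameters, so the PHAT weighting, the two features shown (peak value and peak lag), and the function names `gcc_phat`, `pair_features` and `pca_fuse` are assumptions made for illustration only, not the method of the thesis.

```python
import numpy as np


def gcc_phat(x, y, max_lag):
    """PHAT-weighted cross-correlation of two equally long frames.

    Returns (lags, cc) for lags from -max_lag to +max_lag samples.
    A positive peak lag means that x is a delayed copy of y.
    """
    n = len(x) + len(y)  # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that the output runs from lag -max_lag to lag +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc


def pair_features(x, y, max_lag=16):
    """Illustrative per-frame features: correlation peak value and its lag."""
    lags, cc = gcc_phat(x, y, max_lag)
    peak = int(np.argmax(cc))
    return np.array([cc[peak], lags[peak]], dtype=float)


def pca_fuse(features, n_components):
    """Project stacked pair features (n_frames, n_dims) onto principal axes."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

In use, `pair_features` would be called once per frame and per microphone pair, the results concatenated into one row per frame, and `pca_fuse` applied to the resulting matrix. LDA or a multi-layer perceptron could take the place of `pca_fuse` in the same position.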
In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset from a set
of candidate prosodic features is determined in two steps. First, a ranking according to the mRMR (minimum redundancy, maximum relevance) criterion is obtained, and then
a standard hill-climbing wrapper approach is applied in order to determine the optimal number of features.
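The two-step selection can be sketched as follows. This is a minimal sketch under stated assumptions: mutual information is estimated with a simple histogram, the wrapper uses a nearest-centroid classifier as a stand-in for the actual overlap detector, and the hill climbing stops at the first feature count that fails to improve held-out accuracy. All function names are hypothetical, and none of these choices is confirmed by the abstract.

```python
import numpy as np


def mutual_info(a, b, bins=8):
    """Histogram estimate of the mutual information (nats) of two 1-D arrays."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def mrmr_rank(X, y, bins=8):
    """Greedy mRMR ranking: relevance to y minus mean redundancy with chosen features."""
    n_feat = X.shape[1]
    relevance = np.array([mutual_info(X[:, j], y, bins) for j in range(n_feat)])
    ranked = [int(np.argmax(relevance))]
    while len(ranked) < n_feat:
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in ranked:
                continue
            redundancy = np.mean(
                [mutual_info(X[:, j], X[:, s], bins) for s in ranked]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        ranked.append(best)
    return ranked


def centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Accuracy of a nearest-centroid classifier (stand-in for the real detector)."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((X_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(classes[np.argmin(dists, axis=1)] == y_te))


def hill_climb_count(X_tr, y_tr, X_te, y_te, ranked):
    """Add features in ranked order; stop at the first step without improvement."""
    best_k, best_acc = 0, -np.inf
    for k in range(1, len(ranked) + 1):
        cols = ranked[:k]
        acc = centroid_accuracy(X_tr[:, cols], y_tr, X_te[:, cols], y_te)
        if acc <= best_acc:
            break
        best_k, best_acc = k, acc
    return best_k, best_acc
```

Ranking first and wrapping second keeps the expensive step cheap: the classifier is trained at most once per candidate feature count rather than once per candidate subset.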
The novel spatial as well as prosodic parameters are used in combination with spectral-based features suggested previously in
the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the
detection of overlapping speech, especially on data originating from a single recording site.
In speaker diarization, segments with detected speaker overlap are assigned a second speaker label, and such segments
are also excluded from model training. The proposed overlap labeling technique is integrated into Viterbi decoding, a part of
the diarization algorithm. During system development, we found that it is favorable to optimize overlap exclusion and
overlap labeling independently with respect to the overlap detection system.
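The two uses of the overlap detector output can be illustrated with a frame-level simplification. The sketch below does not reproduce the Viterbi integration described above: it assumes a hypothetical matrix of per-frame speaker log-likelihoods and a boolean overlap mask, and simply takes the second most likely speaker as the additional label. The function names are illustrative.

```python
import numpy as np


def label_with_overlap(loglik, overlap):
    """Assign one speaker per frame, or two where overlap was detected.

    loglik:  array (n_frames, n_speakers), n_speakers >= 2 (assumed input)
    overlap: boolean array (n_frames,), output of an overlap detector
    Returns a list of tuples of speaker indices, one tuple per frame.
    """
    order = np.argsort(-loglik, axis=1)  # speakers sorted by decreasing score
    first, second = order[:, 0], order[:, 1]
    return [
        (int(a), int(b)) if ov else (int(a),)
        for a, b, ov in zip(first, second, overlap)
    ]


def exclude_overlap(features, overlap):
    """Drop detected overlap frames before (re)training speaker models."""
    return features[~np.asarray(overlap, dtype=bool)]
```

Because the two functions take separate masks, the detector can run at a different operating point for exclusion than for labeling, which corresponds to the independent optimization mentioned above.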
We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments
with NIST RT data show DER improvement on the RT '09 meeting recordings as well.
Adding beamforming and a TDOA feature stream to the baseline diarization system, which was aimed at improving the
clustering process, slightly increases the effectiveness of the overlap labeling algorithm. A more detailed analysis of the
overlap exclusion behavior reveals large differences in improvement between individual meeting recordings as well as between
various settings of the overlap detection operating point. However, high performance variability across different recordings is
also typical of the baseline diarization system without any overlap handling.
Citation: Zelenák, M. Detection and handling of overlapping speech for speaker diarization. Doctoral thesis, UPC, Departament de Teoria del Senyal i Comunicacions, 2012. DOI 10.5821/dissertation-2117-94515. Available at: <http://hdl.handle.net/2117/94515>
Legal deposit: B. 8273-2012
Files | Description | Size | Format | View |
---|---|---|---|---|
TMZ1de1.pdf | | 1,226Mb | | View/Open |