|dc.contributor.author||Hernanz Nogueras, Sergi
|dc.description||Final degree project carried out in collaboration with the University of Southern California
|dc.description.abstract||This project reviews speaker recognition. The first
simulations use the latest state-of-the-art algorithms; later ones introduce new
approaches and numerous modifications. Multimodality is the main idea for
achieving better results. The new multimodal data supplied to the speaker
recognition system are articulatory features and combined video and voice source
localization in the meeting room scenario. Some articulatory features have not
been widely used for speech analysis, so suitable extraction methods have not yet
been developed. On the other hand, voice source and video spatial localization
algorithms are known, and only the integration methods have to be defined.
A theoretical review and a study of integration precede the final
selection of an algorithm.
Machine learning techniques are applied to extract articulatory features, and
they classify surprisingly accurately. The usefulness of the feature extractor
outputs for speaker recognition is less clear, but important
conclusions are drawn about how the extraction process can affect their subsequent
use and how other extraction methods could be approached.
During the work, articulatory features prove to be less affected by noise
than the baseline MFCC+GMM approach, but suitable extraction methods
are still not available. Even with the baseline MLP-based extraction
methods, classification using the articulatory features is possible, and
complementarity with the baseline methods is demonstrated. The improvement
of the whole system from adding articulatory features is very small, but it
demonstrates their usability. The whole articulatory feature integration process
can be revisited, with successful results expected in the future.
Thanks to an extended analysis of how noise corrupts the speech features,
specific conclusions are drawn about noise rejection and its effects. Plots of
system performance under different SNR conditions explain the behavior of some
methods. In low SNR conditions, very simple changes to the
algorithms improve the overall performance and reveal the lack of noise-oriented
design in the baseline.
Most of the methods explored in this work were finally applied to
the meeting room scenario at USC. An encouraging but small performance
increase was achieved, so the aim of the work was considered
fulfilled. The trade-off between the effort spent and the small improvement is to
be reviewed in further work.
|dc.publisher||Universitat Politècnica de Catalunya
|dc.rights||Attribution-NonCommercial-NoDerivs 3.0 Spain
|dc.subject||UPC subject areas::Telecommunications engineering::Signal processing::Speech and acoustic signal processing
|dc.subject.lcsh||Speech processing systems
|dc.title||Robust feature extraction for multimodal speaker ID system – The experts’ room
|dc.type||Master thesis (pre-Bologna period)
|dc.subject.lemac||Speech processing
|dc.audience.educationlevel||First/second cycle studies
|dc.audience.mediator||Escola Tècnica Superior d'Enginyeria de Telecomunicació de Barcelona
|dc.audience.degree||TELECOMMUNICATIONS ENGINEERING (1992 curriculum)