dc.contributor.author: Cardoso Duarte, Amanda
dc.contributor.author: Roldan, Francisco
dc.contributor.author: Tubau, Miquel
dc.contributor.author: Escur, Janna
dc.contributor.author: Pascual de la Puente, Santiago
dc.contributor.author: Salvador Aguilera, Amaia
dc.contributor.author: Mohedano, Eva
dc.contributor.author: McGuinness, Kevin
dc.contributor.author: Torres Viñals, Jordi
dc.contributor.author: Giró Nieto, Xavier
dc.contributor.other: Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions
dc.contributor.other: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.contributor.other: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.date.accessioned: 2019-07-30T07:53:22Z
dc.date.issued: 2019
dc.identifier.citation: Cardoso, A. [et al.]. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In: IEEE International Conference on Acoustics, Speech, and Signal Processing. "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing: proceedings: May 12-17, 2019: Brighton Conference Centre, Brighton, United Kingdom". Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 8633-8637.
dc.identifier.isbn: 978-1-4799-8131-1
dc.identifier.other: https://imatge.upc.edu/web/publications/wav2pix-speech-conditioned-face-generation-using-generative-adversarial-networks
dc.identifier.uri: http://hdl.handle.net/2117/167073
dc.description.abstract: Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) with raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g., a reference image or one-hot encoding). Our model is trained in a self-supervised manner by exploiting the audio and visual signals naturally aligned in videos. To enable training from video data, we present a novel dataset collected for this work, with high-quality videos of YouTubers with notable expressiveness in both the speech and visual signals.
dc.format.extent: 5 p.
dc.language.iso: eng
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.subject: Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic
dc.subject: Àrees temàtiques de la UPC::Enginyeria de la telecomunicació::Processament del senyal::Reconeixement de formes
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Computer vision
dc.subject.other: Face
dc.subject.other: Videos
dc.subject.other: Generators
dc.subject.other: Visualization
dc.subject.other: Feature extraction
dc.subject.other: Generative adversarial networks
dc.subject.other: Deep learning
dc.subject.other: Adversarial learning
dc.subject.other: Face synthesis
dc.subject.other: Computer vision
dc.title: Wav2Pix: speech-conditioned face generation using generative adversarial networks
dc.type: Conference lecture
dc.subject.lemac: Aprenentatge automàtic
dc.subject.lemac: Visió per ordinador
dc.contributor.group: Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
dc.contributor.group: Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
dc.contributor.group: Universitat Politècnica de Catalunya. GPI - Grup de Processament d'Imatge i Vídeo
dc.identifier.doi: 10.1109/ICASSP.2019.8682970
dc.description.peerreviewed: Peer Reviewed
dc.relation.publisherversion: https://ieeexplore.ieee.org/document/8682970
dc.rights.access: Restricted access - publisher's policy
local.identifier.drac: 25136752
dc.description.version: Postprint (published version)
dc.relation.projectid: info:eu-repo/grantAgreement/EC/H2020/713673/EU/Innovative doctoral programme for talented early-stage researchers in Spanish host organisations excellent in the areas of Science, Technology, Engineering and Mathematics (STEM)./INPhINIT
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO/1PE/TEC2015-69266-P
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO/1PE/TEC2016-75976-R
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO/1PE/TIN2015-65316-P
dc.date.lift: 10000-01-01
local.citation.author: Cardoso, A.; Roldan, F.; Tubau, M.; Escur, J.; Pascual, S.; Salvador, A.; Mohedano, E.; McGuinness, K.; Torres, J.; Giro, X.
local.citation.contributor: IEEE International Conference on Acoustics, Speech, and Signal Processing
local.citation.publicationName: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing: proceedings: May 12-17, 2019: Brighton Conference Centre, Brighton, United Kingdom
local.citation.startingPage: 8633
local.citation.endingPage: 8637


All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder.