Wav2Pix: speech-conditioned face generation using generative adversarial networks
dc.contributor.author | Cardoso Duarte, Amanda |
dc.contributor.author | Roldan, Francisco |
dc.contributor.author | Tubau, Miquel |
dc.contributor.author | Escur, Janna |
dc.contributor.author | Pascual de la Puente, Santiago |
dc.contributor.author | Salvador Aguilera, Amaia |
dc.contributor.author | Mohedano, Eva |
dc.contributor.author | McGuinness, Kevin |
dc.contributor.author | Torres Viñals, Jordi |
dc.contributor.author | Giró Nieto, Xavier |
dc.contributor.other | Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions |
dc.contributor.other | Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors |
dc.contributor.other | Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions |
dc.date.accessioned | 2019-07-30T07:53:22Z |
dc.date.issued | 2019 |
dc.identifier.citation | Cardoso, A. [et al.]. Wav2Pix: speech-conditioned face generation using generative adversarial networks. A: IEEE International Conference on Acoustics, Speech, and Signal Processing. "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing: proceedings: May 12-17, 2019: Brighton Conference Centre, Brighton, United Kingdom". Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 8633-8637. |
dc.identifier.isbn | 978-1-4799-8131-1 |
dc.identifier.other | https://imatge.upc.edu/web/publications/wav2pix-speech-conditioned-face-generation-using-generative-adversarial-networks |
dc.identifier.uri | http://hdl.handle.net/2117/167073 |
dc.description.abstract | Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) on raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g. a reference image or a one-hot encoding). Our model is trained in a self-supervised manner by exploiting the audio and visual signals naturally aligned in videos. To train from video data, we present a novel dataset collected for this work, with high-quality videos of youtubers with notable expressiveness in both the speech and visual signals. |
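The abstract describes the pipeline only at a high level (a speech encoder feeding a GAN generator trained end-to-end). The sketch below is a minimal, illustrative speech-conditioned generator in PyTorch; all module names, layer sizes, the 4 s / 16 kHz clip length and the embedding dimensions are assumptions made for clarity, not the authors' Wav2Pix implementation.

```python
# Minimal sketch of a speech-conditioned GAN generator (illustrative only).
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Strided 1-D convolutions mapping a raw waveform to a fixed-size embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),            # collapse the time axis
        )
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, wav):                     # wav: (B, 1, T) raw samples
        h = self.net(wav).squeeze(-1)           # (B, 128)
        return self.proj(h)                     # (B, emb_dim)

class Generator(nn.Module):
    """Upsamples the speech embedding (plus noise) to a 64x64 RGB face."""
    def __init__(self, emb_dim=128, noise_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(emb_dim + noise_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),  nn.BatchNorm2d(64),  nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),   nn.BatchNorm2d(32),  nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),    nn.Tanh(),
        )

    def forward(self, speech_emb, noise):
        z = torch.cat([speech_emb, noise], dim=1)[:, :, None, None]
        return self.net(z)                      # (B, 3, 64, 64)

# Usage: one forward pass with a 4-second clip at 16 kHz (values assumed).
enc, gen = SpeechEncoder(), Generator()
wav = torch.randn(2, 1, 4 * 16000)
fake_faces = gen(enc(wav), torch.randn(2, 100))
print(fake_faces.shape)                         # torch.Size([2, 3, 64, 64])
```

In the self-supervised setup described in the abstract, the "real" samples for the adversarial loss would be the video frames naturally paired with each audio clip, so no manual identity labels are needed; a conditioned discriminator (not shown here) would score image-speech pairs.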
dc.format.extent | 5 p. |
dc.language.iso | eng |
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) |
dc.subject | Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic |
dc.subject | Àrees temàtiques de la UPC::Enginyeria de la telecomunicació::Processament del senyal::Reconeixement de formes |
dc.subject.lcsh | Machine learning |
dc.subject.lcsh | Computer vision |
dc.subject.other | Face |
dc.subject.other | Videos |
dc.subject.other | Generators |
dc.subject.other | Visualization |
dc.subject.other | Feature extraction |
dc.subject.other | Generative adversarial networks |
dc.subject.other | Deep learning |
dc.subject.other | Adversarial learning |
dc.subject.other | Face synthesis |
dc.subject.other | Computer vision |
dc.title | Wav2Pix: speech-conditioned face generation using generative adversarial networks |
dc.type | Conference lecture |
dc.subject.lemac | Aprenentatge automàtic |
dc.subject.lemac | Visió per ordinador |
dc.contributor.group | Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla |
dc.contributor.group | Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions |
dc.contributor.group | Universitat Politècnica de Catalunya. GPI - Grup de Processament d'Imatge i Vídeo |
dc.identifier.doi | 10.1109/ICASSP.2019.8682970 |
dc.description.peerreviewed | Peer Reviewed |
dc.relation.publisherversion | https://ieeexplore.ieee.org/document/8682970 |
dc.rights.access | Restricted access - publisher's policy |
local.identifier.drac | 25136752 |
dc.description.version | Postprint (published version) |
dc.relation.projectid | info:eu-repo/grantAgreement/EC/H2020/713673/EU/Innovative doctoral programme for talented early-stage researchers in Spanish host organisations excellent in the areas of Science, Technology, Engineering and Mathematics (STEM)./INPhINIT |
dc.relation.projectid | info:eu-repo/grantAgreement/MINECO/1PE/TEC2015-69266-P |
dc.relation.projectid | info:eu-repo/grantAgreement/MINECO/1PE/TEC2016-75976-R |
dc.relation.projectid | info:eu-repo/grantAgreement/MINECO/1PE/TIN2015-65316-P |
dc.date.lift | 10000-01-01 |
local.citation.author | Cardoso, A.; Roldan, F.; Tubau, M.; Escur, J.; Pascual, S.; Salvador, A.; Mohedano, E.; McGuinness, K.; Torres, J.; Giro, X. |
local.citation.contributor | IEEE International Conference on Acoustics, Speech, and Signal Processing |
local.citation.publicationName | 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing: proceedings: May 12-17, 2019: Brighton Conference Centre, Brighton, United Kingdom |
local.citation.startingPage | 8633 |
local.citation.endingPage | 8637 |
All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to any existing legal exemptions, reproduction, distribution, public communication or transformation of this work are prohibited without permission of the copyright holder.