Show simple item record

dc.contributor.author: Cardoso Duarte, Amanda
dc.contributor.author: Roldan, Francisco
dc.contributor.author: Tubau, Miquel
dc.contributor.author: Escur, Janna
dc.contributor.author: Pascual de la Puente, Santiago
dc.contributor.author: Salvador Aguilera, Amaia
dc.contributor.author: Mohedano, Eva
dc.contributor.author: McGuinness, Kevin
dc.contributor.author: Torres Viñals, Jordi
dc.contributor.author: Giró Nieto, Xavier
dc.contributor.other: Universitat Politècnica de Catalunya. Doctorat en Teoria del Senyal i Comunicacions
dc.contributor.other: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.contributor.other: Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.date.accessioned: 2019-07-30T07:53:22Z
dc.date.issued: 2019
dc.identifier.citation: Cardoso, A. [et al.]. Wav2Pix: speech-conditioned face generation using generative adversarial networks. A: IEEE International Conference on Acoustics, Speech, and Signal Processing. "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing: proceedings: May 12-17, 2019: Brighton Conference Centre, Brighton, United Kingdom". Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 8633-8637.
dc.identifier.isbn: 978-1-4799-8131-1
dc.identifier.other: https://imatge.upc.edu/web/publications/wav2pix-speech-conditioned-face-generation-using-generative-adversarial-networks
dc.identifier.uri: http://hdl.handle.net/2117/167073
dc.description.abstract: Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) with raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g. a reference image or a one-hot encoding). Our model is trained in a self-supervised manner by exploiting the audio and visual signals naturally aligned in videos. To support training from video data, we present a novel dataset collected for this work, with high-quality videos of YouTubers with notable expressiveness in both the speech and visual signals. (A minimal sketch of the conditioning setup appears after this record.)
dc.format.extent: 5 p.
dc.language.iso: eng
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE)
dc.subject: Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic
dc.subject: Àrees temàtiques de la UPC::Enginyeria de la telecomunicació::Processament del senyal::Reconeixement de formes
dc.subject.lcsh: Machine learning
dc.subject.lcsh: Computer vision
dc.subject.other: Face
dc.subject.other: Videos
dc.subject.other: Generators
dc.subject.other: Visualization
dc.subject.other: Feature extraction
dc.subject.other: Generative adversarial networks
dc.subject.other: Deep learning
dc.subject.other: Adversarial learning
dc.subject.other: Face synthesis
dc.subject.other: Computer vision
dc.title: Wav2Pix: speech-conditioned face generation using generative adversarial networks
dc.type: Conference lecture
dc.subject.lemac: Aprenentatge automàtic
dc.subject.lemac: Visió per ordinador
dc.contributor.group: Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla
dc.contributor.group: Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
dc.contributor.group: Universitat Politècnica de Catalunya. GPI - Grup de Processament d'Imatge i Vídeo
dc.identifier.doi: 10.1109/ICASSP.2019.8682970
dc.description.peerreviewed: Peer Reviewed
dc.relation.publisherversion: https://ieeexplore.ieee.org/document/8682970
dc.rights.access: Restricted access - publisher's policy
local.identifier.drac: 25136752
dc.description.version: Postprint (published version)
dc.relation.projectid: info:eu-repo/grantAgreement/EC/H2020/713673/EU/Innovative doctoral programme for talented early-stage researchers in Spanish host organisations excellent in the areas of Science, Technology, Engineering and Mathematics (STEM)./INPhINIT
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO//TEC2015-69266-P/ES/TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO/
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO/1PE/TEC2016-75976-R
dc.relation.projectid: info:eu-repo/grantAgreement/MINECO//TIN2015-65316-P/ES/COMPUTACION DE ALTAS PRESTACIONES VII/
dc.date.lift: 10000-01-01
local.citation.author: Cardoso, A.; Roldan, F.; Tubau, M.; Escur, J.; Pascual, S.; Salvador, A.; Mohedano, E.; McGuinness, K.; Torres, J.; Giro, X.
local.citation.contributor: IEEE International Conference on Acoustics, Speech, and Signal Processing
local.citation.publicationName: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing: proceedings: May 12-17, 2019: Brighton Conference Centre, Brighton, United Kingdom
local.citation.startingPage: 8633
local.citation.endingPage: 8637
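
The abstract describes the core mechanism: an encoder embeds the raw speech waveform, and that embedding (concatenated with noise) conditions a GAN generator that outputs a face image. The PyTorch sketch below illustrates that setup only; it is not the authors' Wav2Pix implementation, and all module names, layer sizes, the 64x64 output resolution, and the 16 kHz input length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Embed a raw waveform with strided 1-D convolutions (illustrative sizes)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)  # collapse the time axis
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, wav):                  # wav: (B, 1, T) raw samples
        h = self.pool(self.conv(wav)).squeeze(-1)
        return self.fc(h)                    # (B, embed_dim)

class Generator(nn.Module):
    """Upsample [noise ; speech embedding] to a 64x64 RGB face."""
    def __init__(self, z_dim=100, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + embed_dim, 256, 4, 1, 0),  # 1x1 -> 4x4
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1),                # 4x4 -> 8x8
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),                 # 8x8 -> 16x16
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1),                  # 16x16 -> 32x32
            nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),                   # 32x32 -> 64x64
            nn.Tanh(),
        )

    def forward(self, z, speech_embed):
        cond = torch.cat([z, speech_embed], dim=1)  # (B, z_dim + embed_dim)
        return self.net(cond[:, :, None, None])     # (B, 3, 64, 64)

# A conditional discriminator (omitted here) would score (image, embedding)
# pairs, so frames paired with their own audio count as real and
# mismatched or generated pairs count as fake.
enc, gen = SpeechEncoder(), Generator()
wav = torch.randn(4, 1, 16000)      # ~1 s of 16 kHz audio (assumed rate)
faces = gen(torch.randn(4, 100), enc(wav))
print(faces.shape)                  # torch.Size([4, 3, 64, 64])
```

Note how the self-supervised training described in the abstract needs no labels: each video frame and the audio surrounding it form a naturally aligned positive pair, which is exactly the conditioning signal the discriminator checks.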

