Wav2Pix: speech-conditioned face generation using generative adversarial networks

Cardoso Duarte, Amanda; Roldan, Francisco; Tubau, Miquel; Escur, Janna; Pascual de la Puente, Santiago; Salvador Aguilera, Amaia; Mohedano, Eva; McGuinness, Kevin; Torres Viñals, Jordi; Giró Nieto, Xavier

doi:10.1109/ICASSP.2019.8682970

Document type: Conference paper

Publication date: 2019

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Access conditions: Restricted access per publisher policy

All rights reserved. This work is protected by the corresponding intellectual and industrial property rights. Without prejudice to existing legal exemptions, its reproduction, distribution, public communication or transformation is prohibited without the authorization of the rights holder.

Projects: INPhINIT - Innovative doctoral programme for talented early-stage researchers in Spanish host organisations excellent in the areas of Science, Technology, Engineering and Mathematics (STEM). (EC-H2020-713673)
TECNOLOGIAS DE APRENDIZAJE PROFUNDO APLICADAS AL PROCESADO DE VOZ Y AUDIO (MINECO-TEC2015-69266-P)
COMPUTACION DE ALTAS PRESTACIONES VII (MINECO-TIN2015-65316-P)

Abstract

Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) on raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g., a reference image or one-hot encoding). Our model is trained in a self-supervised manner by exploiting the audio and visual signals naturally aligned in videos. To train from video data, we present a novel dataset collected for this work, with high-quality videos of YouTubers with notable expressiveness in both the speech and visual signals.

Citation: Cardoso, A. [et al.]. Wav2Pix: speech-conditioned face generation using generative adversarial networks. In: IEEE International Conference on Acoustics, Speech, and Signal Processing. "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing: proceedings: May 12-17, 2019: Brighton Conference Centre, Brighton, United Kingdom". Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 8633-8637.

URI: http://hdl.handle.net/2117/167073

DOI: 10.1109/ICASSP.2019.8682970

ISBN: 978-1-4799-8131-1

Publisher's version: https://ieeexplore.ieee.org/document/8682970

Other identifiers: https://imatge.upc.edu/web/publications/wav2pix-speech-conditioned-face-generation-using-generative-adversarial-networks

Files	Description	Size	Format	View
1903.10195.pdf	Preprint	4.422 MB	PDF	Restricted access