Audio-visual deep learning

Nicolai Pedersen

Everyday speech perception relies not only on hearing but also on seeing a talker’s face. In busy multi-talker scenes, hearing-impaired individuals in particular rely on visual cues when listening to speech. Deep learning algorithms are evolving rapidly and are used to extract high-level speech features for robust automatic speech recognition. This project uses multi-modal deep neural networks to extract and investigate audio-visual speech features. In the first part of the project, neural network algorithms are used to associate real-time audio streams with facial features in multi-talker audio-visual scenarios. In the second part, the project focuses on using a video input to assist audio source separation with deep neural networks. Such networks exploit correlations between higher-level visual and auditory features in speech to extract and enhance the sounds of individual speakers in competing-talker situations.
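To illustrate the idea of visually assisted source separation, the sketch below shows one common pattern: per-frame audio and visual features are concatenated and mapped to a time-frequency mask that scales the mixture spectrogram. All dimensions, features, and weights here are hypothetical placeholders (random values, a single linear layer), not the project's actual architecture; a real system would learn the mapping from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration
T, F, V = 100, 257, 64   # time frames, frequency bins, visual feature size

# Stand-ins for real features: mixture spectrogram magnitudes and
# per-video-frame lip-region embeddings (random here)
audio_feats = rng.random((T, F))
visual_feats = rng.random((T, V))

# Fuse the two modalities by concatenating features frame by frame
fused = np.concatenate([audio_feats, visual_feats], axis=1)   # (T, F + V)

# A single hypothetical linear layer + sigmoid predicts a mask in (0, 1);
# a trained network would replace these random weights
W = rng.standard_normal((F + V, F)) * 0.01
b = np.zeros(F)
mask = 1.0 / (1.0 + np.exp(-(fused @ W + b)))

# The mask scales the mixture spectrogram to enhance one target speaker
enhanced = mask * audio_feats
print(enhanced.shape)   # (100, 257)
```

The design choice shown here, predicting a multiplicative time-frequency mask rather than the clean signal directly, is a widely used formulation in speech separation because it keeps the output bounded and tied to the observed mixture.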

Supervisor: Torsten Dau
Co-supervisors: Jens Hjortkjær (DTU Elektro) and Lars Kai Hansen (DTU Compute)

DTU Orbit


Nicolai Pedersen, PhD student, DTU Health Tech