Time-frequency masking based online speech enhancement with multi-channel data using convolutional neural networks

Soumitro Chakrabarty, DeLiang Wang and Emanuël Habets

Published in the Proc. of International Workshop on Acoustic Signal Enhancement (IWAENC), 2018.


Speech enhancement in noisy and reverberant conditions remains a challenging task. In this work, a time-frequency masking based method for speech enhancement with multi-channel data using a convolutional neural network (CNN) is proposed, where the CNN is trained to estimate the ideal ratio mask by discriminating a directional speech source from diffuse or spatially uncorrelated noise. The proposed method operates frame-by-frame on the magnitude and phase components of the short-time Fourier transform coefficients of all frequency sub-bands and microphones. Since neither temporal context nor explicit feature extraction is required, the method is suitable for online implementation. In contrast to most speech enhancement methods that utilize multi-channel data, the proposed method does not require information about the spatial position of the desired speech source. Through experimental evaluations with both simulated and real data, we show the robustness of the proposed method to unseen acoustic conditions as well as varying noise levels.
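The frame-wise processing described above can be sketched as follows. The trained CNN itself is not reproduced here, so `mask_estimator` stands in as a placeholder callable, and all function names are illustrative, not from the paper:

```python
import numpy as np

def stft_frame(frames, window):
    """Windowed FFT of one time-domain frame per microphone."""
    return np.fft.rfft(frames * window, axis=-1)

def make_input_features(X):
    """Stack magnitude and phase of all channels, mirroring the
    paper's input representation (M mics -> 2*M feature maps)."""
    return np.concatenate([np.abs(X), np.angle(X)], axis=0)

def enhance_frame(frames, window, mask_estimator, ref_ch=0):
    """Enhance one multi-channel frame.
    frames: (M, N) time-domain samples for M microphones.
    mask_estimator: callable mapping features -> ratio mask in [0, 1]
    (a placeholder for the trained CNN)."""
    X = stft_frame(frames, window)          # (M, N//2+1) STFT bins
    feats = make_input_features(X)          # (2*M, N//2+1)
    mask = np.clip(mask_estimator(feats), 0.0, 1.0)
    # Apply the estimated ratio mask to the reference channel's
    # magnitude; the noisy phase is kept for reconstruction.
    S_hat = mask * np.abs(X[ref_ch]) * np.exp(1j * np.angle(X[ref_ch]))
    return np.fft.irfft(S_hat) * window     # synthesis-windowed frame

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, N = 4, 512                           # 4 mics, 512-sample frames
    frames = rng.standard_normal((M, N))
    window = np.hanning(N)
    # Dummy all-pass "estimator" just to exercise the pipeline.
    out = enhance_frame(frames, window, lambda f: np.ones(f.shape[-1]))
```

Because each frame is processed independently, the same function can be called on a live stream of incoming frames, which is what makes the approach amenable to online operation.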

Sound Example - Simulated Setup

To highlight the differences in performance between the methods, this part presents audio examples for the simulated setup in two noisy scenarios: in the first, the babble noise level is high, whereas in the second, the microphone self-noise is high. Both examples were simulated in the same room.

Experimental Setup:

  • Room size: 9 m x 7 m x 3 m
  • Source-microphone distance: 1.7 m
  • Reverberation time: 0.7 s
  • Source DOA: 20 degrees

  Scenario 1 (high babble noise):

  • Babble noise with iSNR = -6 dB
  • Microphone self-noise with iSNR = 40 dB

  Scenario 2 (high microphone self-noise):

  • Babble noise with iSNR = 6 dB
  • Microphone self-noise with iSNR = 10 dB
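The input SNR (iSNR) values above fix the speech-to-noise power ratio at the microphones. A minimal sketch of scaling a noise signal to reach a target iSNR before mixing (the function name is illustrative, not from the paper):

```python
import numpy as np

def mix_at_isnr(speech, noise, isnr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `isnr_db`, then return the noisy mixture. `speech` and `noise`
    are 1-D arrays of equal length."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain g solving 10*log10(p_speech / (g^2 * p_noise)) = isnr_db.
    g = np.sqrt(p_speech / (p_noise * 10.0 ** (isnr_db / 10.0)))
    return speech + g * noise

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    s = rng.standard_normal(16000)   # 1 s of stand-in "speech"
    n = rng.standard_normal(16000)   # stand-in noise
    y = mix_at_isnr(s, n, -6.0)      # the high babble-noise case above
```

An iSNR of -6 dB means the noise power exceeds the speech power by a factor of about four, which is why the high-babble scenario is the more adverse one.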

Sound Example - Measured RIRs

The measured RIRs were taken from the Multi-channel Impulse Response Database, recorded in an acoustic lab at Bar-Ilan University.

Experimental Setup:

  • Room size: 6 m x 6 m x 2.4 m
  • Source-microphone distance: 2 m
  • Babble noise with iSNR = 0 dB
  • Microphone Self-noise with iSNR = 10 dB
  • Source DOA = 105 degrees

Two reverberation time conditions are considered:

  • Reverberation time: 0.360 s
  • Reverberation time: 0.610 s
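Reverberant microphone signals are obtained by convolving dry speech with the measured RIRs. A minimal sketch, using a synthetic exponentially decaying impulse response as a stand-in for the database RIRs (which are not included here):

```python
import numpy as np

def reverberate(speech, rir):
    """Convolve dry speech with a room impulse response,
    truncated to the length of the speech signal."""
    return np.convolve(speech, rir)[: len(speech)]

if __name__ == "__main__":
    fs = 16000
    t60 = 0.36                       # one of the conditions above
    t = np.arange(int(fs * t60)) / fs
    rng = np.random.default_rng(2)
    # Stand-in RIR: white noise with a -60 dB decay at t = t60;
    # in the experiments, measured RIRs replace this.
    rir = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / t60)
    dry = rng.standard_normal(fs)    # 1 s of stand-in "speech"
    wet = reverberate(dry, rir)
```

For a multi-channel simulation, the same dry signal would be convolved with one measured RIR per microphone, yielding the array input that the enhancement method operates on.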