Soumitro Chakrabarty, DeLiang Wang and Emanuël Habets
Published in the Proc. of International Workshop on Acoustic Signal Enhancement (IWAENC), 2018.
Speech enhancement in noisy and reverberant conditions remains a challenging task. In this work, a time-frequency masking based method for multi-channel speech enhancement using convolutional neural networks (CNNs) is proposed, where the CNN is trained to estimate the ideal ratio mask by discriminating the directional speech source from diffuse or spatially uncorrelated noise. The proposed method operates frame-by-frame on the magnitude and phase components of the short-time Fourier transform coefficients of all frequency sub-bands and microphones. Because it requires neither temporal context nor explicit feature extraction, the proposed method is suitable for online implementation. In contrast to most speech enhancement methods that utilize multi-channel data, the proposed method does not require information about the spatial position of the desired speech source. Through experimental evaluation with both simulated and real data, we show the robustness of the proposed method to unseen acoustic conditions as well as varying noise levels.
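To make the masking pipeline in the abstract concrete, below is a minimal NumPy sketch of ideal-ratio-mask enhancement. It is illustrative only: the trained CNN mask estimator from the paper is not shown (an oracle mask computed from clean and noise spectra stands in for it), and all function names, the STFT parameters, and the feature-stacking layout are assumptions, not the authors' implementation.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Simplified single-channel STFT with a Hann window.

    Frame length and hop are illustrative choices, not the paper's settings.
    Returns an array of shape (n_frames, n_bins).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def ideal_ratio_mask(clean_stft, noise_stft):
    """Oracle IRM: the training target the CNN learns to estimate.

    Per time-frequency bin: sqrt(|S|^2 / (|S|^2 + |N|^2)), in [0, 1].
    """
    s2 = np.abs(clean_stft) ** 2
    n2 = np.abs(noise_stft) ** 2
    return np.sqrt(s2 / (s2 + n2 + 1e-12))

def apply_irm(noisy_stft, mask):
    """Enhance by scaling the noisy STFT magnitude per bin; the noisy
    phase is kept, as is usual for ratio-mask enhancement."""
    return mask * noisy_stft

def frame_features(multichannel_stft_frame):
    """Hypothetical per-frame CNN input, following the abstract's description:
    magnitude and phase of all microphones and sub-bands, stacked along a
    leading channel axis. Shape: (2, n_mics, n_bins)."""
    mag = np.abs(multichannel_stft_frame)
    phase = np.angle(multichannel_stft_frame)
    return np.stack([mag, phase])
```

At inference time the CNN would replace `ideal_ratio_mask`, consuming `frame_features` of the current frame only, which is what makes the method online-capable: no look-ahead or temporal context is needed.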
To highlight how the performance of the methods differs across scenarios, this part presents audio examples for the simulated setup in two noisy scenarios: in the first, the babble noise level is high; in the second, the microphone self-noise is high. Both examples were recorded in the same room.
Source DOA: 20 degrees
The measured RIRs were taken from the Multi-channel Impulse Response Database, recorded in an acoustic lab at Bar-Ilan University.