Maja Taseska and Emanuel A. P. Habets
Published in the IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 25, Issue 11, pp. 2223 - 2236, Nov. 2017.
Noise power spectral density (PSD) matrix estimation is one of the most important components of a multi-channel blind speech extraction framework, as it directly determines the extracted signal quality at the output of a spatial filter. Optimality of well-known spatial filters, such as the multichannel Wiener filter, is only ensured if the PSD matrix estimates are accurate. In practical situations, where the noise is non-stationary, temporal averaging over time frames where the desired signal is inactive does not provide sufficiently fast tracking of the noise PSD matrix, resulting in high residual noise at the spatial filter output. Therefore, approaches that estimate the PSD matrices using narrowband signal detection have been proposed.
We focus on speech presence probability (SPP)-based PSD matrix estimation, following the well-known single- and multi-channel minima-controlled recursive averaging (MCRA). The SPP-based approach is suitable for blind scenarios where the location and the propagation vector of the desired speech source are unknown. The main contributions of the paper are a maximum likelihood interpretation of the multi-channel MCRA and a signal-to-diffuse ratio-based a priori SPP estimator. The latter is a key parameter that determines the accuracy of the noise PSD matrix estimates in non-stationary scenarios. In this work, we confirm the importance of the a priori SPP and show that its control is crucial for source extraction in non-stationary environments.
The experiments used simulated room impulse responses (RIRs), generated with the RIR generator of Habets [3] for a room with dimensions 7.5 m × 4.5 m × 3 m.
A uniform linear array with four microphones and inter-microphone distance of 5 cm was used.
Diffuse sound, generated using the method of Habets and Gannot [4], and uncorrelated Gaussian noise were added. The signal-to-noise ratio with respect to the sensor noise was fixed to 35 dB.
The signals were sampled at 16 kHz, segmented by 64 ms Hamming windows with 50 % overlap and transformed to the STFT domain using the fast Fourier transform.
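The sensor-noise level and the STFT parameters stated above can be sketched in a few lines of NumPy. This is only an illustration of the stated settings, not the authors' code: the function names `add_sensor_noise` and `stft`, the 440 Hz placeholder tone, and the definition of the SNR as mean signal power over mean noise power across all channels are assumptions made for the example.

```python
import numpy as np

fs = 16000                       # sampling rate in Hz (from the text)
frame_len = 64 * fs // 1000      # 64 ms window -> 1024 samples
hop = frame_len // 2             # 50 % overlap -> 512 samples
window = np.hamming(frame_len)   # Hamming analysis window


def add_sensor_noise(clean, snr_db, rng):
    """Add uncorrelated white Gaussian noise to every channel.

    clean has shape (num_mics, num_samples). The SNR is defined here as
    mean signal power over mean noise power (an assumption of this sketch).
    """
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / 10.0 ** (snr_db / 10.0)
    noise = np.sqrt(noise_power) * rng.standard_normal(clean.shape)
    return clean + noise, noise


def stft(x):
    """Return the one-sided STFT of x, shape (num_frames, frame_len // 2 + 1)."""
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] for i in range(num_frames)]
    )
    return np.fft.rfft(frames * window, axis=1)


rng = np.random.default_rng(0)
t = np.arange(fs) / fs
# Placeholder input: the same 1 s tone on four channels (hypothetical data).
clean = np.tile(np.sin(2 * np.pi * 440 * t), (4, 1))
noisy, noise = add_sensor_noise(clean, 35.0, rng)
measured_snr_db = 10 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

# One STFT per microphone: shape (num_mics, num_frames, num_bins).
Y = np.stack([stft(channel) for channel in noisy])
```

With these settings each frame holds 1024 samples and yields 513 frequency bins, so one PSD matrix of size 4 × 4 is tracked per bin and frame.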
The audio examples below compare the overall quality of the extracted source signal when applying data-dependent spatial filters computed using the SPP-based PSD matrix estimators discussed in the paper. The accuracy of the estimated SPP is the key factor determining the final quality of the extracted signals. Based on a Gaussian signal model, the expression for the a posteriori SPP given the microphone signals follows from basic probability axioms. In practice, however, obtaining an accurate SPP estimate is challenging because it requires estimates of both the noise PSD matrix and the a priori SPP. As discussed in the paper, as well as in many recent works on noise PSD estimation, the a priori SPP is crucial for robust estimation of the a posteriori SPP in non-stationary environments. Hence, the frameworks compared in the paper (and in the audio examples below) differ in how they estimate the a priori SPP. We compare the following approaches:
SC-Cohen: Following the original MCRA approach of Cohen [1], we compute the a priori SPP using a single microphone, and use it within the multichannel framework for noise PSD matrix estimation and source extraction. To implement this framework, we used the code made available online by the author of [1].
Maximum likelihood (ML): The ML approach provides a specific expression of the a priori SPP from a properly defined ML optimization problem, as derived in the paper.
MC-Souden: In the multi-channel extension of the single-channel MCRA by Souden et al. [2], the authors propose a multichannel a priori SPP estimator based on instantaneous and time-averaged signal-to-noise ratio estimates. We implemented the framework following the description in [2].
Further details and parameters associated with the algorithms and their implementation are provided in the paper.
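To illustrate how the a priori SPP enters the estimation, the following is a minimal sketch of a Gaussian-model a posteriori SPP for one time-frequency bin, followed by an SPP-controlled recursive update of the noise PSD matrix. It shows the general multichannel MCRA-type recursion under stated assumptions and is not the exact implementation of any of the three approaches above. The function names, the smoothing constant `alpha = 0.9`, the fixed a priori SPP of 0.5, and the toy PSD matrices are hypothetical; the compared approaches differ precisely in how they obtain the a priori SPP, which is simply passed in here.

```python
import numpy as np


def posterior_spp(y, phi_y, phi_v, prior_spp):
    """A posteriori SPP for one time-frequency bin under a Gaussian model.

    y: microphone signal vector, phi_y and phi_v: PSD matrices of the
    microphone signals and the noise, prior_spp: a priori SPP in (0, 1).
    """
    prior_spp = np.clip(prior_spp, 1e-6, 1.0 - 1e-6)
    phi_v_inv = np.linalg.inv(phi_v)
    phi_x = phi_y - phi_v                     # speech PSD matrix estimate
    # Clamp at zero, since phi_y - phi_v need not be positive semidefinite.
    xi = max(np.real(np.trace(phi_v_inv @ phi_x)), 0.0)
    beta = max(np.real(np.conj(y) @ phi_v_inv @ phi_x @ phi_v_inv @ y), 0.0)
    ratio = (1.0 - prior_spp) / prior_spp
    return 1.0 / (1.0 + ratio * (1.0 + xi) * np.exp(-beta / (1.0 + xi)))


def update_noise_psd(phi_v, y, spp, alpha=0.9):
    """Recursive noise PSD update that slows down when speech is likely.

    alpha is a hypothetical smoothing constant. For spp = 1 the estimate
    is frozen; for spp = 0 this is plain recursive averaging.
    """
    alpha_eff = alpha + (1.0 - alpha) * spp
    return alpha_eff * phi_v + (1.0 - alpha_eff) * np.outer(y, np.conj(y))


# Toy example with four microphones (all values hypothetical).
d = np.ones(4, dtype=complex)                 # placeholder propagation vector
phi_v = np.eye(4, dtype=complex)              # white noise PSD matrix
phi_y = 10.0 * np.outer(d, np.conj(d)) + phi_v

y_speech = 3.0 * d                            # observation aligned with d
y_noise = np.array([1, -1, 1, -1], dtype=complex)  # orthogonal to d

p_speech = posterior_spp(y_speech, phi_y, phi_v, prior_spp=0.5)
p_noise = posterior_spp(y_noise, phi_y, phi_v, prior_spp=0.5)
phi_v_new = update_noise_psd(phi_v, y_noise, p_noise)
```

In this toy case the observation aligned with the propagation vector yields an SPP close to one, so the noise estimate is left almost untouched, whereas the orthogonal observation yields a low SPP and is averaged into the noise PSD matrix. An a priori SPP that is too high would suppress such updates and slow the tracking of non-stationary noise.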
In this example, stationary noise with a white spectrum was added to the microphone signals.
In this example, diffuse babble noise was added to the signals.
In this example, fan noise was added to the signals (rather diffuse but not ideally diffuse; see the paper for further description).
In this example, both fan noise and ideally diffuse babble noise were added to the signals.
[1] I. Cohen, "Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging," IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466–475, Sep. 2003.
[2] M. Souden, J. Chen, J. Benesty, and S. Affes, "An integrated solution for online multichannel noise tracking and reduction," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2159–2169, Sep. 2011.
[3] E. A. P. Habets, "Room impulse response generator," Tech. Rep., Technische Universiteit Eindhoven, 2006. RIR generator available online at https://www.audiolabs-erlangen.de/fau/professor/habets/software/rir-generator
[4] E. A. P. Habets and S. Gannot, "Generating sensor signals in isotropic noise fields," J. Acoust. Soc. Am., vol. 122, no. 6, pp. 3464–3470, Dec. 2007. Available online at https://github.com/ehabets/INF-Generator