International Audio Laboratories Erlangen

Prof. Dr. ir. Emanuël Habets


Wolfgang Mack


Wolfgang Mack, Adrian Herzog


Wolfgang Mack,
Adrian Herzog

Friedrich-Alexander Universität Erlangen-Nürnberg

International Audio Laboratories Erlangen

Am Wolfsmantel 33, 91058 Erlangen

Python: How to ...

Speech Enhancement Using Microphone Arrays

The goal of this notebook is to gain insights into speech enhancement techniques using microphone arrays. A code-skeletton is provided which you have to complete. First, a scenario overview is given.


This module is designed to give a practical understanding of performing speech enhancement using microphone arrays. It is is closely related to the lecture 'Speech Enhancement' by Prof. Dr. ir. Emanuël Habets.
In this exercise, you have to implement a commonly used spatial signal processing technique known as beamforming, and analyse the performance of two different beamformers, a signal-independent beamformer known as delay-and-sum beamformer (DSB) and a signal-dependent beamformer known as minimum variance distortionless response (MVDR) beamformer, for a noise and interference reduction task. A free-field scenario is considered, i.e., no reflections or shieldings caused by walls or obstacles are considered. The beamformer performances are compared via objective measures to demonstrate their advantages with respect to the specific reduction task.

Part 0: Introduction

Microphone arrays play an important role in applications like hands-free communication in smartphones, smart speakers, and teleconferencing systems. The spatial diversity offered by microphone arrays can be used to extract the speech signal of interest from the microphone signals corrupted by noise, reverberation and interfering sources. A typical method for this is to obtain a sum of the filtered microphone signals such that the signal from the so-called look direction is maintained while signals from other directions are attenuated. Such a method is known as beamforming. This module gives a practical understanding of such methods used to perform speech enhancement. The notebook provides the theoretical background necessary to understand the formulation of the beamforming methods and the practical aspects regarding their implementation. For the experiments in this module, we consider a scenario where a microphone array on a hands-free device captures two sound sources. Additionally, each microphone signal contains white noise which models the self-noise of the microphone. The white noise is assumed to be uncorrelated between the microphones.

Figure 1 - Environment scenario with a Uniform Linear Array (ULA), an interferer, a desired source, N uncorrelated noise sources and the 2-dimensional coordinate system.

As illustrated in Figure 1, one of the sources is the desired source and the other is an interferer. The task in this experiment is to process the microphone signals to enhance the speech of the desired source while suppressing the interfering speech and noise.
The enhancement system for this module is illustrated in Figure 2.

Figure 2 - General enhancement system used in this module with number of microphones N, one-sided number of frequency bins K and number of time-frames M. N, K, and M are matrix dimensions.

The microphone signals are first transformed into the time-frequency domain via short-time Fourier transform (STFT) to obtain the input signals for the beamforming algorithms. An overview of the tasks for implementing the different beamformers in this module is given as follows:

Part 1: Signal and RTF Model for a Uniform Linear Array

Part 1.1: Signal Model

To explain the formulation of the beamforming techniques, we first need to define a signal model.

Figure 3 - Signal model for the scenario depicted in Figure 1.

Consider a discrete time-domain signal model (shown in Figure 3), where $\text{N}$ microphones capture the desired source signal and the interference in presence of additive noise. Then the $n$th microphone signal for $n=1, \ldots,N$ is given by:

\begin{align} y_n(t) &= g_{n,d}(t) * s_{d}(t) + g_{n,i}(t) * s_{i}(t) + v_n(t),\\ &= x_{n,d}(t) + x_{n,i}(t) + v_{n}(t), \tag{1} \label{eq:tdsig} \end{align} where $s_{d}(t)$ is the desired source signal and $s_{i}(t)$ is the interfering speech signal. The acoustic impulse responses between the $n$th microphone and the sources are denoted by $g_{n,d}(t)$ and $g_{n,i}(t)$ for the desired and the interfering source, respectively. The variable $v_{n}(t)$ denotes the additive noise. The desired source signal and the interfering speech signal received at the $n$th microphone are denoted by $x_{n,d}(t)$ and $x_{n,i}(t)$, respectively. The sample-time index is represented by $t$ and $*$ denotes linear convolution. Since the frequency characteristics of a speech signal vary over time, the processing of the received signals is done in the time-frequency domain. The time domain signals are transformed to the time-frequency domain via STFT. The STFT representation of (\ref{eq:tdsig}) is given by:

\begin{align} Y_n(m,k) &= G_{n,d}(m,k) S_{d}(m,k) + G_{n,i}(m,k) S_{i}(m,k) + V_n(m,k), \\ \nonumber &= X_{n,d}(m,k) + \underbrace{X_{n,i}(m,k) + V_n(m,k)}_{U_{n}(m,k)}, \nonumber \tag{2} \end{align}

where the upper-case letters denote the time-frequency domain counterparts of the terms in (\ref{eq:tdsig}) for the $k$th frequency bin, and $m$ denotes the time-frame index. $U_{n}(m,k)$ denotes the undesired signals, which is the sum of the interference and the noise signals. We can express the $N$ STFT domain microphone signals in vector notation as:

\begin{align} \mathbf{y}(m,k) &= \mathbf{g}_{d}(m,k) S_{d}(m,k) + \mathbf{u}(m,k), \\ &= \mathbf{x}_{d}(m,k) + \mathbf{u}(m,k),\nonumber\label{eq:pythagoras} \tag{3} \end{align}

where $\mathbf{y}(m,k) = \left[ Y_1(m,k), Y_2(m,k), \ldots, Y_N(m,k) \right]^T$, and $\mathbf{x}_{d}(m,k)$, $\mathbf{g}_{d}(m,k)$ and $\mathbf{u}(m,k)$ are defined similarly.

We can write the desired source signal vector $\mathbf{x}_{d}(m,k)$ as a function of the received signal at the first microphone:

\begin{equation} \mathbf{x}_{d}(m,k) = \mathbf{d}(m,k) X_{1,d}(m,k). \tag{4} \end{equation}

Using this definition, the microphone signals can also be written as

\begin{equation} \mathbf{y}(m,k) = \mathbf{d}(m,k) X_{1,d}(m,k) + \mathbf{u}(m,k), \tag{5} \end{equation}

where d is known as relative transfer function (RTF) vector

Part 1.2 RTF Model - Free-field Model

For the given signal model, we consider a free-field model where each microphone receives only the direct path signal. The RTF effects are modelled by the RTF vector $\mathbf{d}(m,k)$. The formulation of the RTF vector is described as follows.

The free-field model is illustrated in Figure 1. The vector of acoustic transfer functions for the desired source is denoted by $\mathbf{g}_{d}(k) = \left[ G_{1,d}(k), G_{2,d}(k), \ldots, G_{N,d}(k) \right]^T$. The acoustic transfer function corresponding to the $n$th microphone is given by

\begin{equation} G_{n,d}(k) = A_{n,d}(k)\text{exp} \left(j 2 \pi f_k {\tau_{n,d}} \right), \tag{6} \end{equation}

where $A_{n,d}(k)$ is the attenuation factor for the $n$th microphone due to, e.g., distance attenuation, $f_{k} = \frac{k}{K} f_{s}$ denotes the frequency value corresponding to the $k$th frequency bin, and $\tau_{n,d}$ is the propagation time of the desired source signal to the $n$th microphone. The symbols $f_{s}$ and $K$ denote the sampling frequency and the total number of frequency bins, respectively. The RTF vector for the free-field model is given by:

\begin{align} \mathbf{d}(k) &= \frac{\mathbf{g}_{d}(k)}{G_{1,d}(k)}\\ &= \left[ 1, D_2(k), \ldots, D_N(k) \right]^T , \tag{7}\label{eq:ddef_frfld} \end{align}


\begin{equation} D_n(k) = \frac{G_{n,d}(k)}{G_{1,d}(k)} = \frac{A_{n,d}(k)}{A_{1,d}(k)} \text{exp} \left(j 2 \pi f_k \Delta_{\tau_{n,d}} \right), \tag{8}\label{eq:frfldD} \end{equation}

where $\Delta_{\tau_{n,d}}$ is the time difference of arrival (TDOA) of the desired signal at the $n$th microphone with respect to the 1st microphone. This is the formulation of the RTF vector considered for the delay and sum beamformer, and the MVDR beamformer. In the free-field model, the relative transfer functions are considered to be time-independent.

Part 1.3 Far-Field Model with an Uniform Linear Array

The microphone array is assumed to be sufficiently distant to the desired source and to be sufficiently small to approximate the circular sound waves at the microphones as plane wave as illustrated in Figure 4a.

Figure 4 - Far Field Model (a) and Uniform Linear Array (ULA)

The microphone array in this module is an Uniform Linear Array (ULA) as illustrated in Figure 4b. N microphones are equidistantly stacked in a row with inter-microphone distance d. We assign indices in a rising order to the microphones starting with 1 (the furthest at the left) which we refer to as reference microphone.
Further calculations require an orientation of the ULA in the 2-dimensional vector-space defined in Figure 1. For this purpose, we define the unit-norm vector ULA_direction to point from microphone 1 to N in the coordinates of the defined vector-space.
Figure 5 shows a summary of a wave traveling from a source to a ULA with regard to our assumptions (Far-Field, Free-Field, small ULA).

Figure 5 - Summary of a ULA in a Free-Field and Far-Field Model

Video 1 - A Micophone Array in Near-Field (orange circles) and one in Far-Field (black circles) Distance to a Sound Source

Video 2 - A Plane Sine Wave Reaching Microphones (Spheres) as it is Assumed in the Far-Field

We define the angle theta in Figure 5 as direction of arrival (DOA) of the desired source. It is the angle between ULA_direction and the normal vector of the arriving wave front, denoted by $source_{dir}$ pointing in the source direction. According to the plane wave model, the DOA is equal for all microphones in the ULA. However, in reality, the incoming wave is not exactly plane but slightly curved because the distance between the desired source and the ULA is finite. To reduce the error coming from the model, the DOA is defined with respect to the ULA center. In practice the DOA is calculated as angle between the vector pointing from the ULA center to the source position and ULA_direction.

Given the DOA, the TDOA and the RTF vector defined in (8) can be calculated. The scenario allows to compute the TDOA with trigonometric functions (sine, cosine) given the DOA, the inter-microphone distance d, and the speed of sound c. The TDOA is used later to construct the RTF vector and eventually to compute the beamformer weights.

Homework Exercise 1 - 1.1 Derive a formula for the `TDOA` between the first and the second microphone based on `theta` and the inter-microphone distance `d`. Calculate the `TDOA` between the first and the third microphone for `c`$ =343 \frac{m}{s}$, `d`$=0.05 m$ and `theta`$ = \frac{pi}{4}$.
Hint: Look at the triangle with the angle `theta` in Figure 5. Can you construct a similar one with `d` and `theta`? - 1.2 How is the `TDOA` between the microphones one and two related to the TDOA between the microphones 1 and 3 or 1 and `N`? Derive the relation.
Homework Exercise 2 Given the desired source position source_pos, the ULA center position `ULA_center`, and `ULA_direction`, derive a formula for `theta` according to Figure 5. Remember, `theta` is equal for all microphones in the ULA.
Lab Experiment 1 Implement compute_doa(...). Input parameters:
`source_pos`: numpy array with the X and Y value of the desired source position
`ULA_center`: numpy array with the X and Y value of the ULA center
`ULA_direction`: unit-norm numpy array with 2 elements defining the endfire ULA orientation (left to right)
Output parameters:
`theta`:scalar angle between the vector pointing from the ULA center to the desired source and `ULA_direction`
Test 1 Test compute_doa(...): Is `theta` 45 degree for: - `source_pos` = np.array([5,5]), - `ULA_center` = np.array([0,0]) - `ULA_direction` = np.array([1,0])
Lab Experiment 2 Compute the time difference of arrival (`TDOA`) of the plane wave front between two microphones (the reference microphone with `TDOA` 0 is the left one) given `theta`, the inter-microphone distance `d`, and speed of sound `c`. Input parameters: - `theta`: scalar in radian - `d`: scalar microphone distance - `c`: scalar speed of sound Output parameters: `TDOA`: scalar time difference of arrival between two neighboring microphones Remember: The `TDOA` of each microphone is with respect to microphone 1 according to our definition (see Figure 4 and 5)
Test 2 Test compute_tdoa(...): Is `TDOA` =$-1$ for: - `theta` = $\frac{\pi}{4}$, - `d` = 0.343 - `c` = 343 For which `theta` do you expect `TDOA` to be 0? Try it!
Lab Experiment 3 Given the `TDOA` between two neighboring microphones, compute the `TDOA` of all microphones to the first microphone given a ULA (same distance between all microphones in the array, hence, same relative `TDOA` between neighboring microphones). Input parameters: - `TDOA`: time difference of arrival of a plane wave (from left to right) - `N`: total number of microphones in the ULA Output parameters: - `tdoas`: array of `TDOA`s of all `N` microphones with respect to the first one. The first value is 0 as the `TDOA` from the first to the first microphone is 0. The shape is (N,).
Test 3 Test compute_tdoa_list(...): Is `tdoa_list` =$[0.0,0.1,0.2,0.3]$ for: - `tdoa` = 0.1 - `N` = 4, i.e., a microphone array of four microphones and a time difference of arrival between neighboring microphones of 0.1 seconds

Part 2 Beamforming

2.1 How to Apply Beamformers

Our aim is to obtain an estimate of the desired source signal at the first microphone, i.e., ${X}_{1,d}(m,k)$.

Figure 6 - General block diagram for a beamformer.

This is done by applying a spatial filter to the observed microphone signals and summing across the array elements (shown in Figure 6) and is given by:

\begin{equation} \widehat{X}_{1,d}(m,k) = \mathbf{h}^{H}(m,k) \mathbf{y}(m,k), \nonumber \tag{9} \end{equation}

where $\mathbf{h}(m,k) = \left[ H_{1}^{*}(m,k), H_{2}^{*}(m,k), \ldots, H_{N}^{*}(m,k) \right]^{T}$ is a filter of length $N$ and $(\cdot)^{\text{H}}$ denotes the conjugate transpose or the Hermitian of a vector/matrix. Now, with the given framework the aim is to develop an analytic expression for the filter $\mathbf{h}(m,k)$, which is given by the different beamforming techniques explained in the Parts 2.3 and 2.5.

2.2 Relative Transfer Function Vector

The RTF vector $\mathbf{d}(k)$ is required to compute the beamformer filters in this module.

Homework Exercise 3
Using the expressions for $\mathbf{d}(k)$ and its elements, given in Eq. (7) and Eq. (8), write a simplified expression for $\mathbf{d}(k)$. Assume an attenuation free scenario.

For sources with a fixed position, $\mathbf{d}(k)$ is frequency dependent but time independent.

Lab Experiment 4 Given `tdoas`, the free-field RTF vector can be computed for all frequencies according to the equation derived in the previous Homework. Implement compute_RTF_for_freq(`frequencies` , `tdoas`) . Input parameters: - `frequencies`: frequencies to compute the RTF vector for
`tdoas`: array of the time of arrival differences (scalars) of a plane wave from source direction at the microphones Output parameters: - `RTF_vec`: array containing the phase shifts for each microphone with respect to the first microphone for all frequencies (so the shape is (K,N))
Test 4 Test compute_RTF_for_freq(frequencies, tdoa_list): Is `RTF_vec`.shape =$(1,4)$ in the next cell ?
Homework Exercise 4
Draw a flow chart with the steps to compute the RTF vector.
Lab Experiment 5 Use the flow chart of Homework Exercise 4 and the functions you implemented in the previous Lab Experiments to program compute_RTF_vector(`source_pos`, `ULA_center`, `ULA_direction`, `frequencies`, `d`, `c`, `N`). Input parameters: - See documentation above Output parameters: - `RTF_vecs`: array containing all the RTF vectors for all frequencies. It is of shape (K,N).

Part 2.3 Delay and Sum Beamformer

The delay and sum beamformer (DSB) is a fixed beamformer, i.e., the parameters of the beamformer are time-invariant and signal independent. They depend on the desired source and the ULA center positions. As the name suggests, this beamformer works by delaying the signals from certain microphones and then summing them afterwards. To explain this further, consider an array of $N$ microphones. When the microphone array picks up a signal coming from an angle other than 90 degrees, every consecutive microphone receives the signal with positive or negative delay. This is because the signal entering from an angle needs to travel an additional distance to the next microphone in the array. This fact is exploited here to obtain a constructive interference in terms of the desired signal and a destructive interference in terms of the signals from noise or interfering sources.

To obtain a constructive interference in terms of the desired source, the filter needs to satisfy

\begin{equation} \mathbf{h}^{H}(k) \mathbf{d}(k) = 1, \label{eq:DSBcon}\tag{10} \end{equation}

where $\mathbf{d}(k)$ is the RTF vector for the free-field model, given in Part 1.2.

Given the RTF vector from Part 2.2, the next task is to apply the DSB filter. Using the condition given in Equation \ref{eq:DSBcon}, the DSB filter is given by:

\begin{equation}\label{eq:appdsb} \mathbf{h}(k) = \frac{1}{N}\mathbf{d}(k). \tag{11} \end{equation}

To explain $\mathbf{h}(k)$ further, consider a single frequency k as plane wave which is captured by a ULA as in Figure 7.

Figure 7 - A single sine captured by a ULA

The ULA spatially samples the sine synchronously with N microphones yielding points with different phase offsets with respect to the first microphone due to the spatial diversity of the microphones. All phase offsets are multiples of the first phase offset between the first and the second microphone because of the plane wave assumption. These phase offsets are related to the `TDOA` and can be translated into the RTF vector $\mathbf{d}(k)$.
The general idea of the DSB is to correct the phase offset (delay the signal) by multiplying the microphone spectra with complex exponential functions (the DSB weights) to obtain constructive interference for the desired signal in the consecutive averaging step over the microphone spectra.

Lab Experiment 6 Implement the function compute_dsb_weight(`RTF_vecs`) which computes the time and signal invariant DSB weight-matrix. Input parameters: - `RTF_vecs`: array of the RTF vector of each frequency. The array is of shape (`K`,`N`). Output parameters: - `dsb_weights`: matrix of the dsb weights of each frequency with the shape (`K`,`N`)
Lab Experiment 7 Implement the function apply_dsb_weights(`dsb_weights`, `y`) which applies the dsb weight matrix to the microphone spectra and returns the enhanced spectrum. Input parameters: - `dsb_weights`: matrix of the dsb weights for each frequency with the shape (`1`,`K`, `N`) - `y`: matrix of the spectra of all `N` microphones of shape (`N`, `K`, `M`) where `M` is the total number of time frames Output parameters: - `X_1_hat`: complex matrix of a single enhanced spectrum of shape (`K`, `M`) Note: A common error is when X_1_hat.dtype is float and not complex!

You implemented all relevant functions for the DSB. To compare different beamformer parameters, a class environment is defined. We store all relevant parameter information in it (source position, ULA position, ...). You do not have to change anything here.

Now we define a class Beamformer which contains an environment and a weights matrix. The class DSB extends the class Beamformer and contains functions to compute the beamformer weights and to apply them. You already implemented those.

Three environments are created with different parameters to investigate their impact on the performance. We create three DSBs and assign the environments to them. Finally, we compute the DSB weight matrices.

Some functions to visualize the DSB power patterns are already implemented. Run the cell below and inspect the power patterns of the DSBs.







Lab Experiment 8 Look at the power patterns. Is the main lobe narrower for a large or small `d`? How does this affect your decision of Homework 5? Could you build a better beamformer if you had more microphones? In general: If the array apertur is large, is the main lobe broader or narrower? Describe the trade off between a large array apertur and a small `d` with the results of the module experiments.

Part 2.4 Spatial Aliasing

Spatial aliasing is the result of insufficient spatial sampling of a sound field. It produces angular ambiguities of a beamformer for specific frequencies and relative angles between two DOAs which makes it impossible for the beamformer to perform speech enhancement.
Subsequently, we describe angular ambiguities resulting from the ULA symmetry and spatial aliasing.
Beamformers amplify or attenuate signals dependent on the frequency and the DOA.
Not spatial aliasing:
Given a ULA, the power pattern is symmetric to the ULA axis which results in angular ambiguities, i.e., for ULA beamforming, there is a front-back ambiguity. Beamforming cannot attenuate the interferer if it is placed symmetrical to the desired source w.r.t. the ULA axis. This effect, however, is independent of the wavelength and inter-microphone distance.

Spatial aliasing:
For frequencies above a given threshold, \begin{equation} \frac{\lambda}{2}\leq d,\tag{12} \end{equation} with the wavelength $\lambda$, there are additional frequency dependent angular ambiguities which are called spatial aliases. Please notice the similarity to the Nyquist theorem for sampling in time domain. Consequently, in spatial domain the theorem is called spatial Nyquist theorem.
Subsequently, we derive spatial aliasing for ULAs and also calculate the angular position of the aliases.
For a given wave length, two DOAs, $\theta_1$ and $\theta_2$, cannot be distinguished if

\begin{equation} \frac{|d \cdot cos(\theta_1) - d \cdot cos(\theta_2)|}{\lambda} = n \text{ } \forall \text{ }n \text{ }\in \mathbb{N}^+.\tag{13} \end{equation}

Recapitulate Figure 5 and the calculation of the TDOA. $d \cdot cos(\theta)$ yields the additional distance an incoming plane wave has to travel from microphone to microphone. If the distance difference of two DOAs is a multiple of the wave length, then the beamformer cannot distinguish the DOAs for the specific frequency as they have the same phase offset when wrapped to the range $0$ and $2\pi$. Hence, spatial aliasing occurs. Furthermore, the higher the frequency, the more wave lengths fit into the difference. As a result, the number of spatial aliases increases for higher frequencies. Please consider the ULA symmetry. The difference between $\theta_1$ and $\theta_2$ cannot be greater than $\pi$.

Homework Exercise 5 Spatial aliasing occurs if $\frac{\lambda}{2} \leq d$. What to you conclude from this result. Would you select a small or a large `d` to avoid spatial aliasing?
Homework Exercise 6
At which wave length appears the second spatial alias if the first one appears at $\lambda = 10 m$. Remember: The first alias appears if the distance difference the two incoming plane waves have to travel in Equation 13 is equal to the wave length.

Please contact a tutor after you finished this part
Afterwards, listen to some examples :)

Results DSB0

N= 4

Results DSB1

N = 4
d = 0.1

Results DSB2

N = 8

Lab Experiment 9
Listen to the results above and inspect the spectrograms. To reduce white noise, averaging the microphone signals has the same effect as the DSB. Think about the disadvantage of this method and why the DSB performs better (also for which frequencies and which DOA of the desired source).
Lab Experiment 10
The next step is to add an interferer. Who is the interferer, the man or the woman? Listen to the subsequent example where a DSB focuses on one source. Maybe a better beamformer is required?
Homework Exercise 6
Draw a flow chart with the steps to use the DSB. Include the indices of all required formulas for each step.

Part 2.5 Minimum variance distortionless response (MVDR) beamformer

The next task is the implementation of the MVDR beamformer. The MVDR beamformer is a signal dependent beamformer i.e. the filter coefficients for the MVDR beamformer depend on the statistical properties of the received signals. The aim of the MVDR beamformer is to minimize the power of the undesired signal components at the output while ensuring that the desired signal is not distorted. Mathematically, this can be formulated as

\begin{equation} \label{eq:mvdropt} \mathbf{h}_{\text{MVDR}}(m,k) = \operatorname*{arg\,min}_{\mathbf{h}} \mathbf{h}^{H}(m,k) \mathbf{\Phi}_{\mathbf{u}}(m,k) \mathbf{h}(m,k) \; \; \text{subject to} \; \; \mathbf{h}^{H}(m,k) \mathbf{d}(m,k) = 1 , \tag{14} \end{equation}

where the constraint in the second part ensures a distortionless response. The PSD of the undesired (interfering speech + noise) signals is denoted by $\mathbf{\Phi}_{\mathbf{u}}(m,k)$ and given by:

\begin{equation} \label{eq:psddef} \mathbf{\Phi}_{\mathbf{u}}(m,k) = \mathcal{E}\{ \mathbf{u}(m,k) \mathbf{u}^{H}(m,k) \}.\tag{15} \end{equation}

As can be seen Equation 14 is a constrained optimization problem, which can be solved using Lagrange multipliers. The obtained solution is given by:

\begin{equation} \label{eq:mvdr} \mathbf{h}_{\text{MVDR}}(m,k) = \frac{\mathbf{\Phi}_{\mathbf{u}}^{-1}(m,k) \mathbf{d}(m,k)}{\mathbf{d}^{H}(m,k) \mathbf{\Phi}_{\mathbf{u}}^{-1}(m,k) \mathbf{d}(m,k)} .\tag{16} \end{equation}

Given this formulation, it can be seen that the estimate of the PSD matrix for the undesired (interfering speech + noise) signals is required to obtain the MVDR filter.

Homework Exercise 7 Derive the expression of the MVDR beamformer if $\mathbf{\Phi_u}(m,k) = \mathbf{\Phi_n}(m,k)$, where $\mathbf{\Phi_n}(m,k)$ is the noise PSD matrix${}^\ast$. For the derivation, assume that the noise signals at each microphone are spatially uncorrelated, i.e., $\mathbf{\Phi_n}(m,k) = \sigma_n^2(m,k) \mathbf{I}$, denoting $\mathbf{I}$ the identity matrix. Compare the obtained expression to that of the DSB beamformer. Which beamformer is better for white noise reduction?
Lab Experiment 11
Implement the function compute_mvdr_weight(`RTF_vecs`, `Phi_u`) which computes the time and signal dependent MVDR weight-matrix. Input parameters: - `RTF_vecs`: array of the RTF vector of each frequency of shape (`K`,`N`) - `Phi_u`: Array of shape (`M`, `K`, `N`, `N`), with the number of time frames `M`, the number of frequency bins `K`, the number of microphones `N`. For each time frame `m` and each frequency `k`, there is a `N`$\times$`N` PSD matrix. Output parameters: - `h`: matrix of the mvdr weights of each frequency with the shape (`M`, `K`, `N`) Hint: Inspect the implementation of the weights for the DSB. You can use the function np.linalg.inv(PSD) to inverse a single PSD matrix.

Power spectral density (PSD) matrix computation

In this module, the task is to implement a general recursive PSD estimation method. This function is later called to estimate the PSD matrices of the desired speech (required later) and undesired (interfering speech + noise) signals. The theoretical formulation of the recursive estimation method can be given by: \begin{equation}\label{eq:psdest} \widehat{\mathbf{\Phi}}(m,k) = \mathcal{I}(m,k) [ \alpha \widehat{\mathbf{\Phi}}(m-1,k) + (1-\alpha) \mathbf{y}(m,k) \mathbf{y}^{H}(m,k)] + (1 - \mathcal{I}(m,k)) \widehat{\mathbf{\Phi}}(m-1,k),\tag{17} \end{equation} where $\mathcal{I}(m,k)$ is the indicator parameter mask that is used to determine at which time frequency points the relevant PSD matrix needs to be updated. The mask is different for the desired and undesired signals. If at a certain time-frequency point, the indicator parameter for the desired speech signal is $1$, then the desired speech signal PSD is updated, otherwise if the indicator parameter for the undesired signals is $1$ then the undesired signal PSD is updated. This indicator parameter (mask) is computed using an oracle mechanism and is not a part of the exercise. For practical implementation, Equation \ref{eq:psdest} can be simplified as:

\begin{equation} \label{eq:mainpsdest} \widehat{\mathbf{\Phi}}(m,k) = \alpha'(m,k) \widehat{\mathbf{\Phi}}(m-1,k) + (1-\alpha'(m,k)) \mathbf{y}(m,k) \mathbf{y}^{H}(m,k),\tag{18} \end{equation}

where the modified update factor $\alpha'(m,k)$ is given by:

\begin{equation} \label{eq:modalpha} \alpha'(m,k) = \alpha +(1 - \mathcal{I}(m,k))(1 - \alpha).\tag{19} \end{equation}
Lab Experiment 12
Implement the function compute_alpha_dash(`alpha`, `mask`). Input parameters: - `alpha`: scalar as defined in the text above. - `mask`: matrix indicator whether the signals are desired or undesired. It is of shape (`M`,`K`) Output parameters: - `alpha_dash`: vector as specified in Eq. 19 of shape (`M`,`K`,`1`,`1`).
Lab Experiment 13
Implement the function compute_psd(`old_psd`,`local_psd`,`alpha_dash`). Input parameters: - `old_psd`: matrix of shape (K,N,N) of time frame m-1. - `local_psd`: matrix of shape (K,N,N). - `alpha_dash`: matrix of shape (K,1,1). Output parameters: - `single_psd`: noise psd matrix of time-frame m of shape (K,N,N)

define class mvdr

Lab Experiment 14
Who was the interferer? Compare the results to the DSB. The next cell plots the noise output of the DSB and the MVDR. Which one is better?
Homework Exercise 8
Draw a flow chart with the steps to use a MVDR. Include the indices of all required formulas for each step.

Part 2.6 Performance evaluation measures

The performance of the beamformers is measured in terms of objective measures. In this module, we use three different objective measures for the evaluation of the performance of the implemented beamformers. These measures are explained as follows.

Interference reduction (IR) : The first objective measure for evaluation is the interference reduction. It evaluates the suppression of the interfering speech signal achieved by the filter. Here, we compute the average difference between the segmental power of the interfering clean speech signal and the segmental power of the filtered version of it. It is formulated as: $$\label{eq:ir} \text{IR} [ \text{dB}]= \frac{1}{Q} \sum\limits_{q=0}^{Q-1} 10 \log \left( \frac{\sum\limits_{t = qL}^{(q+1)L - 1}(\widehat{x}_{i}(t))^{2}}{\sum\limits_{t = qL}^{(q+1)L - 1}(x_{1,i}(t))^{2}} \right), \tag{20}$$ where $\widehat{x}_{i}(t)$ is the filtered version of the interfering speech signal and ${x}_{1,i}(t)$ denotes the interfering signal at the reference microphone. Here, $q$ is the segment index and $L$ denotes the length of each segment.

Noise reduction (NR): Noise reduction evaluates the suppression of additive noise achieved at the output. It is computed similarly to the IR. Instead of the interfering speech signal, we use the noise signal and its filtered version to evaluate the NR. It is formulated as: $$\label{eq:nr} \text{NR} [ \text{dB}]= \frac{1}{Q} \sum\limits_{q=0}^{Q-1} 10 \log \left( \frac{\sum\limits_{t = qL}^{(q+1)L - 1}(\widehat{v}(t))^{2}}{\sum\limits_{t = qL}^{(q+1)L - 1}(v_{1}(t))^{2}} \right), \tag{21}$$ where $\widehat{v}_{}(t)$ is the filtered version of the noise signal and ${v}_{1}(t)$ denotes the noise signal at the reference microphone. The variables $q$ and $L$ are defined similarly as before.

Signal distortion index (SDI): The signal distortion index measures the amount of distortion in the filtered version of the desired source signal with respect to the clean desired source signal at a reference microphone. It is formulated as: $$\text{SDI} = \frac{1}{Q} \sum\limits_{q=0}^{Q-1} \left( \frac{\sum\limits_{t = qL}^{(q+1)L - 1}(\widehat{x}_{d}(t) - x_{1,d}(t))^{2}}{\sum\limits_{t = qL}^{(q+1)L - 1}(x_{1,d}(t))^{2}} \right), \tag{22}$$ where $\widehat{x}_{d}(t)$ is the filtered version of the desired source signal and ${x}_{1,d}(t)$ is the clean speech signal of the desired source at a reference microphone. The variables $q$ and $L$ are defined similarly as before.

Lab Experiment 15
Inspect the evaluation. Which beamformer is better for noise reduction and which one for interference reduction. Is the result consistent with Homework 1? If not, why not?
Lab Experiment 16
Inspect the power pattern of the MVDR over time and frequency. Compare the MVDR to the DSB patterns.

LAB Extension - listen to some other Beamformers (just for interested Students):


LCMV plus Wiener Postfilter



We still use the ground truth DOA ... what happens if we estimate it?

You can replace the RTF vector estimation by: