International Audio Laboratories Erlangen

Professor:
Prof. Dr. ir. Emanuël Habets
emanuel.habets@audiolabs-erlangen.de

Authors:

Wolfgang Mack

Tutors:

Contact:

Wolfgang Mack,
wolfgang.mack@audiolabs-erlangen.de

Friedrich-Alexander Universität Erlangen-Nürnberg

International Audio Laboratories Erlangen

Am Wolfsmantel 33, 91058 Erlangen

# Python: How to ...¶

• perform a elementwise product
• compute a power spectral density (PSD) matrix with the matrix product

# Speech Enhancement Using Microphone Arrays¶

The goal of this notebook is to gain insights into speech enhancement techniques using microphone arrays. A code-skeletton is provided which you have to complete. First, a scenario overview is given.

## Abstract¶

This module is designed to give a practical understanding of performing speech enhancement using microphone arrays. It is is closely related to the lecture 'Speech Enhancement' by Prof. Dr. ir. Emanuël Habets.
In this exercise, you have to implement a commonly used spatial signal processing technique known as beamforming, and analyse the performance of two different beamformers, a signal-independent beamformer known as delay-and-sum beamformer (DSB) and a signal-dependent beamformer known as minimum variance distortionless response (MVDR) beamformer, for a noise and interference reduction task. A free-field scenario is considered, i.e., no reflections or shieldings caused by walls or obstacles are considered. The beamformer performances are compared via objective measures to demonstrate their advantages with respect to the specific reduction task.

## Part 0: Introduction¶

Microphone arrays play an important role in applications like hands-free communication in smartphones, smart speakers, and teleconferencing systems. The spatial diversity offered by microphone arrays can be used to extract the speech signal of interest from the microphone signals corrupted by noise, reverberation and interfering sources. A typical method for this is to obtain a sum of the filtered microphone signals such that the signal from the so-called look direction is maintained while signals from other directions are attenuated. Such a method is known as beamforming. This module gives a practical understanding of such methods used to perform speech enhancement. The notebook provides the theoretical background necessary to understand the formulation of the beamforming methods and the practical aspects regarding their implementation. For the experiments in this module, we consider a scenario where a microphone array on a hands-free device captures two sound sources. Additionally, each microphone signal contains white noise which models the self-noise of the microphone. The white noise is assumed to be uncorrelated between the microphones.

As illustrated in Figure 1, one of the sources is the desired source and the other is an interferer. The task in this experiment is to process the microphone signals to enhance the speech of the desired source while suppressing the interfering speech and noise.
The enhancement system for this module is illustrated in Figure 2.

The microphone signals are first transformed into the time-frequency domain via short-time Fourier transform (STFT) to obtain the input signals for the beamforming algorithms. An overview of the tasks for implementing the different beamformers in this module is given as follows:

• The first task is to compute the so-called relative transfer function (RTF) vector using the geometric information, i.e., the source and microphone positions. The RTF vector, along with the STFT domain microphone signals, is given as an input for the implementation of the signal-independent delay-and-sum beamformer (DSB). This is further explained in Part 2.2.

• The next task is to compute and apply the DSB to the microphone signals to obtain the filtered output (explained in Part 2.3).

• For the signal-dependent minimum variance distortionless response (MVDR) beamformer, we need to compute the power spectral density (PSD) matrices for the desired source, the interference and the noise signal, which is the next task. The PSD matrices are computed using the microphone signals and the ideal masks for each of the above mentioned signals. The ideal mask specifies at which time frequency points the PSD matrix needs to be updated. This is described in Part 2.4.

• The next task is to compute the MVDR filter for a free-field RTF model using the RTF vector and the PSD matrix of the undesired (interfering speech + noise) signals, and apply the filter to the input microphone signals to obtain the filtered output. The task is further explained in Part 2.5.

• The filtered outputs are finally transformed back to the time domain using an inverse STFT. The performance of the different beamforming algorithms is evaluated using objective measures. The objective measures used in this module are interference reduction (IR), noise reduction (NR) and signal distortion index (SDI).

## Part 1: Signal and RTF Model for a Uniform Linear Array¶

### Part 1.1: Signal Model¶

To explain the formulation of the beamforming techniques, we first need to define a signal model.

Consider a discrete time-domain signal model (shown in Figure 3), where $\text{N}$ microphones capture the desired source signal and the interference in presence of additive noise. Then the $n$th microphone signal for $n=1, \ldots,N$ is given by:

\begin{align} y_n(t) &= g_{n,d}(t) * s_{d}(t) + g_{n,i}(t) * s_{i}(t) + v_n(t),\\ &= x_{n,d}(t) + x_{n,i}(t) + v_{n}(t), \tag{1} \label{eq:tdsig} \end{align} where $s_{d}(t)$ is the desired source signal and $s_{i}(t)$ is the interfering speech signal. The acoustic impulse responses between the $n$th microphone and the sources are denoted by $g_{n,d}(t)$ and $g_{n,i}(t)$ for the desired and the interfering source, respectively. The variable $v_{n}(t)$ denotes the additive noise. The desired source signal and the interfering speech signal received at the $n$th microphone are denoted by $x_{n,d}(t)$ and $x_{n,i}(t)$, respectively. The sample-time index is represented by $t$ and $*$ denotes linear convolution. Since the frequency characteristics of a speech signal vary over time, the processing of the received signals is done in the time-frequency domain. The time domain signals are transformed to the time-frequency domain via STFT. The STFT representation of (\ref{eq:tdsig}) is given by:

\begin{align} Y_n(m,k) &= G_{n,d}(m,k) S_{d}(m,k) + G_{n,i}(m,k) S_{i}(m,k) + V_n(m,k), \\ \nonumber &= X_{n,d}(m,k) + \underbrace{X_{n,i}(m,k) + V_n(m,k)}_{U_{n}(m,k)}, \nonumber \tag{2} \end{align}

where the upper-case letters denote the time-frequency domain counterparts of the terms in (\ref{eq:tdsig}) for the $k$th frequency bin, and $m$ denotes the time-frame index. $U_{n}(m,k)$ denotes the undesired signals, which is the sum of the interference and the noise signals. We can express the $N$ STFT domain microphone signals in vector notation as:

\begin{align} \mathbf{y}(m,k) &= \mathbf{g}_{d}(m,k) S_{d}(m,k) + \mathbf{u}(m,k), \\ &= \mathbf{x}_{d}(m,k) + \mathbf{u}(m,k),\nonumber\label{eq:pythagoras} \tag{3} \end{align}

where $\mathbf{y}(m,k) = \left[ Y_1(m,k), Y_2(m,k), \ldots, Y_N(m,k) \right]^T$, and $\mathbf{x}_{d}(m,k)$, $\mathbf{g}_{d}(m,k)$ and $\mathbf{u}(m,k)$ are defined similarly.

We can write the desired source signal vector $\mathbf{x}_{d}(m,k)$ as a function of the received signal at the first microphone:

$$\mathbf{x}_{d}(m,k) = \mathbf{d}(m,k) X_{1,d}(m,k). \tag{4}$$

Using this definition, the microphone signals can also be written as

$$\mathbf{y}(m,k) = \mathbf{d}(m,k) X_{1,d}(m,k) + \mathbf{u}(m,k), \tag{5}$$

where d is known as relative transfer function (RTF) vector

### Part 1.2 RTF Model - Free-field Model¶

For the given signal model, we consider a free-field model where each microphone receives only the direct path signal. The RTF effects are modelled by the RTF vector $\mathbf{d}(m,k)$. The formulation of the RTF vector is described as follows.

The free-field model is illustrated in Figure 1. The vector of acoustic transfer functions for the desired source is denoted by $\mathbf{g}_{d}(k) = \left[ G_{1,d}(k), G_{2,d}(k), \ldots, G_{N,d}(k) \right]^T$. The acoustic transfer function corresponding to the $n$th microphone is given by

$$G_{n,d}(k) = A_{n,d}(k)\text{exp} \left(j 2 \pi f_k {\tau_{n,d}} \right), \tag{6}$$

where $A_{n,d}(k)$ is the attenuation factor for the $n$th microphone due to, e.g., distance attenuation, $f_{k} = \frac{k}{K} f_{s}$ denotes the frequency value corresponding to the $k$th frequency bin, and $\tau_{n,d}$ is the propagation time of the desired source signal to the $n$th microphone. The symbols $f_{s}$ and $K$ denote the sampling frequency and the total number of frequency bins, respectively. The RTF vector for the free-field model is given by:

\begin{align} \mathbf{d}(k) &= \frac{\mathbf{g}_{d}(k)}{G_{1,d}(k)}\\ &= \left[ 1, D_2(k), \ldots, D_N(k) \right]^T , \tag{7}\label{eq:ddef_frfld} \end{align}

with

$$D_n(k) = \frac{G_{n,d}(k)}{G_{1,d}(k)} = \frac{A_{n,d}(k)}{A_{1,d}(k)} \text{exp} \left(j 2 \pi f_k \Delta_{\tau_{n,d}} \right), \tag{8}\label{eq:frfldD}$$

where $\Delta_{\tau_{n,d}}$ is the time difference of arrival (TDOA) of the desired signal at the $n$th microphone with respect to the 1st microphone. This is the formulation of the RTF vector considered for the delay and sum beamformer, and the MVDR beamformer. In the free-field model, the relative transfer functions are considered to be time-independent.

### Part 1.3 Far-Field Model with an Uniform Linear Array¶

The microphone array is assumed to be sufficiently distant to the desired source and to be sufficiently small to approximate the circular sound waves at the microphones as plane wave as illustrated in Figure 4a.

The microphone array in this module is an Uniform Linear Array (ULA) as illustrated in Figure 4b. N microphones are equidistantly stacked in a row with inter-microphone distance d. We assign indices in a rising order to the microphones starting with 1 (the furthest at the left) which we refer to as reference microphone.
Further calculations require an orientation of the ULA in the 2-dimensional vector-space defined in Figure 1. For this purpose, we define the unit-norm vector ULA_direction to point from microphone 1 to N in the coordinates of the defined vector-space.
Figure 5 shows a summary of a wave traveling from a source to a ULA with regard to our assumptions (Far-Field, Free-Field, small ULA).

We define the angle theta in Figure 5 as direction of arrival (DOA) of the desired source. It is the angle between ULA_direction and the normal vector of the arriving wave front, denoted by $source_{dir}$ pointing in the source direction. According to the plane wave model, the DOA is equal for all microphones in the ULA. However, in reality, the incoming wave is not exactly plane but slightly curved because the distance between the desired source and the ULA is finite. To reduce the error coming from the model, the DOA is defined with respect to the ULA center. In practice the DOA is calculated as angle between the vector pointing from the ULA center to the source position and ULA_direction.

Given the DOA, the TDOA and the RTF vector defined in (8) can be calculated. The scenario allows to compute the TDOA with trigonometric functions (sine, cosine) given the DOA, the inter-microphone distance d, and the speed of sound c. The TDOA is used later to construct the RTF vector and eventually to compute the beamformer weights.

Homework Exercise 1 - 1.1 Derive a formula for the TDOA between the first and the second microphone based on theta and the inter-microphone distance d. Calculate the TDOA between the first and the third microphone for c$=343 \frac{m}{s}$, d$=0.05 m$ and theta$= \frac{pi}{4}$.
Hint: Look at the triangle with the angle theta in Figure 5. Can you construct a similar one with d and theta? - 1.2 How is the TDOA between the microphones one and two related to the TDOA between the microphones 1 and 3 or 1 and N? Derive the relation.
Homework Exercise 2 Given the desired source position source_pos, the ULA center position ULA_center, and ULA_direction, derive a formula for theta according to Figure 5. Remember, theta is equal for all microphones in the ULA.
Lab Experiment 1 Implement compute_doa(...). Input parameters:
source_pos: numpy array with the X and Y value of the desired source position
ULA_center: numpy array with the X and Y value of the ULA center
ULA_direction: unit-norm numpy array with 2 elements defining the endfire ULA orientation (left to right)
Output parameters:
theta:scalar angle between the vector pointing from the ULA center to the desired source and ULA_direction
Test 1 Test compute_doa(...): Is theta 45 degree for: - source_pos = np.array([5,5]), - ULA_center = np.array([0,0]) - ULA_direction = np.array([1,0])
Lab Experiment 2 Compute the time difference of arrival (TDOA) of the plane wave front between two microphones (the reference microphone with TDOA 0 is the left one) given theta, the inter-microphone distance d, and speed of sound c. Input parameters: - theta: scalar in radian - d: scalar microphone distance - c: scalar speed of sound Output parameters: TDOA: scalar time difference of arrival between two neighboring microphones Remember: The TDOA of each microphone is with respect to microphone 1 according to our definition (see Figure 4 and 5)
Test 2 Test compute_tdoa(...): Is TDOA =$-1$ for: - theta = $\frac{\pi}{4}$, - d = 0.343 - c = 343 For which theta do you expect TDOA to be 0? Try it!
Lab Experiment 3 Given the TDOA between two neighboring microphones, compute the TDOA of all microphones to the first microphone given a ULA (same distance between all microphones in the array, hence, same relative TDOA between neighboring microphones). Input parameters: - TDOA: time difference of arrival of a plane wave (from left to right) - N: total number of microphones in the ULA Output parameters: - tdoas: array of TDOAs of all N microphones with respect to the first one. The first value is 0 as the TDOA from the first to the first microphone is 0. The shape is (N,).
Test 3 Test compute_tdoa_list(...): Is tdoa_list =$[0.0,0.1,0.2,0.3]$ for: - tdoa = 0.1 - N = 4, i.e., a microphone array of four microphones and a time difference of arrival between neighboring microphones of 0.1 seconds

## Part 2 Beamforming¶

### 2.1 How to Apply Beamformers¶

Our aim is to obtain an estimate of the desired source signal at the first microphone, i.e., ${X}_{1,d}(m,k)$.

This is done by applying a spatial filter to the observed microphone signals and summing across the array elements (shown in Figure 6) and is given by:

$$\widehat{X}_{1,d}(m,k) = \mathbf{h}^{H}(m,k) \mathbf{y}(m,k), \nonumber \tag{9}$$

where $\mathbf{h}(m,k) = \left[ H_{1}^{*}(m,k), H_{2}^{*}(m,k), \ldots, H_{N}^{*}(m,k) \right]^{T}$ is a filter of length $N$ and $(\cdot)^{\text{H}}$ denotes the conjugate transpose or the Hermitian of a vector/matrix. Now, with the given framework the aim is to develop an analytic expression for the filter $\mathbf{h}(m,k)$, which is given by the different beamforming techniques explained in the Parts 2.3 and 2.5.

## 2.2 Relative Transfer Function Vector¶

The RTF vector $\mathbf{d}(k)$ is required to compute the beamformer filters in this module.

Homework Exercise 3
Using the expressions for $\mathbf{d}(k)$ and its elements, given in Eq. (7) and Eq. (8), write a simplified expression for $\mathbf{d}(k)$. Assume an attenuation free scenario.

For sources with a fixed position, $\mathbf{d}(k)$ is frequency dependent but time independent.

Lab Experiment 4 Given tdoas, the free-field RTF vector can be computed for all frequencies according to the equation derived in the previous Homework. Implement compute_RTF_for_freq(frequencies , tdoas) . Input parameters: - frequencies: frequencies to compute the RTF vector for
tdoas: array of the time of arrival differences (scalars) of a plane wave from source direction at the microphones Output parameters: - RTF_vec: array containing the phase shifts for each microphone with respect to the first microphone for all frequencies (so the shape is (K,N))
Test 4 Test compute_RTF_for_freq(frequencies, tdoa_list): Is RTF_vec.shape =$(1,4)$ in the next cell ?
Homework Exercise 4
Draw a flow chart with the steps to compute the RTF vector.
Lab Experiment 5 Use the flow chart of Homework Exercise 4 and the functions you implemented in the previous Lab Experiments to program compute_RTF_vector(source_pos, ULA_center, ULA_direction, frequencies, d, c, N). Input parameters: - See documentation above Output parameters: - RTF_vecs: array containing all the RTF vectors for all frequencies. It is of shape (K,N).

## Part 2.3 Delay and Sum Beamformer¶

The delay and sum beamformer (DSB) is a fixed beamformer, i.e., the parameters of the beamformer are time-invariant and signal independent. They depend on the desired source and the ULA center positions. As the name suggests, this beamformer works by delaying the signals from certain microphones and then summing them afterwards. To explain this further, consider an array of $N$ microphones. When the microphone array picks up a signal coming from an angle other than 90 degrees, every consecutive microphone receives the signal with positive or negative delay. This is because the signal entering from an angle needs to travel an additional distance to the next microphone in the array. This fact is exploited here to obtain a constructive interference in terms of the desired signal and a destructive interference in terms of the signals from noise or interfering sources.

To obtain a constructive interference in terms of the desired source, the filter needs to satisfy

$$\mathbf{h}^{H}(k) \mathbf{d}(k) = 1, \label{eq:DSBcon}\tag{10}$$

where $\mathbf{d}(k)$ is the RTF vector for the free-field model, given in Part 1.2.

Given the RTF vector from Part 2.2, the next task is to apply the DSB filter. Using the condition given in Equation \ref{eq:DSBcon}, the DSB filter is given by:

$$\label{eq:appdsb} \mathbf{h}(k) = \frac{1}{N}\mathbf{d}(k). \tag{11}$$

To explain $\mathbf{h}(k)$ further, consider a single frequency k as plane wave which is captured by a ULA as in Figure 7.

The ULA spatially samples the sine synchronously with N microphones yielding points with different phase offsets with respect to the first microphone due to the spatial diversity of the microphones. All phase offsets are multiples of the first phase offset between the first and the second microphone because of the plane wave assumption. These phase offsets are related to the TDOA and can be translated into the RTF vector $\mathbf{d}(k)$.
The general idea of the DSB is to correct the phase offset (delay the signal) by multiplying the microphone spectra with complex exponential functions (the DSB weights) to obtain constructive interference for the desired signal in the consecutive averaging step over the microphone spectra.

Lab Experiment 6 Implement the function compute_dsb_weight(RTF_vecs) which computes the time and signal invariant DSB weight-matrix. Input parameters: - RTF_vecs: array of the RTF vector of each frequency. The array is of shape (K,N). Output parameters: - dsb_weights: matrix of the dsb weights of each frequency with the shape (K,N)
Lab Experiment 7 Implement the function apply_dsb_weights(dsb_weights, y) which applies the dsb weight matrix to the microphone spectra and returns the enhanced spectrum. Input parameters: - dsb_weights: matrix of the dsb weights for each frequency with the shape (1,K, N) - y: matrix of the spectra of all N microphones of shape (N, K, M) where M is the total number of time frames Output parameters: - X_1_hat: complex matrix of a single enhanced spectrum of shape (K, M) Note: A common error is when X_1_hat.dtype is float and not complex!

You implemented all relevant functions for the DSB. To compare different beamformer parameters, a class environment is defined. We store all relevant parameter information in it (source position, ULA position, ...). You do not have to change anything here.

Now we define a class Beamformer which contains an environment and a weights matrix. The class DSB extends the class Beamformer and contains functions to compute the beamformer weights and to apply them. You already implemented those.

Three environments are created with different parameters to investigate their impact on the performance. We create three DSBs and assign the environments to them. Finally, we compute the DSB weight matrices.

Some functions to visualize the DSB power patterns are already implemented. Run the cell below and inspect the power patterns of the DSBs.

N=4
d=0.05

N=4
d=0.1

## DSB2¶

N=8
d=0.05

Lab Experiment 8 Look at the power patterns. Is the main lobe narrower for a large or small d? How does this affect your decision of Homework 5? Could you build a better beamformer if you had more microphones? In general: If the array apertur is large, is the main lobe broader or narrower? Describe the trade off between a large array apertur and a small d with the results of the module experiments.

## Part 2.4 Spatial Aliasing¶

Spatial aliasing is the result of insufficient spatial sampling of a sound field. It produces angular ambiguities of a beamformer for specific frequencies and relative angles between two DOAs which makes it impossible for the beamformer to perform speech enhancement.
Subsequently, we describe angular ambiguities resulting from the ULA symmetry and spatial aliasing.
Beamformers amplify or attenuate signals dependent on the frequency and the DOA.
Not spatial aliasing:
Given a ULA, the power pattern is symmetric to the ULA axis which results in angular ambiguities, i.e., for ULA beamforming, there is a front-back ambiguity. Beamforming cannot attenuate the interferer if it is placed symmetrical to the desired source w.r.t. the ULA axis. This effect, however, is independent of the wavelength and inter-microphone distance.

Spatial aliasing:
For frequencies above a given threshold, $$\frac{\lambda}{2}\leq d,\tag{12}$$ with the wavelength $\lambda$, there are additional frequency dependent angular ambiguities which are called spatial aliases. Please notice the similarity to the Nyquist theorem for sampling in time domain. Consequently, in spatial domain the theorem is called spatial Nyquist theorem.
Subsequently, we derive spatial aliasing for ULAs and also calculate the angular position of the aliases.
For a given wave length, two DOAs, $\theta_1$ and $\theta_2$, cannot be distinguished if

$$\frac{|d \cdot cos(\theta_1) - d \cdot cos(\theta_2)|}{\lambda} = n \text{ } \forall \text{ }n \text{ }\in \mathbb{N}^+.\tag{13}$$

Recapitulate Figure 5 and the calculation of the TDOA. $d \cdot cos(\theta)$ yields the additional distance an incoming plane wave has to travel from microphone to microphone. If the distance difference of two DOAs is a multiple of the wave length, then the beamformer cannot distinguish the DOAs for the specific frequency as they have the same phase offset when wrapped to the range $0$ and $2\pi$. Hence, spatial aliasing occurs. Furthermore, the higher the frequency, the more wave lengths fit into the difference. As a result, the number of spatial aliases increases for higher frequencies. Please consider the ULA symmetry. The difference between $\theta_1$ and $\theta_2$ cannot be greater than $\pi$.

Homework Exercise 5 Spatial aliasing occurs if $\frac{\lambda}{2} \leq d$. What to you conclude from this result. Would you select a small or a large d to avoid spatial aliasing?
Homework Exercise 6
At which wave length appears the second spatial alias if the first one appears at $\lambda = 10 m$. Remember: The first alias appears if the distance difference the two incoming plane waves have to travel in Equation 13 is equal to the wave length.

Afterwards, listen to some examples :)

N= 4
d=0.05

N = 4
d = 0.1

## Results DSB2¶

N = 8
d=0.05

Lab Experiment 9
Listen to the results above and inspect the spectrograms. To reduce white noise, averaging the microphone signals has the same effect as the DSB. Think about the disadvantage of this method and why the DSB performs better (also for which frequencies and which DOA of the desired source).
Lab Experiment 10
The next step is to add an interferer. Who is the interferer, the man or the woman? Listen to the subsequent example where a DSB focuses on one source. Maybe a better beamformer is required?
Homework Exercise 6
Draw a flow chart with the steps to use the DSB. Include the indices of all required formulas for each step.

### Part 2.5 Minimum variance distortionless response (MVDR) beamformer¶

The next task is the implementation of the MVDR beamformer. The MVDR beamformer is a signal dependent beamformer i.e. the filter coefficients for the MVDR beamformer depend on the statistical properties of the received signals. The aim of the MVDR beamformer is to minimize the power of the undesired signal components at the output while ensuring that the desired signal is not distorted. Mathematically, this can be formulated as

$$\label{eq:mvdropt} \mathbf{h}_{\text{MVDR}}(m,k) = \operatorname*{arg\,min}_{\mathbf{h}} \mathbf{h}^{H}(m,k) \mathbf{\Phi}_{\mathbf{u}}(m,k) \mathbf{h}(m,k) \; \; \text{subject to} \; \; \mathbf{h}^{H}(m,k) \mathbf{d}(m,k) = 1 , \tag{14}$$

where the constraint in the second part ensures a distortionless response. The PSD of the undesired (interfering speech + noise) signals is denoted by $\mathbf{\Phi}_{\mathbf{u}}(m,k)$ and given by:

$$\label{eq:psddef} \mathbf{\Phi}_{\mathbf{u}}(m,k) = \mathcal{E}\{ \mathbf{u}(m,k) \mathbf{u}^{H}(m,k) \}.\tag{15}$$

As can be seen Equation 14 is a constrained optimization problem, which can be solved using Lagrange multipliers. The obtained solution is given by:

$$\label{eq:mvdr} \mathbf{h}_{\text{MVDR}}(m,k) = \frac{\mathbf{\Phi}_{\mathbf{u}}^{-1}(m,k) \mathbf{d}(m,k)}{\mathbf{d}^{H}(m,k) \mathbf{\Phi}_{\mathbf{u}}^{-1}(m,k) \mathbf{d}(m,k)} .\tag{16}$$

Given this formulation, it can be seen that the estimate of the PSD matrix for the undesired (interfering speech + noise) signals is required to obtain the MVDR filter.

Homework Exercise 7 Derive the expression of the MVDR beamformer if $\mathbf{\Phi_u}(m,k) = \mathbf{\Phi_n}(m,k)$, where $\mathbf{\Phi_n}(m,k)$ is the noise PSD matrix${}^\ast$. For the derivation, assume that the noise signals at each microphone are spatially uncorrelated, i.e., $\mathbf{\Phi_n}(m,k) = \sigma_n^2(m,k) \mathbf{I}$, denoting $\mathbf{I}$ the identity matrix. Compare the obtained expression to that of the DSB beamformer. Which beamformer is better for white noise reduction?
Lab Experiment 11
Implement the function compute_mvdr_weight(RTF_vecs, Phi_u) which computes the time and signal dependent MVDR weight-matrix. Input parameters: - RTF_vecs: array of the RTF vector of each frequency of shape (K,N) - Phi_u: Array of shape (M, K, N, N), with the number of time frames M, the number of frequency bins K, the number of microphones N. For each time frame m and each frequency k, there is a N$\times$N PSD matrix. Output parameters: - h: matrix of the mvdr weights of each frequency with the shape (M, K, N) Hint: Inspect the implementation of the weights for the DSB. You can use the function np.linalg.inv(PSD) to inverse a single PSD matrix.

### Power spectral density (PSD) matrix computation¶

In this module, the task is to implement a general recursive PSD estimation method. This function is later called to estimate the PSD matrices of the desired speech (required later) and undesired (interfering speech + noise) signals. The theoretical formulation of the recursive estimation method can be given by: $$\label{eq:psdest} \widehat{\mathbf{\Phi}}(m,k) = \mathcal{I}(m,k) [ \alpha \widehat{\mathbf{\Phi}}(m-1,k) + (1-\alpha) \mathbf{y}(m,k) \mathbf{y}^{H}(m,k)] + (1 - \mathcal{I}(m,k)) \widehat{\mathbf{\Phi}}(m-1,k),\tag{17}$$ where $\mathcal{I}(m,k)$ is the indicator parameter mask that is used to determine at which time frequency points the relevant PSD matrix needs to be updated. The mask is different for the desired and undesired signals. If at a certain time-frequency point, the indicator parameter for the desired speech signal is $1$, then the desired speech signal PSD is updated, otherwise if the indicator parameter for the undesired signals is $1$ then the undesired signal PSD is updated. This indicator parameter (mask) is computed using an oracle mechanism and is not a part of the exercise. For practical implementation, Equation \ref{eq:psdest} can be simplified as:

$$\label{eq:mainpsdest} \widehat{\mathbf{\Phi}}(m,k) = \alpha'(m,k) \widehat{\mathbf{\Phi}}(m-1,k) + (1-\alpha'(m,k)) \mathbf{y}(m,k) \mathbf{y}^{H}(m,k),\tag{18}$$

where the modified update factor $\alpha'(m,k)$ is given by:

$$\label{eq:modalpha} \alpha'(m,k) = \alpha +(1 - \mathcal{I}(m,k))(1 - \alpha).\tag{19}$$
Lab Experiment 12
Implement the function compute_alpha_dash(alpha, mask). Input parameters: - alpha: scalar as defined in the text above. - mask: matrix indicator whether the signals are desired or undesired. It is of shape (M,K) Output parameters: - alpha_dash: vector as specified in Eq. 19 of shape (M,K,1,1).
Lab Experiment 13
Implement the function compute_psd(old_psd,local_psd,alpha_dash). Input parameters: - old_psd: matrix of shape (K,N,N) of time frame m-1. - local_psd: matrix of shape (K,N,N). - alpha_dash: matrix of shape (K,1,1). Output parameters: - single_psd: noise psd matrix of time-frame m of shape (K,N,N)

define class mvdr

Lab Experiment 14
Who was the interferer? Compare the results to the DSB. The next cell plots the noise output of the DSB and the MVDR. Which one is better?
Homework Exercise 8
Draw a flow chart with the steps to use a MVDR. Include the indices of all required formulas for each step.

### Part 2.6 Performance evaluation measures¶

The performance of the beamformers is measured in terms of objective measures. In this module, we use three different objective measures for the evaluation of the performance of the implemented beamformers. These measures are explained as follows.

Interference reduction (IR) : The first objective measure for evaluation is the interference reduction. It evaluates the suppression of the interfering speech signal achieved by the filter. Here, we compute the average difference between the segmental power of the interfering clean speech signal and the segmental power of the filtered version of it. It is formulated as: $$\label{eq:ir} \text{IR} [ \text{dB}]= \frac{1}{Q} \sum\limits_{q=0}^{Q-1} 10 \log \left( \frac{\sum\limits_{t = qL}^{(q+1)L - 1}(\widehat{x}_{i}(t))^{2}}{\sum\limits_{t = qL}^{(q+1)L - 1}(x_{1,i}(t))^{2}} \right), \tag{20}$$ where $\widehat{x}_{i}(t)$ is the filtered version of the interfering speech signal and ${x}_{1,i}(t)$ denotes the interfering signal at the reference microphone. Here, $q$ is the segment index and $L$ denotes the length of each segment.

Noise reduction (NR): Noise reduction evaluates the suppression of additive noise achieved at the output. It is computed similarly to the IR. Instead of the interfering speech signal, we use the noise signal and its filtered version to evaluate the NR. It is formulated as: $$\label{eq:nr} \text{NR} [ \text{dB}]= \frac{1}{Q} \sum\limits_{q=0}^{Q-1} 10 \log \left( \frac{\sum\limits_{t = qL}^{(q+1)L - 1}(\widehat{v}(t))^{2}}{\sum\limits_{t = qL}^{(q+1)L - 1}(v_{1}(t))^{2}} \right), \tag{21}$$ where $\widehat{v}_{}(t)$ is the filtered version of the noise signal and ${v}_{1}(t)$ denotes the noise signal at the reference microphone. The variables $q$ and $L$ are defined similarly as before.

Signal distortion index (SDI): The signal distortion index measures the amount of distortion in the filtered version of the desired source signal with respect to the clean desired source signal at a reference microphone. It is formulated as: $$\text{SDI} = \frac{1}{Q} \sum\limits_{q=0}^{Q-1} \left( \frac{\sum\limits_{t = qL}^{(q+1)L - 1}(\widehat{x}_{d}(t) - x_{1,d}(t))^{2}}{\sum\limits_{t = qL}^{(q+1)L - 1}(x_{1,d}(t))^{2}} \right), \tag{22}$$ where $\widehat{x}_{d}(t)$ is the filtered version of the desired source signal and ${x}_{1,d}(t)$ is the clean speech signal of the desired source at a reference microphone. The variables $q$ and $L$ are defined similarly as before.

Lab Experiment 15
Inspect the evaluation. Which beamformer is better for noise reduction and which one for interference reduction. Is the result consistent with Homework 1? If not, why not?
Lab Experiment 16
Inspect the power pattern of the MVDR over time and frequency. Compare the MVDR to the DSB patterns.

# LAB Extension - listen to some other Beamformers (just for interested Students):¶

• Parametric Multi-Channel Wiener Filter

• Is a combination of an MVDR and a subsequent Postfilter (Wiener Filter)
• More noise reduction but also more speech distortion
• Linear-constrained Minimum Variance Distortionless Response (LCMV) Filter

• Has an additional constrained to remove a undesired source (here second speaker)
• Worse noise reduction than MVDR