Harmonic-Percussive Source Separation

Authors: Meinard Müller, Jonathan Driedger, Thomas Prätzlich, Frank Zalkow


MIR-Course: Harmonic Percussive Source Separation

Sounds can broadly be classified into two classes. Harmonic sound, on the one hand, is what we perceive as pitched sound and what makes us hear melodies and chords. Percussive sound, on the other hand, is noise-like and usually stems from instrument onsets, like the hit on a drum, or from consonants in speech. The goal of harmonic-percussive source separation (HPSS) is to decompose an input audio signal into a signal consisting of all harmonic sounds and a signal consisting of all percussive sounds. In this lab course, we study an HPSS algorithm and implement it in Python. Exploiting knowledge about the spectral structure of harmonic and percussive sounds, this algorithm decomposes the spectrogram of the given input signal into two spectrograms, one for the harmonic and one for the percussive component. Afterwards, two waveforms are reconstructed from these spectrograms, which finally form the desired signals. Additionally, we describe the application of HPSS for enhancing chroma feature extraction and onset detection. The techniques used in this lab cover median filtering, spectral masking, and the inversion of the short-time Fourier transform.

When listening to our environment, we encounter a wide variety of different sounds. However, on a very coarse level, many sounds can be categorized as belonging to one of two classes: harmonic or percussive sounds. Harmonic sounds are the ones which we perceive to have a certain pitch, such that we could, for example, sing along to them. The sound of a violin is a good example of a harmonic sound. Percussive sounds often stem from two colliding objects, like the two shells of castanets. An important characteristic of percussive sounds is that they do not have a pitch but a very clear localization in time. Many real-world sounds are mixtures of harmonic and percussive components. For example, a note played on a piano has a percussive onset (resulting from the hammer hitting the strings) preceding the harmonic tone (resulting from the vibrating string).

Homework Exercise 1
Think about three real-world examples of sounds which are clearly harmonic and three examples of sounds which are clearly percussive.
What are the characteristics of harmonic and percussive signals? Sketch the waveform of a percussive signal and the waveform of a harmonic signal. What are the main differences between those waveforms?

The goal of harmonic-percussive source separation (HPSS) is to decompose a given input signal into a sum of two component signals, one consisting of all harmonic sounds and the other consisting of all percussive sounds. The core observation in many HPSS algorithms is that in a spectrogram representation of the input signal, harmonic sounds tend to form horizontal structures (in time direction), while percussive sounds form vertical structures (in frequency direction). For an example, have a look at the following figure, which shows the power spectrograms of two signals. The left plot shows the power spectrogram of a sine tone with a frequency of $4000$ Hz and a duration of one second. This tone is as harmonic as a sound can be. Its power spectrogram shows just one horizontal line. In contrast, the power spectrogram on the right side shows just one vertical line. It is the spectrogram of a signal which is zero everywhere, except for the sample at $0.5$ seconds, where it is one. Therefore, when listening to this signal, we just hear a brief "click" at $0.5$ seconds. This signal is the prototype of a percussive sound. The same kind of structures can be observed in the lower figure, which shows a spectrogram of a violin recording and a spectrogram of a castanets recording.

Figure: Spectrogram of an ideal harmonic signal (left) and of an ideal percussive signal (right).
Figure: Spectrogram of a violin recording (left) and of a castanets recording (right).

Real-world signals are usually mixtures of harmonic and percussive sounds. Furthermore, there is no absolute definition of when a sound stops "being harmonic" and starts "being percussive". Think, for example, of white noise, which cannot be assigned to either one of these classes. However, with the above observations it is possible to decide whether a time-frequency instance of a spectral representation of the input signal, like the short-time Fourier transform (STFT), belongs rather to the harmonic or rather to the percussive component. This can be done in the following way. Assume we want to find out whether a time-frequency bin in the STFT of the input signal belongs to the harmonic component. In this case, the bin should be part of some horizontal, and therefore harmonic, structure. We can check this by first applying a filter to the power spectrogram of the STFT which enhances horizontal structures and suppresses vertical structures, and then seeing whether the filtered bin has a high value. However, even if its value is high, it might still belong to an even stronger vertical, and therefore percussive, structure. We therefore apply another filter to the power spectrogram which enhances vertical structures and suppresses horizontal structures. Now, if the value of our bin in the vertically enhanced spectrogram is lower than in the horizontally enhanced spectrogram, it is very likely that it belongs to some harmonic sound, and we can assign it to the harmonic component. Otherwise, if its value is higher in the vertically enhanced spectrogram, we know that it is rather part of some percussive sound and assign it to the percussive component. This way, we can decide for every time-frequency instance of the original STFT whether it belongs to the harmonic or to the percussive component and construct two new STFTs. In the STFT for the harmonic component, all bins which were assigned to the percussive component are set to zero, and vice versa for the percussive component. Finally, by "inverting" these STFTs, we get the audio signals for the harmonic and the percussive component.

Homework Exercise 2
Suppose you apply an HPSS algorithm to white noise. Recall that white noise has a constant power spectral density (it is also said to be flat). What do you expect the harmonic and the percussive component to sound like?
Suppose you apply an HPSS algorithm to a recording of your favorite rock band. What do you expect the harmonic and the percussive component to sound like?

We will now describe an actual HPSS algorithm. Formally, given a discrete input audio signal $x:{\mathbb Z}\to{\mathbb R}$, the algorithm should compute a harmonic component signal $x_\mathrm{h}$ and a percussive component signal $x_\mathrm{p}$, such that $x = x_\mathrm{h} + x_\mathrm{p}$.
Furthermore, the signals $x_\mathrm{h}$ and $x_\mathrm{p}$ contain the harmonic and percussive sounds of $x$, respectively. In the following we describe the consecutive steps of an HPSS algorithm. We start with the computation of the STFT (Subsection STFT) and proceed with enhancing the power spectrogram using median filtering (Subsection Median Filtering). Afterwards, the filtered spectrograms are used to compute binary masks (Subsection Binary Masking) which are used to construct STFTs for the harmonic and the percussive component. These STFTs are finally transformed back to the time domain (Subsection ISTFT).

In the first step, we compute the short-time Fourier transform (STFT) ${\mathcal X}$ of the signal $x$ as:

\begin{equation} {\mathcal X}(m,k):= \sum_{n=0}^{N-1} x(n + m H)w(n)\exp(-2\pi ikn/N) \end{equation}

with ${m\in[0:M-1]:=\{0,\ldots,M -1\}}$ and $k\in[0:N-1]$, where $M$ is the number of frames, $N$ is the frame size and length of the discrete Fourier transform, ${w:[0:N -1]\to{\mathbb R}}$ is a window function and $H$ is the hopsize. From ${\mathcal X}$ we can then derive the power spectrogram ${\mathcal Y}$ of $x$:

\begin{equation} {\mathcal Y}(m,k) := |{\mathcal X}(m,k)|^2. \end{equation}
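As a minimal sketch of these two equations in Python (assuming a mono input signal x has already been loaded, e.g. via sf.read, and using librosa's STFT with the parameter values from this lab):
In [ ]:
import numpy as np
import librosa

# Sketch with assumed parameter values: N is the frame/DFT length,
# H the hopsize; librosa returns X with shape (1 + N/2, M).
N, H = 1024, 512
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, window='hann')
Y = np.abs(X) ** 2  # power spectrogram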

Homework Exercise 3
The parameters of the STFT have a crucial influence on the HPSS algorithm. Think about what happens to ${\mathcal Y}$ in the case you choose $N$ to be very large or very small. How could this influence the algorithm? (Hint: Think about how $N$ influences the time- and frequency-resolution of the STFT.)
  • For large $N$, frames cover longer segments of the signal, and the frequency resolution increases while the time resolution decreases. For small $N$, frames cover shorter segments of the signal, and the frequency resolution decreases while the time resolution increases.
Explain in technical terms why harmonic sounds form horizontal and percussive sounds form vertical structures in spectrograms. (Hint: Have a look at the exponential basis functions of the STFT. What does one of these functions describe? How can an impulse be represented with them?)
  • To represent a harmonic sound as a sum of sinusoids (the basis functions of the Fourier transform), one needs just a few of them, and they do not change over time. This means that the same frequency bands have high values in all frames, which leads to horizontal structures in a spectrogram. To represent a short, noise-like burst, one needs many sinusoids. Almost all frequency bands show energy to represent such a burst. However, the burst has a very short duration, so this structure lasts only for one or two frames, which results in a vertical structure in the spectrogram.
Lab Experiment 1
Load the audio file CastanetsViolin.wav using sf.read.
In [ ]:
import soundfile as sf
from IPython.display import Audio

# your code here...

Audio(x, rate=Fs)
Compute the STFT ${\mathcal X}$ of the input signal $x$ using librosa.stft with the parameters N=1024, H=512, and a hann-window.
In [ ]:
import librosa

# your code here...
Compute the power spectrogram ${\mathcal Y}$.
In [ ]:
import numpy as np

# your code here...
Visualize ${\mathcal Y}$. Can you spot harmonic and percussive structures? Apply logarithmic compression with $\gamma=10$ (see STFT Lab for details on this) when visualizing spectrograms.
In [ ]:
from matplotlib import pyplot as plt
%matplotlib inline

# your code here...
Do the same for the parameters $N=128$, $H=64$ with a hann-window, and for $N=8192$, $H=4096$ with a hann-window. How do the spectrograms change when you change the parameters? What happens to the harmonic and percussive structures?
In [ ]:
# your code here...

In the next step, we want to compute a harmonically enhanced spectrogram $\tilde{{\mathcal Y}}_\mathrm{h}$ and a percussively enhanced spectrogram $\tilde{{\mathcal Y}}_\mathrm{p}$ by filtering ${\mathcal Y}$. This can be done by using a median filter. The median of a list of numbers can be found by arranging all numbers from lowest to highest value and picking the middle one. For example, the median of the list $(7, 3, 4, 6, 5)$ is $5$. Formally, let $A = (a_1, a_2, \dots, a_L)$ be a list of length $L \in {\mathbb N}$ consisting of real numbers $a_l \in {\mathbb R}, l \in [1:L]$. First, the elements of $A$ are sorted in ascending order. This results in a list $\tilde{A} = (\tilde{a}_1, \tilde{a}_2, \dots, \tilde{a}_L)$ with $\tilde{a}_l \leq \tilde{a}_m$ for $l < m$ and $l, m \in [1:L]$. Then, the ${\mathrm{median}}$ of $A$ is defined as

\begin{equation} {\mathrm{median}}(A) := \begin{cases} \tilde{a}_{(L+1)/2} & \mbox{for $L$ being odd}\\ (\tilde{a}_{L/2} + \tilde{a}_{L/2+1})/2 & \mbox{otherwise}\end{cases} \end{equation}
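In Python, the median is directly available as np.median; a quick check of the example from above:
In [ ]:
import numpy as np

# median of (7, 3, 4, 6, 5): sorted it reads (3, 4, 5, 6, 7), middle element is 5
print(np.median([7, 3, 4, 6, 5]))  # -> 5.0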

Now, given a matrix $B\in{\mathbb R}^{M\times K}$, we define harmonic and percussive median filters

\begin{eqnarray} {\mathrm{medfilt}_\mathrm{h}}(B)(m,k) := {\mathrm{median}}\big(B(m-\ell_\mathrm{h},k),\ldots,B(m+\ell_\mathrm{h},k)\big)\\ {\mathrm{medfilt}_\mathrm{p}}(B)(m,k) := {\mathrm{median}}\big(B(m,k-\ell_\mathrm{p}),\ldots,B(m,k+\ell_\mathrm{p})\big) \end{eqnarray}

for $M,K,\ell_\mathrm{h},\ell_\mathrm{p}\in{\mathbb N}$, where $2\ell_\mathrm{h} + 1$ and $2\ell_\mathrm{p} + 1$ are the lengths of the median filters, respectively. Note that we simply assume $B(m,k)=0$ for $m \notin [0:M-1]$ or $k \notin [0:K-1]$. The enhanced spectrograms are then computed as

\begin{eqnarray} \tilde{{\mathcal Y}}_\mathrm{h} := {\mathrm{medfilt}_\mathrm{h}}({\mathcal Y})\\ \tilde{{\mathcal Y}}_\mathrm{p} := {\mathrm{medfilt}_\mathrm{p}}({\mathcal Y}) \end{eqnarray}
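As a minimal sketch of these two filters (assuming Y is a power spectrogram with frequency along the first and time along the second axis, as returned by librosa, and assuming odd filter lengths; scipy.signal.medfilt pads with zeros at the boundaries, matching the convention above):
In [ ]:
from scipy import signal

len_h, len_p = 11, 11  # assumed odd filter lengths 2*l_h + 1 and 2*l_p + 1
Y_tilde_h = signal.medfilt(Y, [1, len_h])  # median along time: harmonic enhancement
Y_tilde_p = signal.medfilt(Y, [len_p, 1])  # median along frequency: percussive enhancement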

Homework Exercise 4
The arithmetic ${\mathrm{mean}}$ of a list $A = (a_1, \dots, a_L)$ of length $L$ is defined as ${{\mathrm{mean}}(A):= \frac{1}{L}\sum_{l=1}^{L}a_l}$. Compute the ${\mathrm{median}}$ and the ${\mathrm{mean}}$ of the list ${A=(2, 3, 190, 2, 3)}$. Why do you think the HPSS algorithm employs median filtering and not mean filtering?
Apply a horizontal and a vertical median filter of length $3$ to the matrix \begin{equation*} B = \begin{bmatrix} 1 & 1 & 46 & 2 \\ 3 & 1 & 50 & 1 \\ 60 & 68 & 70 & 67 \\ 2 & 1 & 65 & 1 \end{bmatrix} \end{equation*}
In [ ]:
from scipy import signal
import numpy as np

# your code here...

B = np.array([[1, 1, 46, 2], [3, 1, 50, 1], [60, 68, 70, 67], [2, 1, 65, 1]], dtype='float64')
print('Original B:')
print(B)
print('Horizontally filtered B:')
print(horizontal_median_filter(B, 3))
print('Vertically filtered B:')
print(vertical_median_filter(B, 3))
Explain in your own words why median filtering allows for enhancing/suppressing harmonic/percussive structures in a spectrogram.
Lab Experiment 2
Apply harmonic and percussive median filters to the power spectrogram ${\mathcal Y}$ which you computed in the previous exercise (N=1024, H=512, and a hann-window) using scipy.signal.medfilt.
In [ ]:
# your code here...
Play around with different filter lengths (3, 11, 51, 101). Visualize the filtered spectrograms. What are your observations?
In [ ]:
# your code here...

Having the enhanced spectrograms $\tilde{{\mathcal Y}}_\mathrm{h}$ and $\tilde{{\mathcal Y}}_\mathrm{p}$, we now need to assign all time-frequency bins of ${\mathcal X}$ to either the harmonic or the percussive component. This can be done by binary masking. A binary mask is a matrix ${\mathcal M}\in\{0,1\}^{M\times K}$. It can be applied to an STFT ${\mathcal X}$ by computing ${\mathcal X} \odot {\mathcal M}$, where the operator $\odot$ denotes point-wise multiplication. A mask value of one preserves the value in the STFT and a mask value of zero suppresses it. For our HPSS algorithm, the binary masks are defined by comparing the values in the enhanced spectrograms $\tilde{{\mathcal Y}}_\mathrm{h}$ and $\tilde{{\mathcal Y}}_\mathrm{p}$.

\begin{eqnarray} {\mathcal M}_\mathrm{h}(m,k) := \begin{cases} 1 & \text{if } \tilde{{\mathcal Y}}_\mathrm{h}(m,k) \geq \tilde{{\mathcal Y}}_\mathrm{p}(m,k) \\ 0 & \text{else} \end{cases} \\ {\mathcal M}_\mathrm{p}(m,k) := \begin{cases} 1 & \text{if } \tilde{{\mathcal Y}}_\mathrm{p}(m,k) > \tilde{{\mathcal Y}}_\mathrm{h}(m,k) \\ 0 & \text{else.} \end{cases} \end{eqnarray}

Applying these masks to the original STFT ${\mathcal X}$ yields the STFTs for the harmonic and the percussive component of the signal ${{\mathcal X}_\mathrm{h} := ({\mathcal X} \odot {\mathcal M}_\mathrm{h})}$ and ${{\mathcal X}_\mathrm{p} := ({\mathcal X} \odot {\mathcal M}_\mathrm{p})}$. Note that by the definition of ${\mathcal M}_\mathrm{h}$ and ${\mathcal M}_\mathrm{p}$, it holds that ${\mathcal M}_\mathrm{h}(m,k)+{\mathcal M}_\mathrm{p}(m,k) = 1$ for $m \in[0:M-1]$, $k\in[0:K-1]$. Therefore, every time-frequency bin of ${\mathcal X}$ is assigned either to ${\mathcal X}_\mathrm{h}$ or ${\mathcal X}_\mathrm{p}$.
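A minimal sketch of the masking step (assuming the enhanced spectrograms Y_tilde_h and Y_tilde_p from the median filtering step and the original STFT X):
In [ ]:
# point-wise comparison; >= vs. > ensures every bin goes to exactly one component
M_h = (Y_tilde_h >= Y_tilde_p).astype(float)
M_p = (Y_tilde_p > Y_tilde_h).astype(float)
X_h = X * M_h  # point-wise multiplication with the harmonic mask
X_p = X * M_p  # point-wise multiplication with the percussive mask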

Homework Exercise 5
Assume you have the two enhanced spectrograms \begin{equation*} \begin{array}{cc} \tilde{{\mathcal Y}}_\mathrm{h} = \begin{bmatrix} 1 & 1 & 2 & 2 \\ 1 & 3 & 1 & 1 \\ 60 & 68 & 68 & 67 \\ 1 & 2 & 1 & 1 \end{bmatrix}, & \tilde{{\mathcal Y}}_\mathrm{p} = \begin{bmatrix} 1 & 1 & 46 & 1 \\ 3 & 1 & 50 & 2 \\ 2 & 1 & 65 & 1 \\ 2 & 1 & 65 & 1 \end{bmatrix} \end{array} \end{equation*} Compute the binary masks ${\mathcal M}_\mathrm{h}$ and ${\mathcal M}_\mathrm{p}$ and apply them to the matrix \begin{equation*} {\mathcal X} = \begin{bmatrix} 1 & 1 & 46 & 2 \\ 3 & 1 & 50 & 1 \\ 60 & 68 & 70 & 67 \\ 2 & 1 & 65 & 1 \end{bmatrix} \end{equation*}
In [ ]:
Y_h_toy = np.array([[1, 1, 2, 2], [1, 3, 1, 1], [60, 68, 68, 67], [1, 2, 1, 1]])
Y_p_toy = np.array([[1, 1, 46, 1], [3, 1, 50, 2], [2, 1, 65, 1], [2, 1, 65, 1]])
X_toy = np.array([[1, 1, 46, 2], [3, 1, 50, 1], [60, 68, 70, 67], [2, 1, 65, 1]])

# your code here...
Lab Experiment 3
Use the median filtered power spectrograms $\tilde{{\mathcal Y}}_\mathrm{h}$ and $\tilde{{\mathcal Y}}_\mathrm{p}$ from the previous exercise (filter length 11) to compute the binary masks ${\mathcal M}_\mathrm{h}$ and ${\mathcal M}_\mathrm{p}$.
In [ ]:
# your code here...
Visualize the masks (this time without logarithmic compression).
In [ ]:
# your code here...
Apply the masks to the original STFT ${\mathcal X}$ to compute ${\mathcal X}_\mathrm{h}$ and ${\mathcal X}_\mathrm{p}$.
In [ ]:
# your code here...
Visualize the power spectrograms ${\mathcal Y}_\mathrm{h}$ and ${\mathcal Y}_\mathrm{p}$ of ${\mathcal X}_\mathrm{h}$ and ${\mathcal X}_\mathrm{p}$.
In [ ]:
# your code here...

In the final step, we need to transform our constructed STFTs ${\mathcal X}_\mathrm{h}$ and ${\mathcal X}_\mathrm{p}$ back to the time domain. To this end, we apply an "inverse" STFT to these matrices to compute the component signals $x_\mathrm{h}$ and $x_\mathrm{p}$. Note that the inversion of the STFT is not as trivial as it might seem at first glance. In the case that ${\mathcal X}$ is the original STFT of an audio signal $x$, and further preconditions are satisfied (for example, that $N \geq H$ for $N$ being the size of the discrete Fourier transform and $H$ being the hopsize of the STFT), it is possible to invert the STFT and reconstruct $x$ from ${\mathcal X}$ perfectly. However, as soon as the original STFT ${\mathcal X}$ has been modified to some $\tilde{{\mathcal X}}$, for example by masking, there might be no audio signal which has exactly $\tilde{{\mathcal X}}$ as its STFT. In such a case, one usually aims to find an audio signal whose STFT is approximately $\tilde{{\mathcal X}}$. For this lab course, you can simply assume that you can invert the STFT using librosa.istft.
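As a minimal sketch of the reconstruction (assuming the masked STFTs X_h and X_p and the analysis parameters N and H from before; passing length=len(x) makes the output match the original number of samples):
In [ ]:
x_h = librosa.istft(X_h, hop_length=H, win_length=N, window='hann', length=len(x))
x_p = librosa.istft(X_p, hop_length=H, win_length=N, window='hann', length=len(x))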

Homework Exercise 6
Assume ${\mathcal X}$ is the original STFT of some audio signal $x$. Why do we need the precondition $N \geq H$ for $N$ being the size of the discrete Fourier transform and $H$ being the hopsize of the STFT to reconstruct $x$ from ${\mathcal X}$ perfectly?
Lab Experiment 4
Apply the inverse STFT function librosa.istft to ${\mathcal X}_\mathrm{h}$ and ${\mathcal X}_\mathrm{p}$ from the previous experiment and listen to the results.
In [ ]:
from IPython.display import display

# your code here...

display(Audio(x_h, rate=Fs))
display(Audio(x_p, rate=Fs))
Save the computed harmonic and percussive components using sf.write.
In [ ]:
from IPython.display import FileLink

# your code here...

display(FileLink('x_h.wav'))
display(FileLink('x_p.wav'))

Note that one can specify the filter lengths of the harmonic and percussive median filters in seconds and Hertz, respectively. This makes their physical interpretation easier. Given the sampling rate $F_\mathrm{s}$ of the input signal $x$, as well as the frame length $N$ and the hopsize $H$, we can convert filter lengths given in seconds and Hertz to filter lengths given in indices:

\begin{eqnarray} L_\mathrm{h}(t):=\left\lceil \frac{F_\mathrm{s}}{H} t \right\rceil\label{eqn:L_h}\\ L_\mathrm{p}(d):=\left\lceil \frac{N}{F_\mathrm{s}} d \right\rceil\label{eqn:L_p} \end{eqnarray}
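A minimal sketch of this conversion, using a hypothetical helper filter_lengths (the decrement to an odd value anticipates scipy.signal.medfilt, which requires odd kernel sizes; see Lab Experiment 5):
In [ ]:
import numpy as np

def filter_lengths(lh_sec, lp_Hz, Fs, N, H):
    # convert physical filter lengths to numbers of frames/bins
    L_h = int(np.ceil(Fs / H * lh_sec))  # horizontal length in frames
    L_p = int(np.ceil(N / Fs * lp_Hz))   # vertical length in bins
    # medfilt requires odd lengths: subtract 1 if even
    L_h -= 1 - L_h % 2
    L_p -= 1 - L_p % 2
    return L_h, L_p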

Homework Exercise 7
Assume $F_\mathrm{s}=22050$ Hz, $N=1024$, and $H=256$. Compute $L_\mathrm{h}(0.5\text{ sec})$ and $L_\mathrm{p}(600 \text{ Hz})$.
In [ ]:
# your code here...
Lab Experiment 5
Complete the implementation of the HPSS algorithm in the function HPSS:
  1. Compute the STFT ${\mathcal X}$ of the input signal $x$ using librosa.stft.
  2. Compute the power spectrogram ${\mathcal Y}$ from ${\mathcal X}$.
  3. Convert the median filter lengths from seconds and Hertz to indices. If a filter length is even, subtract 1.
  4. Apply median filters to ${\mathcal Y}$ using scipy.signal.medfilt to compute $\tilde{{\mathcal Y}}_\mathrm{h}$ and $\tilde{{\mathcal Y}}_\mathrm{p}$.
  5. Derive the masks ${\mathcal M}_\mathrm{h}$ and ${\mathcal M}_\mathrm{p}$ from $\tilde{{\mathcal Y}}_\mathrm{h}$ and $\tilde{{\mathcal Y}}_\mathrm{p}$.
  6. Compute ${\mathcal X}_\mathrm{h}$ and ${\mathcal X}_\mathrm{p}$.
  7. Apply the inverse STFT (librosa.istft) to get $x_\mathrm{h}$ and $x_\mathrm{p}$.
In [ ]:
def HPSS(x, N, H, w, Fs, lh_sec, lp_Hz):
    # x:      Input signal
    # N:      Frame length
    # H:      Hopsize
    # w:      Window function of length N
    # Fs:     Sampling rate of x
    # lh_sec: Horizontal (harmonic) median filter length given in seconds
    # lp_Hz:  Vertical (percussive) median filter length given in Hertz

    # stft

    # power spectrogram

    # median filtering

    # masking

    # istft

    return x_h, x_p
Test your implementation:
  1. Load the audio files Stepdad.wav, Applause.wav, and DrumSolo.wav from the Data folder.
  2. Apply HPSS using the parameters N=1024, H=512, a hann-window, lh_sec=0.2, and lp_Hz=500 to all loaded signals.
  3. Listen to the results.
In [ ]:
for file in ('files/Stepdad.wav', 'files/Applause.wav', 'files/DrumSolo.wav'):
    print('# ' + file)
    
    # your code here...
    
    print('Original')
    display(Audio(x, rate=Fs))
    print('Harmonic Component')
    display(Audio(x_h, rate=Fs))
    print('Percussive Component')
    display(Audio(x_p, rate=Fs))
    print('')

In many audio processing tasks, the essential information lies in either the harmonic or the percussive component of an audio signal. In such cases, HPSS is very well suited as a pre-processing step to enhance the outcome of an algorithm. In the following, we introduce two procedures that can be improved by applying HPSS. The harmonic component from the HPSS algorithm can be used to enhance chroma features (Subsection Chroma) and the percussive component helps to improve the results of an onset detection procedure (Subsection Onset).

Two pitches sound similar when they are an octave apart from each other (12 semitones in the equal-tempered scale). We say that these pitches share the same chroma, which we refer to by the pitch spelling names $\{\mathrm{C},\mathrm{C}^{\sharp},\mathrm{D},\mathrm{D}^{\sharp},\mathrm{E},\mathrm{F},\mathrm{F}^{\sharp},\mathrm{G},\mathrm{G}^{\sharp},\mathrm{A},\mathrm{A}^{\sharp},\mathrm{B}\}$. Chroma features exploit this observation by adding up all frequency bands in a power spectrogram that belong to the same chroma. Technically, this can be realized by the following procedure. First, we assign a pitch index (MIDI pitch number) to each frequency index $k \in [1:N/2-1]$ of the spectrogram by using the formula:

\begin{equation} p(k) = \text{round}\left(12\log_2\left( \frac{ k\cdot F_\mathrm{s}}{440 \cdot N}\right)\right)+69, \end{equation}

where $N$ is the number of frequency bins in the spectrogram and $F_\mathrm{s}$ is the sampling rate of the audio signal. Note that $p$ maps frequency indices corresponding to frequencies around the concert pitch A4 (440 Hz) to its MIDI pitch number 69. Then we add up all frequency bands in the power spectrogram belonging to the same chroma $c \in [0:11]$:

\begin{equation} {\mathcal C}(m,c) := \sum_ {\{k |\,p(k)\,\mathrm{mod}\,12 = c\,\}}{{\mathcal Y}(m,k)} \end{equation}

where $m\in [0:M-1]$ and $M$ is the number of frames.
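As a quick sanity check of the pitch mapping (assumed values $F_\mathrm{s}=22050$ Hz and $N=4410$): bin $k=88$ corresponds to the frequency $k \cdot F_\mathrm{s}/N = 440$ Hz and should therefore map to MIDI pitch 69, i.e., chroma $\mathrm{A}$ ($69\,\mathrm{mod}\,12 = 9$):
In [ ]:
Fs, N = 22050, 4410  # assumed values
k = 88               # bin at k * Fs / N = 440 Hz
p = int(np.round(12 * np.log2(k * Fs / (440 * N)))) + 69
print(p, p % 12)     # -> 69 9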

Chroma features are correlated with the pitches and the harmonic structure of music. Pitches usually form horizontal structures in the spectrogram, whereas transient or percussive sounds form vertical structures. Percussive sounds have a negative impact on the chroma extraction, as they "activate" all frequencies in the spectrogram. Hence, one way to improve the chroma extraction is to first apply HPSS and to perform the chroma extraction on the power spectrogram ${\mathcal Y}_\mathrm{h}(m,k) = |{\mathcal X}_\mathrm{h}(m,k)|^2$ of the harmonic component signal.
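A sketch of this pre-processing chain (assuming the HPSS function from Lab Experiment 5 and the function simple_chroma provided in Lab Experiment 6 below; the parameter values are the ones suggested in this lab, and passing 'hann' as the window assumes your HPSS forwards it to librosa.stft):
In [ ]:
x_h, x_p = HPSS(x, 1024, 512, 'hann', Fs, 0.2, 500)
C = simple_chroma(x, 4410, 2205, Fs)      # chroma from the original signal
C_h = simple_chroma(x_h, 4410, 2205, Fs)  # chroma from the harmonic component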

Lab Experiment 6
Apply the HPSS algorithm as a pre-processing step in a chroma extraction procedure:
  1. Load the file CastanetsViolin.wav.
  2. Compute chroma features on $x$ with the parameters N=4410 and H=2205.
  3. Visualize the chroma features.
  4. Apply your HPSS algorithm to separate the castanets from the violin.
  5. Use the harmonically enhanced signal $x_\mathrm{h}$ to compute chroma features and visualize them.
  6. Now compare the visualization of the chroma extracted from the original signal $x$ and the chroma extracted from the harmonic component signal $x_\mathrm{h}$. What do you observe?
In [ ]:
from numpy.linalg import norm

def simple_chroma(x, N, H, Fs):
    # x:       input signal
    # N:       frame length
    # H:       hopsize
    # Fs:      sampling rate
    
    X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, window='hann', pad_mode='constant')
    Y = np.abs(X) ** 2
    Y_comp = np.log(1 + 0.5 * Y)
    C = np.zeros((12, Y.shape[1]), dtype=Y.dtype)
    n = np.arange(1, Y.shape[0])
    pitches = np.round(12 * np.log2((Fs * n/N) / 440)) + 69
    chroma_mapping = pitches % 12
    chroma_mapping = np.insert(chroma_mapping, 0, -1) # never use DC offset
    for k in range(12):
        C[k, :] = Y_comp[chroma_mapping == k, :].sum(axis=0)
    C = normalize_chroma(C, 2, 0.0001)
        
    return C

def normalize_chroma(C, norm_p, threshold):
    # normalize each frame by its p-norm; frames with a norm below the
    # threshold are left unnormalized to avoid amplifying near-silence
    f_norm = norm(C, norm_p, axis=0)
    C_norm = C.copy()
    mask = f_norm >= threshold
    C_norm[:, mask] /= f_norm[mask]
    return C_norm

# your code here...

Onset detection is the task of finding the temporal positions of note onsets in a music recording. More concretely, the task could be to detect all time positions at which some drum is hit in a recording of a rock song. One way to approach this problem is to assume that drum hits emit a short burst of high energy; the goal is therefore to detect these bursts in the input signal. To this end, one first computes the short-time power ${\mathcal P}$ of the input signal $x$ by

\begin{equation} {\mathcal P}(m):= \sum_{n=0}^{N-1} x(n + m H)^2 \end{equation}

where $H$ is the hopsize and $N$ is the length of one frame (similar to the computation of the STFT). Since we are looking for time positions of high energy, the goal is to detect peaks in ${\mathcal P}$. A common technique to enhance peaks in a sequence is to subtract the local average $\tilde{{\mathcal P}}$ from ${\mathcal P}$ itself, where $\tilde{{\mathcal P}}$ is defined by

\begin{equation} \tilde{{\mathcal P}}(m) := \frac{1}{2J+1}\sum_{j=-J}^{J}{\mathcal P}(m+j) \end{equation}

for a neighborhood size $J\in{\mathbb N}$ and $m\in[0:M-1]$, where $M$ is the number of frames. Note that we assume ${\mathcal P}(m) = 0$ for $m\notin [0:M-1]$. From this, we compute a novelty curve ${\mathcal N}$

\begin{equation} {\mathcal N}(m) := \max(0,{\mathcal P}(m) - \tilde{{\mathcal P}}(m)) \end{equation}

The peaks in ${\mathcal N}$ indicate positions of high energy in $x$ and are therefore potential time positions of drum hits.
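Formulated as a convolution, a minimal sketch of the last two equations looks as follows (assuming P is the short-time power and J the neighborhood size; np.convolve with mode='same' zero-pads at the boundaries, matching the convention ${\mathcal P}(m)=0$ outside $[0:M-1]$):
In [ ]:
kernel = np.ones(2 * J + 1) / (2 * J + 1)    # local averaging filter
P_avg = np.convolve(P, kernel, mode='same')  # local average
novelty = np.maximum(0, P - P_avg)           # half-wave rectified difference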

This procedure works well if the initial assumption is met, namely that onsets or drum hits emit bursts of energy which stand out from the remaining energy in the signal. However, especially in professionally mixed music recordings, the short-time energy is often adjusted to be more or less constant over time (compression). One possibility to circumvent this problem is to apply HPSS to the input signal prior to the onset detection. The onset detection is then executed solely on the percussive component, which usually contains all drum hits and satisfies the assumption of having energy bursts at the respective time positions.
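A sketch of the combined procedure (assuming the HPSS function from Lab Experiment 5 and the onset_detection function to be completed below; all parameter values are the ones suggested in this lab):
In [ ]:
x_h, x_p = HPSS(x, 1024, 512, 'hann', Fs, 0.2, 500)  # assumed parameter values
novelty = onset_detection(x_p, 882, 441, 10)         # detect onsets on the percussive component only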

Lab Experiment 7
Complete the implementation of the onset detection algorithm in the function onset_detection:
  1. Compute the short-time power ${\mathcal P}$ of the input signal $x$ using the provided function stp.
  2. Compute the local average $\tilde{{\mathcal P}}$. (Hint: Note that the equation can be formulated as a convolution and that you can compute convolutions in Python using the function np.convolve. Note further that this function has a keyword argument mode. Finally, have a look at the function np.ones.)
  3. Compute the novelty curve ${\mathcal N}$.
In [ ]:
def stp(x, N, H):
    # x:  Input signal
    # N:  Frame length
    # H:  Hopsize
    x_pad = np.concatenate((x, np.zeros(N)))
    num_windows = int(np.ceil(1 + (len(x) - N) / H))
    win_pos = np.arange(num_windows) * H
    idx = np.array([np.arange(w, w+N) for w in win_pos], dtype='int32')
    P = (x_pad[idx] ** 2).sum(axis=1) / N
    return P
    
def onset_detection(x, N, H, J):
    # x:      Input signal
    # N:      Frame length
    # H:      Hopsize
    # J:      Neighborhood
    
    # short-time power

    # local average

    # novelty
    
    return novelty  # the novelty curve; avoid reusing the name N, which is the frame length
Test your implementation by applying it to the audio file StillPluto_BitterPill.wav. As a starting point, use $N=882$, $H = 441$, and $J = 10$.
In [ ]:
# your code here...

plt.figure(figsize=(16, 10.66))
plt.plot(t, novelty)
Sonify your results using the function sonify_noveltyCurve. This function generates a stereo audio signal in which you can hear the provided original signal in one of the channels. In the other channel, each peak in the provided novelty curve is audible as a click sound. You can therefore check by listening whether the peaks in your computed novelty curve are aligned with drum hits in the original signal. To apply the function sonify_noveltyCurve, you need to specify the sampling frequency of the novelty curve. How can you compute it? (Hint: It depends on $H$ and the sampling frequency $F_\mathrm{s}$ of the input audio signal.)
In [ ]:
def sonify_noveltyCurve(novelty, x, Fs, sampling_frequency_novelty):
    pos = np.append(novelty, novelty[-1]) > np.insert(novelty, 0, novelty[0])
    neg = np.logical_not(pos)
    peaks = np.where(np.logical_and(pos[:-1], neg[1:]))[0]
    
    values = novelty[peaks]
    values /= np.max(values)
    peaks = peaks[values >= 0.01]
    values = values[values >= 0.01]
    peaks_idx = np.int32(np.round(peaks / sampling_frequency_novelty * Fs))
    
    sine_periods = 8
    sine_freq = 880
    click = np.sin(np.linspace(0, sine_periods * 2 * np.pi, sine_periods * Fs//sine_freq))
    ramp = np.linspace(1, 1/len(click), len(click)) ** 2
    click = click * ramp
    click = (click * np.abs(np.max(x)))
    
    out = np.zeros(len(x), dtype=x.dtype)
    for i, start in enumerate(peaks_idx):
        idx = np.arange(start, min(start + len(click), len(out)))  # guard against running past the signal end
        out[idx] += (click[:len(idx)] * values[i]).astype(x.dtype)
    
    return np.vstack((x, out))              

# your code here...

Audio(newx, rate=Fs)
Listen to the generated results. What is your impression?
Now apply your HPSS algorithm to the audio file and rerun the detection algorithm on just the percussive component $x_\mathrm{p}$. Again, sonify the results. What is your impression now?
In [ ]:
# your code here...

plt.figure(figsize=(16, 10.66))
plt.plot(t, novelty)

newx = sonify_noveltyCurve(novelty, x, Fs, Fs/H)
Audio(newx, rate=Fs)

Acknowledgment: The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS.