FMP AudioLabs
C8

Salience Representation


Following Section 8.2.2 of [Müller, FMP, Springer 2015], we introduce in this notebook an enhanced log-frequency spectrogram referred to as salience representation. Exploiting the instantaneous frequency, we show how the log-frequency spectrogram can be improved and refined, in particular in the low-frequency part of the spectrum. The salience representation as introduced in this notebook is inspired by the work of Salamon and Gómez.

  • Justin Salamon and Emilia Gómez: Melody Extraction from Polyphonic Music Signals using Pitch Contour Characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20 (2012), pp. 1759–1770.
    Bibtex

Running Example: Freischütz

As our running example throughout this notebook, we use a short excerpt of an aria from the opera "Der Freischütz" by Carl Maria von Weber. The main melody of this excerpt is performed by a soprano singer.

FMP_C8_F10a


The following figure shows a logarithmically compressed spectrogram of the audio excerpt as well as a zoomed-in time–frequency section. In view of the instantaneous frequency to be used, we chose a relatively small hop size parameter $H$.

In [1]:
import numpy as np
import os, sys, librosa
from scipy import signal
from scipy import ndimage
from matplotlib import pyplot as plt
import IPython.display as ipd
from numba import jit

sys.path.append('..')
import libfmp.b
import libfmp.c2
import libfmp.c3
import libfmp.c6
import libfmp.c8

%matplotlib inline

# Load wav
fn_wav = os.path.join('..', 'data', 'C8', 'FMP_C8_F10_Weber_Freischuetz-06_FreiDi-35-40.wav')
Fs = 22050
x, Fs = librosa.load(fn_wav, sr=Fs)

# Computation of STFT
N = 1024
H = 128
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, pad_mode='constant')
gamma = 1
Y = np.log(1 + gamma * np.abs(X))

figsize = (10,4)
fig, ax = plt.subplots(1, 2, gridspec_kw={'width_ratios': [2, 1]}, figsize=figsize)        

ylim_zoom_pitch = [65, 90] 
ylim_zoom = [libfmp.c3.f_pitch(ylim_zoom_pitch[0]), libfmp.c3.f_pitch(ylim_zoom_pitch[1])]
xlim_zoom = [0,1]
cmap = libfmp.b.compressed_gray_cmap(alpha=5)

libfmp.b.plot_matrix(Y, Fs=Fs/H, Fs_F=N/Fs, ax=[ax[0]], title='Spectrogram', 
                     colorbar=True, cmap=cmap);
ax[0].set_ylim([0, 5000])
libfmp.b.plot_matrix(Y, Fs=Fs/H, Fs_F=N/Fs, ax=[ax[1]], title='', 
                     colorbar=True, cmap=cmap);
ax[1].set_ylim(ylim_zoom)
ax[1].set_xlim(xlim_zoom)

plt.tight_layout()

Log-Frequency Spectrogram

As preparation, let us recall the log-frequency spectrogram (Section 3.1.1 of [Müller, FMP, Springer 2015]. Let $x$ denote an audio signal sampled at a rate of $F_\mathrm{s}$ and $\mathcal{X}$ its STFT using a window length $N\in\mathbb{N}$ and hop size $H\in\mathbb{N}$. The frequency index $k\in[0:N/2]$ corresponds to

\begin{equation} F_\mathrm{coef}(k) := \frac{k\cdot F_\mathrm{s}}{N} \end{equation}

given Hertz. To obtain a log-frequency spectrogram, one strategy is to pool or bin the STFT coefficients regarding the sets

\begin{equation} P(p) := \{k:F_\mathrm{MIDI}(p-0.5) \leq F_\mathrm{coef}(k) < F_\mathrm{MIDI}(p+0.5)\} \end{equation}

for pitch parameters $p\in[0:127]$. The center frequencies are given by

\begin{equation} F_\mathrm{MIDI}(p) = 2^{(p-69)/12} \cdot 440. \end{equation}

Instead of fixing a pitch and looking for all frequencies that lie in the resulting pitch band, one can also define a mapping $\mathrm{Bin}:\mathbb{R}\to\mathbb{Z}$ that assigns a pitch index to a given frequency:

\begin{equation} \mathrm{Bin}(\omega) := \left\lfloor 12\cdot\log_2\left(\frac{\omega}{440}\right)+69.5\right\rfloor. \end{equation}

Using this function, the binning can be expressed by

\begin{equation} P(p) := \{k: \mathrm{Bin}(F_\mathrm{coef}(k))=p\}. \end{equation}

From this, we obtain a log-frequency spectrogram $\mathcal{Y}_\mathrm{LF}:\mathbb{Z}\times [0:127]$ via pooling:

\begin{equation} \mathcal{Y}_\mathrm{LF}(n,p) := \sum_{k \in P(p)}{|\mathcal{X}(n,k)|^2}. \end{equation}

The next code cell computes a log-frequency spectrogram of our Freischütz example using the implementation provided by the FMP notebook on the log-frequency spectrogram and chromagram.

In [2]:
Y_LF, F_coef_hertz, F_coef_cents = libfmp.c8.compute_y_lf_bin(Y, Fs, N, R=100, 
                                                              F_min=32.703, F_max=11025)  

fig, ax = plt.subplots(1, 2, gridspec_kw={'width_ratios': [2, 1]}, figsize=figsize)         

libfmp.b.plot_matrix(Y_LF, Fs=Fs/H, Fs_F=1, ax=[ax[0]], ylabel='Frequency (pitch)',
                     title='LF-spectrogram (semitone resolution)', 
                     colorbar=True, cmap=cmap);
ax[0].set_ylim([33, 110])
libfmp.b.plot_matrix(Y_LF, Fs=Fs/H, Fs_F=1, ax=[ax[1]], ylabel='Frequency (pitch)',
                     title='', colorbar=True, cmap=cmap);
ax[1].set_ylim(ylim_zoom_pitch)
ax[1].set_xlim(xlim_zoom)
plt.tight_layout()

Refined Binning

We now extend the definition of the log-frequency spectrogram by considering a more general bin assignment. To this end, let $\omega_\mathrm{ref}\in\mathbb{R}$ be a reference frequency which is to be assigned to the bin index $1$. Furthermore, let $R\in\mathbb{R}$ (given in cents) be the desired resolution of the logarithmically spaced frequency axis. Then, for a frequency $\omega\in\mathbb{R}$ (given in Hertz), the bin index $\mathrm{Bin}(\omega)$ is defined as

\begin{equation} \label{eq:AudioDeco:Mel:SalSpec:binAssign} \mathrm{Bin}(\omega) := \left\lfloor \frac{1200}{R} \cdot\log_2\left(\frac{\omega}{\omega_\mathrm{ref}}\right)+1.5\right\rfloor. \end{equation}

For example, $R=100$ yields a subdivision of the frequency axis with a resolution of $100$ cents (one semitone) per bin. Using $R=10$ results in a finer subdivision of the frequency axis, where each bin corresponds to $10$ cents (a tenth of a semitone). Based on the bin mapping function, we now extend the definition of the log-frequency spectrogram above. Fixing a reference frequency $\omega_\mathrm{ref}$ and a resolution $R$, let $B\in\mathbb{N}$ be the number of bins to be considered. For each bin index $b\in[1:B]$, we then define the set

\begin{equation} P(b) := \left\{k: \mathrm{Bin}\left(F_\mathrm{coef}{(k)}\right)=b\right\}. \end{equation}

Furthermore, starting with the spectrogram $\mathcal{Y} = |\mathcal{X}(n,k)|^2$ (or a logarithmically compressed version of it), we set

\begin{equation} \label{eq:AudioDeco:Mel:SalSpec:SpecLogFreq} \mathcal{Y}_\mathrm{LF}(n,b) := \sum_{k \in P(b)}{\mathcal{Y}(n,k)} \end{equation}

for each frame index $n\in\mathbb{Z}$ and bin index $b\in[1:B]$. The refined binning strategy is implemented in the next code cell and tested on our Freischütz example. As for the implementation, note the following:

  • As input, we use the logarithmically compressed spectrogram computed above.
  • Only the frequencies between $\omega_\mathrm{min}=55~\mathrm{Hz}$ (corresponding to pitch $p=33$) and $\omega_\mathrm{max}=1760~\mathrm{Hz}$ (corresponding to pitch $p=93$) are considered. This range covers five octaves (corresponding to $6000$ cents).
  • As for the binning index $b\in[1:B]$, we deviate in the Python implementation from the theory by starting indexing with index 0. This leads to a systematic index shift of minus one in relation to the algorithmic description given above.
  • The reference frequency is set to $\omega_\mathrm{ref}= \omega_\mathrm{min}$ and the resolution to $R=50$ cents.
  • In the visualization of the log-frequency spectrogram, the frequency axis is specified in cents (with $0$ cents corresponding to $\omega_\mathrm{ref}=55~\mathrm{Hz}$.
  • In particular for low frequencies, the bins may be empty due to the linear frequency grid introduced by the STFT. The problem of empty bins may be resolved by increasing the frequency grid resolution or by using interpolation techniques.
In [3]:
@jit(nopython=True)
def f_coef(k, Fs, N):
    """STFT center frequency

    Notebook: C8/C8S2_SalienceRepresentation.ipynb

    Args:
        k (int): Coefficient number
        Fs  (scalar): Sampling rate in Hz
        N (int): Window length in samples

    Returns:
        freq (float): STFT center frequency
    """
    return k * Fs / N

@jit(nopython=True)
def frequency_to_bin_index(F, R=10.0, F_ref=55.0):
    """| Binning function with variable frequency resolution
    | Note: Indexing starts with 0 (opposed to [FMP, Eq. (8.49)])

    Notebook: C8/C8S2_SalienceRepresentation.ipynb

    Args:
        F (float): Frequency in Hz
        R (float): Frequency resolution in cents (Default value = 10.0)
        F_ref (float): Reference frequency in Hz (Default value = 55.0)

    Returns:
        bin_index (int): Index for bin (starting with index 0)
    """
    bin_index = np.floor((1200 / R) * np.log2(F / F_ref) + 0.5).astype(np.int64)
    return bin_index

@jit(nopython=True)
def p_bin(b, freq, R=10.0, F_ref=55.0):
    """Computes binning mask [FMP, Eq. (8.50)]

    Notebook: C8/C8S2_SalienceRepresentation.ipynb

    Args:
        b (int): Bin index
        freq (float): Center frequency
        R (float): Frequency resolution in cents (Default value = 10.0)
        F_ref (float): Reference frequency in Hz (Default value = 55.0)

    Returns:
        mask (float): Binning mask
    """
    mask = frequency_to_bin_index(freq, R, F_ref) == b
    mask = mask.reshape(-1, 1)
    return mask


@jit(nopython=True)
def compute_y_lf_bin(Y, Fs, N, R=10.0, F_min=55.0, F_max=1760.0):
    """Log-frequency Spectrogram with variable frequency resolution using binning

    Notebook: C8/C8S2_SalienceRepresentation.ipynb

    Args:
        Y (np.ndarray): Magnitude spectrogram
        Fs (scalar): Sampling rate in Hz
        N (int): Window length in samples
        R (float): Frequency resolution in cents (Default value = 10.0)
        F_min (float): Lower frequency bound (reference frequency) (Default value = 55.0)
        F_max (float): Upper frequency bound (is included) (Default value = 1760.0)

    Returns:
        Y_LF_bin (np.ndarray): Binned log-frequency spectrogram
        F_coef_hertz (np.ndarray): Frequency axis in Hz
        F_coef_cents (np.ndarray): Frequency axis in cents
    """
    # [FMP, Eq. (8.51)]
    B = frequency_to_bin_index(np.array([F_max]), R, F_min)[0] + 1
    F_coef_hertz = 2 ** (np.arange(0, B) * R / 1200) * F_min
    F_coef_cents = np.arange(0, B*R, R)
    Y_LF_bin = np.zeros((B, Y.shape[1]))

    K = Y.shape[0]
    freq = f_coef(np.arange(0, K), Fs, N)
    freq_lim_idx = np.where(np.logical_and(freq >= F_min, freq <= F_max))[0]
    freq_lim = freq[freq_lim_idx]
    Y_lim = Y[freq_lim_idx, :]

    for b in range(B):
        coef_mask = p_bin(b, freq_lim, R, F_min)
        Y_LF_bin[b, :] = (Y_lim*coef_mask).sum(axis=0)
    return Y_LF_bin, F_coef_hertz, F_coef_cents

R = 50
F_min = 55.0
F_max = 1760.0

fig, ax = plt.subplots(1, 2, gridspec_kw={'width_ratios': [2, 1]}, figsize=figsize)    

Y_LF_bin, F_coef_hertz, F_coef_cents = compute_y_lf_bin(Y, Fs, N, R, F_min=F_min, F_max=F_max)


libfmp.b.plot_matrix(Y_LF_bin, Fs=Fs/H, F_coef=F_coef_cents, ax=[ax[0]], ylabel='Frequency (cents)',
                     title='LF-spectrogram (R = 50 cents)', colorbar=True, cmap=cmap);

libfmp.b.plot_matrix(Y_LF_bin, Fs=Fs/H, F_coef=F_coef_cents, ax=[ax[1]], ylabel='Frequency (cents)',
                     title='', colorbar=True, cmap=cmap);

ylim_zoom_cents = [3200, 5700] 
xlim_zoom = [0,1]
ax[1].set_xlim(xlim_zoom)
ax[1].set_ylim(ylim_zoom_cents)
plt.tight_layout()

Using Instantaneous Frequency

In the simple binning approach as introduced above, the linearly spaced frequency information in $\mathcal{Y}$ is expanded in a nonlinear, logarithmic fashion. This results in artifacts in the frequency direction (e.g., the horizontal white stripes), which are visible particularly in the lower part of $\mathcal{Y}_\mathrm{LF}$. We now discuss how this problem can be alleviated by using the instantaneous frequency (see also Section 8.2.1 of [Müller, FMP, Springer 2015]). Instead of taking the center frequencies $F_\mathrm{coef}(k)$, the idea is to employ the refined frequency estimates $F_\mathrm{coef}^\mathrm{IF}(k,n)$ for defining the sets

\begin{equation} P^\mathrm{IF}(b,n) := \left\{k: \mathrm{Bin}\left(F_\mathrm{coef}^\mathrm{IF}(k, n)\right)=b\right\} \end{equation}

for $b\in[1:B]$ and $n\in\mathbb{Z}$. From this new bin assignment, we derive a refined log-frequency spectrogram $\mathcal{Y}_\mathrm{LF}^\mathrm{IF}$ by setting

\begin{equation} \label{eq:AudioDeco:Mel:SalSpec:SpecLogFreqIF} \mathcal{Y}_\mathrm{LF}^\mathrm{IF}(n,b) := \sum_{k \in P^\mathrm{IF}(b,n)}{\mathcal{Y}(n,k)} \end{equation}

for each frame index $n\in\mathbb{Z}$ and bin index $b\in[1:B]$. The effect of this modification is illustrated by the following code cell. Note that using the instantaneous frequency alleviates some of the problems introduced by the linear frequency grid of the STFT. In particular time–frequency patterns become sharper when spectral coefficients can be clearly assigned to a single harmonic source.

In [4]:
@jit(nopython=True)
def p_bin_if(b, F_coef_IF, R=10.0, F_ref=55.0):
    """Computes binning mask for instantaneous frequency binning [FMP, Eq. (8.52)]

    Notebook: C8/C8S2_SalienceRepresentation.ipynb

    Args:
        b (int): Bin index
        F_coef_IF (float): Instantaneous frequencies
        R (float): Frequency resolution in cents (Default value = 10.0)
        F_ref (float): Reference frequency in Hz (Default value = 55.0)

    Returns:
        mask (np.ndarray): Binning mask
    """
    mask = frequency_to_bin_index(F_coef_IF, R, F_ref) == b
    return mask

@jit(nopython=True)
def compute_y_lf_if_bin(X, Fs, N, H, R=10, F_min=55.0, F_max=1760.0, gamma=0.0):
    """Binned Log-frequency Spectrogram with variable frequency resolution based on instantaneous frequency

    Notebook: C8/C8S2_SalienceRepresentation.ipynb

    Args:
        X (np.ndarray): Complex spectrogram
        Fs (scalar): Sampling rate in Hz
        N (int): Window length in samples
        H (int): Hopsize in samples
        R (float): Frequency resolution in cents (Default value = 10)
        F_min (float): Lower frequency bound (reference frequency) (Default value = 55.0)
        F_max (float): Upper frequency bound (Default value = 1760.0)
        gamma (float): Logarithmic compression factor (Default value = 0.0)

    Returns:
        Y_LF_IF_bin (np.ndarray): Binned log-frequency spectrogram using instantaneous frequency
        F_coef_hertz (np.ndarray): Frequency axis in Hz
        F_coef_cents (np.ndarray): Frequency axis in cents
    """
    # Compute instantaneous frequencies
    F_coef_IF = libfmp.c8.compute_if(X, Fs, N, H)
    freq_lim_mask = np.logical_and(F_coef_IF >= F_min, F_coef_IF < F_max)
    F_coef_IF = F_coef_IF * freq_lim_mask

    # Initialize ouput array and compute frequency axis
    B = frequency_to_bin_index(np.array([F_max]), R, F_min)[0] + 1
    F_coef_hertz = 2 ** (np.arange(0, B) * R / 1200) * F_min
    F_coef_cents = np.arange(0, B*R, R)
    Y_LF_IF_bin = np.zeros((B, X.shape[1]))

    # Magnitude binning
    if gamma == 0:
        Y = np.abs(X) ** 2
    else:
        Y = np.log(1 + np.float32(gamma)*np.abs(X))
    for b in range(B):
        coef_mask = p_bin_if(b, F_coef_IF, R, F_min)

        Y_LF_IF_bin[b, :] = (Y * coef_mask).sum(axis=0)
    return Y_LF_IF_bin, F_coef_hertz, F_coef_cents


R = 50
F_min = 55.0
F_max = 1760.0
Y_LF_IF_bin, F_coef, F_coef_cents = compute_y_lf_if_bin(X, Fs, N, H, R=R, 
                                                        F_min=F_min, F_max=F_max, gamma=1)

fig, ax = plt.subplots(1, 2, gridspec_kw={'width_ratios': [2, 1]}, figsize=figsize)  
libfmp.b.plot_matrix(Y_LF_IF_bin, Fs=Fs/H, F_coef=F_coef_cents, ax=[ax[0]], ylabel='Frequency (cents)',
                     title=r'IF-based LF-spectrogram ($R = %0.0f$ cents)'%R, colorbar=True, cmap=cmap);

libfmp.b.plot_matrix(Y_LF_IF_bin, Fs=Fs/H, F_coef=F_coef_cents, ax=[ax[1]], ylabel='Frequency (cents)',
                     title='', colorbar=True, cmap=cmap);
ax[1].set_xlim(xlim_zoom)
ax[1].set_ylim(ylim_zoom_cents)
plt.tight_layout()

In the following figure, we show the IF-based log-frequency spectrogram using a resolution of $R=10$ cents. Note that, even at this high frequency resolution, the artifacts due to empty frequency bins are barely visible.

In [5]:
R = 10
Y_LF_IF_bin, F_coef, F_coef_cents = compute_y_lf_if_bin(X, Fs, N, H, R=R, 
                                                        F_min=F_min, F_max=F_max, gamma=1)

fig, ax = plt.subplots(1, 2, gridspec_kw={'width_ratios': [2, 1]}, figsize=figsize) 
libfmp.b.plot_matrix(Y_LF_IF_bin, Fs=Fs/H, F_coef=F_coef_cents, ax=[ax[0]], ylabel='Frequency (cents)', 
                     title='IF-based LF-spectrogram ($R = %0.0f$ cents)'%R, colorbar=True, cmap=cmap);

libfmp.b.plot_matrix(Y_LF_IF_bin, Fs=Fs/H, F_coef=F_coef_cents, ax=[ax[1]], ylabel='Frequency (cents)', 
                     title='', colorbar=True, cmap=cmap);
ax[1].set_xlim(xlim_zoom)
ax[1].set_ylim(ylim_zoom_cents)
plt.tight_layout()

Harmonic Summation

Recall that a sound event such as a musical tone is associated to a fundamental frequency along with its harmonic partials, which are (approximately) the integer multiples of the fundamental frequency. Therefore, a spectrogram representation of a recorded melody typically exhibits an entire family of frequency trajectories which are stacked on top of each other. The multiple appearance of tonal time–frequency patterns can be exploited to improve a spectrogram representation. The idea is to jointly consider a frequency and its harmonics by forming suitably weighted sums—a technique also called harmonic summation. Let $H\in\mathbb{N}$ be the number of harmonics to be considered in the summation. Then, given a spectrogram representation $\mathcal{Y}$, we define a harmonic-sum spectrogram $\tilde{\mathcal{Y}}$ by setting

\begin{equation} \tilde{\mathcal{Y}}(n,k) := \sum_{h=1}^{H} \alpha^{h-1} \cdot \mathcal{Y}(n,k\cdot h) \end{equation}

for $n,k\in\mathbb{Z}$ (assuming that $\mathcal{Y}$ is suitably zero-padded in frequency direction). In the summation, harmonics may be weighted exponentially using the weighting parameter $\alpha \in [0, 1]$.

In [6]:
@jit(nopython=True)
def harmonic_summation(Y, num_harm=10, alpha=1.0):
    """Harmonic summation for spectrogram [FMP, Eq. (8.54)]

    Notebook: C8/C8S2_SalienceRepresentation.ipynb

    Args:
        Y (np.ndarray): Magnitude spectrogram
        num_harm (int): Number of harmonics (Default value = 10)
        alpha (float): Weighting parameter (Default value = 1.0)

    Returns:
        Y_HS (np.ndarray): Spectrogram after harmonic summation
    """
    Y_HS = np.zeros(Y.shape)
    Y_zero_pad = np.vstack((Y, np.zeros((Y.shape[0]*num_harm, Y.shape[1]))))
    K = Y.shape[0]
    for k in range(K):
        harm_idx = np.arange(1, num_harm+1)*(k)
        weights = alpha ** (np.arange(1, num_harm+1) - 1).reshape(-1, 1)
        Y_HS[k, :] = (Y_zero_pad[harm_idx, :] * weights).sum(axis=0)
    return Y_HS

Y_HS = harmonic_summation(Y, num_harm=10)

fig, ax = plt.subplots(1, 2, gridspec_kw={'width_ratios': [2, 1]}, figsize=figsize)        

libfmp.b.plot_matrix(Y_HS, Fs=Fs/H, Fs_F=N/Fs, ax=[ax[0]], title='Spectrogram after harmonic summation', colorbar=True, cmap=cmap);
ax[0].set_ylim([0, 5000])
libfmp.b.plot_matrix(Y_HS, Fs=Fs/H, Fs_F=N/Fs, ax=[ax[1]], title='', colorbar=True, cmap=cmap);
ax[1].set_ylim(ylim_zoom)
ax[1].set_xlim(xlim_zoom)

plt.tight_layout()