Log-Frequency Spectrogram and Chromagram

Following Section 3.1.1 of [Müller, FMP, Springer 2015], we introduce in this notebook two feature representations known as log-frequency spectrogram and chromagram.

STFT and Pitch Frequencies¶

Assuming that we are dealing with music whose pitches can be meaningfully categorized according to the equal-tempered scale, we show how an audio recording can be transformed into a feature representation that reveals the distribution of the signal's energy across the different pitches. Such features can be obtained from a spectrogram by converting the linear frequency axis (measured in Hertz) into a logarithmic axis (measured in pitches). The resulting representation is also called log-frequency spectrogram.

As starting point, we need the discrete STFT (see also Section 2.5.3 of [Müller, FMP, Springer 2015]). Let $x$ be a real-valued discrete signal obtained by equidistant sampling with respect to a fixed sampling rate $F_\mathrm{s}$ given in Hertz. Furthermore, let $\mathcal{X}$ be the discrete STFT with respect to a window $w$ of length $N\in\mathbb{N}$ and hop size $H\in\mathbb{N}$. By applying zero-padding, we can assume that the Fourier coefficients $\mathcal{X}(n,k)$ are indexed by frame parameters $n\in\mathbb{Z}$ and frequency parameters $k\in[0:K]$, where $K=N/2$ is the frequency index corresponding to the Nyquist frequency. Recall that each Fourier coefficient $\mathcal{X}(n,k)$ is associated with the physical time position $T_\mathrm{coef}(n) = nH/F_\mathrm{s}$ given in seconds and with the physical frequency

\begin{equation} F_\mathrm{coef}(k) = \frac{k \cdot F_\mathrm{s}}{N}. \end{equation}

The main idea of the log-frequency spectrogram is to redefine the frequency axis to correspond to the logarithmically spaced frequency distribution of the equal-tempered scale. Identifying pitches with MIDI note numbers (where the note A4 corresponds to MIDI note number $p=69$), the center frequencies are given by (see Section 1.3.2 of [Müller, FMP, Springer 2015]):

\begin{equation} F_\mathrm{pitch}(p) = 2^{(p-69)/12} \cdot 440. \end{equation}

As an illustration, we consider a chromatic scale played on a piano starting with the note A0 ($p=21$) and ending with C8 ($p=108$). The resulting spectrogram reveals the exponential dependency of the fundamental frequency on the pitches of the played notes. Also, the harmonics and the notes' onset positions (vertical structures) are clearly visible. In the following, the chromatic scale will serve as one of our running examples.

import os
import numpy as np
import scipy
import matplotlib
from matplotlib import pyplot as plt
from numba import jit
import librosa
import pandas as pd
import IPython.display as ipd

import sys
sys.path.append('..')
import libfmp.c2

%matplotlib inline

# Load wav
fn_wav = os.path.join('..', 'data', 'C3', 'FMP_C3_F03.wav')
Fs = 22050
x, Fs = librosa.load(fn_wav, sr=Fs)

# Compute Magnitude STFT
N = 4096
H = 1024
X, T_coef, F_coef = libfmp.c2.stft_convention_fmp(x, Fs, N, H)
Y = np.abs(X) ** 2

# Plot spectrogram
fig = plt.figure(figsize=(10, 4))
eps = np.finfo(float).eps
plt.imshow(10 * np.log10(eps + Y), origin='lower', aspect='auto', cmap='gray_r', 
           extent=[T_coef[0], T_coef[-1], F_coef[0], F_coef[-1]])
plt.clim([-30, 30])
plt.ylim([0, 4500])
plt.xlabel('Time (seconds)')
plt.ylabel('Frequency (Hz)')
cbar = plt.colorbar()
cbar.set_label('Magnitude (dB)')
plt.tight_layout()

# Plot rectangle corresponding to pitch C3 (p=48)
rect = matplotlib.patches.Rectangle((29.3, 0.5), 1.2, 4490, linewidth=3, 
                                    edgecolor='r', facecolor='none')
plt.gca().add_patch(rect)
plt.text(28, -400, r'$\mathrm{C3}$', color='r', fontsize='x-large');

Logarithmic Frequency Pooling¶

The logarithmic perception of frequency motivates the use of a time–frequency representation with a logarithmic frequency axis labeled by the pitches of the equal-tempered scale. To derive such a representation from a given spectrogram representation, the basic idea is to assign each spectral coefficient $\mathcal{X}(n,k)$ to the pitch with a center frequency that is closest to the frequency $F_\mathrm{coef}(k)$. More precisely, we define for each pitch $p\in[0:127]$ the set

\begin{equation} P(p) := \{k:F_\mathrm{pitch}(p-0.5) \leq F_\mathrm{coef}(k) < F_\mathrm{pitch}(p+0.5)\}. \end{equation}

The frequency range covered by the set $P(p)$ depends on the frequency in a logarithmic fashion. We define the bandwidth $\mathrm{BW}(p)$ of pitch $p$ by

\begin{equation} \mathrm{BW}(p):=F_\mathrm{pitch}(p+0.5)-F_\mathrm{pitch}(p-0.5). \end{equation}

The bandwidth $\mathrm{BW}(p)$ becomes smaller for decreasing pitches. In particular, it halves when decreasing the pitch by an octave. For example, for MIDI pitch $p=66$ one has a bandwidth of roughly $21.4~\mathrm{Hz}$, whereas for $p=54$ the bandwidth falls below $10.7~\mathrm{Hz}$. The following table shows various notes and their MIDI note number $p$, center frequency $F_\mathrm{pitch}(p)$, cutoff frequencies $F_\mathrm{pitch}(p-0.5)$ and $F_\mathrm{pitch}(p+0.5)$, and bandwidth $\mathrm{BW}(p)$.

def note_name(p):
    """Returns note name of pitch

    Notebook: C3/C3S1_SpecLogFreq-Chromagram.ipynb

    Args:
        p (int): Pitch value

    Returns:
        name (str): Note name
    """
    chroma = ['A', 'A$^\\sharp$', 'B', 'C', 'C$^\\sharp$', 'D', 'D$^\\sharp$', 'E', 'F', 'F$^\\sharp$', 'G',
              'G$^\\sharp$']
    name = chroma[(p - 69) % 12] + str(p // 12 - 1)
    return name

f_pitch = lambda p: 440 * 2 ** ((p - 69) / 12)
   
note_infos = []
for p in range(60, 73):
    name = note_name(p)
    p_pitch = f_pitch(p)
    p_pitch_lower = f_pitch(p - 0.5)
    p_pitch_upper = f_pitch(p + 0.5)
    bw = p_pitch_upper - p_pitch_lower
    note_infos.append([name, p, p_pitch, p_pitch_lower, p_pitch_upper, bw])

df = pd.DataFrame(note_infos, columns=['Note', '$p$', 
                                       '$F_\mathrm{pitch}(p)$', 
                                       '$F_\mathrm{pitch}(p-0.5)$', 
                                       '$F_\mathrm{pitch}(p+0.5)$', 
                                       '$\mathrm{BW}(p)$'])


html = df.to_html(index=False, float_format='%.2f')
html = html.replace('<table', '<table style="width: 80%"')
ipd.HTML(html)

Log-Frequency Spectrogram¶

Based on the sets $P(p)$, we obtain a log-frequency spectrogram $\mathcal{Y}_\mathrm{LF}:\mathbb{Z}\times [0:127]$ using a simple pooling procedure:

\begin{equation} \mathcal{Y}_\mathrm{LF}(n,p) := \sum_{k \in P(p)}{|\mathcal{X}(n,k)|^2}. \end{equation}

By this procedure, the frequency axis is partitioned logarithmically and labeled linearly according to MIDI pitches. The following code example shows the resulting log-frequency spectrogram, where the played notes of the chromatic scale now appear in a linearly increasing fashion.

@jit(nopython=True)
def f_pitch(p, pitch_ref=69, freq_ref=440.0):
    """Computes the center frequency/ies of a MIDI pitch

    Notebook: C3/C3S1_SpecLogFreq-Chromagram.ipynb

    Args:
        p (float): MIDI pitch value(s)
        pitch_ref (float): Reference pitch (default: 69)
        freq_ref (float): Frequency of reference pitch (default: 440.0)

    Returns:
        freqs (float): Frequency value(s)
    """
    return 2 ** ((p - pitch_ref) / 12) * freq_ref

@jit(nopython=True)
def pool_pitch(p, Fs, N, pitch_ref=69, freq_ref=440.0):
    """Computes the set of frequency indices that are assigned to a given pitch

    Notebook: C3/C3S1_SpecLogFreq-Chromagram.ipynb

    Args:
        p (float): MIDI pitch value
        Fs (scalar): Sampling rate
        N (int): Window size of Fourier fransform
        pitch_ref (float): Reference pitch (default: 69)
        freq_ref (float): Frequency of reference pitch (default: 440.0)

    Returns:
        k (np.ndarray): Set of frequency indices
    """
    lower = f_pitch(p - 0.5, pitch_ref, freq_ref)
    upper = f_pitch(p + 0.5, pitch_ref, freq_ref)
    k = np.arange(N // 2 + 1)
    k_freq = k * Fs / N  # F_coef(k, Fs, N)
    mask = np.logical_and(lower <= k_freq, k_freq < upper)
    return k[mask]

@jit(nopython=True)
def compute_spec_log_freq(Y, Fs, N):
    """Computes a log-frequency spectrogram

    Notebook: C3/C3S1_SpecLogFreq-Chromagram.ipynb

    Args:
        Y (np.ndarray): Magnitude or power spectrogram
        Fs (scalar): Sampling rate
        N (int): Window size of Fourier fransform

    Returns:
        Y_LF (np.ndarray): Log-frequency spectrogram
        F_coef_pitch (np.ndarray): Pitch values
    """
    Y_LF = np.zeros((128, Y.shape[1]))
    for p in range(128):
        k = pool_pitch(p, Fs, N)
        Y_LF[p, :] = Y[k, :].sum(axis=0)
    F_coef_pitch = np.arange(128)
    return Y_LF, F_coef_pitch

Y_LF, F_coef_pitch = compute_spec_log_freq(Y, Fs, N)        

fig = plt.figure(figsize=(10, 4))
plt.imshow(10 * np.log10(eps + Y_LF), origin='lower', aspect='auto', cmap='gray_r', 
           extent=[T_coef[0], T_coef[-1], 0, 127])
plt.clim([-10, 50])
plt.ylim([21, 108])
plt.xlabel('Time (seconds)')
plt.ylabel('Frequency (pitch)')
cbar = plt.colorbar()
cbar.set_label('Magnitude (dB)')

plt.tight_layout()

# Create a Rectangle patch
rect = matplotlib.patches.Rectangle((29.3, 21), 1.2, 86.5, linewidth=3, edgecolor='r', facecolor='none')
plt.gca().add_patch(rect)
plt.text(28, 15, r'$\mathrm{C3}$', color='r', fontsize='x-large');

Looking at the spectrogram visualization, one can make some interesting observations:

As a general trend, the sounds for higher notes possess a cleaner harmonic spectrum than the ones for lower notes. For lower notes, the signal's energy is often contained in the higher harmonics, while the listener may still have the perception of a low-pitched sound.
The vertical stripes (along the frequency axis) shown by the spectrogram indicate that some of the signal's energy is spread over large parts of the spectrum. The main reason for the energy spread is due to the inharmonicities of the piano sound caused by the keystroke (mechanical noise) as well as transient and resonance effects.
Furthermore, the frequency content of a sound depends on the microphone's frequency response. For example, the microphone may capture only frequencies above a certain threshold as in the case of our audio example. This also may explain why there is virtually no energy visible in the fundamental frequencies for the notes A0 ($p=21$) to B0 ($p=32$).

Besides acoustic properties, there is another reason for the rather poor representation of low pitches when using the pooling strategy based on a discrete STFT. While the discrete STFT introduces a linear sampling of the frequency axis, the bandwidth used in the pooling strategy depends on the frequency in a logarithmic fashion. As a result, the set $P(p)$ may contain only very few spectral coefficients or may even be empty for small values of $p$ (which is the reason for the horizontal white stripes in the figure above). This is also demonstrated by the following code example.

print('Sampling rate: Fs = ', Fs)
print('Window size: N = ', N)
print('STFT frequency resolution (in Hz): Fs/N = %4.2f' % (Fs / N))

for p in [76, 64, 52, 40, 39, 38]:
    print('Set P(%d) = %s' % (p, pool_pitch(p, Fs, N)))

Sampling rate: Fs =  22050
Window size: N =  4096
STFT frequency resolution (in Hz): Fs/N = 5.38
Set P(76) = [119 120 121 122 123 124 125 126]
Set P(64) = [60 61 62 63]
Set P(52) = [30 31]
Set P(40) = [15]
Set P(39) = []
Set P(38) = [14]

To resolve the issues of having an insufficient frequency resolution (in particular for low pitches), one may use a larger STFT window (at the cost of loosing time resolution). An alternative may be the usage of interpolation techniques or frequency refinement techniques based on instantaneous frequency estimation.

Chromagram¶

We now discuss a strategy to increase the robustness of the log-frequency spectrogram to variations in timbre and instrumentation. The main idea is to suitably combine pitch bands corresponding to pitches that differ by one or several octaves. As discussed in the FMP notebook on musical notes and pitches (Section 1.1.1 of [Müller, FMP, Springer 2015]), the human perception of pitch is periodic in the sense that two pitches are perceived as similar in "color" (playing a similar harmonic role) if they differ by an octave. Based on this observation, a pitch can be separated into two components, which are referred to as tone height and chroma. The tone height refers to the octave number and the chroma to the respective pitch spelling attribute contained in the set

$$ \{\mathrm{C},\mathrm{C}^\sharp,\mathrm{D},\mathrm{D}^\sharp,\ldots,\mathrm{B}\}. $$

Enumerating the chroma values, we identify this set with $[0:11]$ where $0$ refers to chroma $\mathrm{C}$, $1$ to $\mathrm{C}^\sharp$, and so on. A pitch class is defined as the set of all pitches that share the same chroma. For example, the pitch class corresponding to the chroma $\mathrm{C}$ is the set $\{\ldots,\,\mathrm{C0},\mathrm{C1},\mathrm{C2},\mathrm{C3},\ldots\}$ consisting of all pitches separated by an integer number of octaves. For simplicity, we use the terms chroma and pitch class interchangeably.

The main idea of chroma features is to aggregate all spectral information that relates to a given pitch class into a single coefficient. Given a pitch-based log-frequency spectrogram $\mathcal{Y}_\mathrm{LF}:\mathbb{Z}\times[0:127]\to \mathbb{R}_{\geq 0}$, a chroma representation or chromagram $\mathbb{Z}\times[0:11]\to \mathbb{R}_{\geq 0}$ can be derived by summing up all pitch coefficients that belong to the same chroma:

\begin{equation} \mathcal{C}(n,c) := \sum_{\{p \in [0:127]\,:\,p\,\mathrm{mod}\,12 = c\}}{\mathcal{Y}_\mathrm{LF}(n,p)} \end{equation}

for $c\in[0:11]$. Continuing our example, the following code example generates a chromagram of the chromatic scale, where the cyclic nature of chroma features becomes evident. Because of the octave equivalence, the increasing notes of the chromatic scale are "wrapped around" the chroma axis. As with the log-frequency spectrogram, the resulting chromagram of the considered audio example is rather noisy, in particular for the lower notes. Furthermore, because of the presence of higher harmonics, the energy is typically spread across various chroma bands even when playing a single note at a time. For example, playing the note C3, the third harmonic corresponds to G4 and the fifth harmonic to E5. Therefore, when playing the note C3 on the piano, not only the chroma band $\mathrm{C}$, but also the chroma bands $\mathrm{G}$ and $\mathrm{E}$ contain a substantial portion of the signal's energy.

@jit(nopython=True)
def compute_chromagram(Y_LF):
    """Computes a chromagram

    Notebook: C3/C3S1_SpecLogFreq-Chromagram.ipynb

    Args:
        Y_LF (np.ndarray): Log-frequency spectrogram

    Returns:
        C (np.ndarray): Chromagram
    """
    C = np.zeros((12, Y_LF.shape[1]))
    p = np.arange(128)
    for c in range(12):
        mask = (p % 12) == c
        C[c, :] = Y_LF[mask, :].sum(axis=0)
    return C

C = compute_chromagram(Y_LF)

fig = plt.figure(figsize=(10, 3))
chroma_label = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
plt.imshow(10 * np.log10(eps + C), origin='lower', aspect='auto', cmap='gray_r', 
           extent=[T_coef[0], T_coef[-1], 0, 12])
plt.clim([0, 60])
plt.xlabel('Time (seconds)')
plt.ylabel('Chroma')
cbar = plt.colorbar()
cbar.set_label('Magnitude (dB)')
plt.yticks(np.arange(12) + 0.5, chroma_label)
plt.tight_layout()

rect = matplotlib.patches.Rectangle((29.3, 0.0), 1.2, 12, linewidth=3, edgecolor='r', facecolor='none')
plt.gca().add_patch(rect)
plt.text(28.5, -1.2, r'$\mathrm{C3}$', color='r', fontsize='x-large');

Example: Burgmüller¶

As an illustrating example, we now consider the first four measures of Op. 100, No. 2 by Friedrich Burgmüller. In the lower staff of the score (left hand), one can see that the chord consisting of the three notes $\mathrm{A3}$ ($p=57$), $\mathrm{C4}$ ($p=60$), and $\mathrm{E4}$ ($p=64$) is played every quarter beat—altogether eight times over the first four measures. These chords are also clearly visible in the the log-frequency spectrogram and chromagram shown below. Furthermore, the patterns resulting from the two sixteenth-note phrases (upper staff of the score, right hand) are clearly revealed. Looking at the visualizations, it is important to note that inharmonicities and partials may result in substantial contributions in certain frequency and chroma bands not relating to the fundamental frequencies of the notes shown in the score. Another important issue is the imperfection of the Fourier analysis, also known as spectral leakage, which is the result of the frequency smearing introduced by the window function.

fn_wav = os.path.join('..', 'data', 'C3', 'FMP_C3_F05.wav')
Fs = 22050
x, Fs = librosa.load(fn_wav, Fs)
# ipd.display(ipd.Audio(x, rate=Fs))

N = 4096
H = 512
X, T_coef, F_coef = libfmp.c2.stft_convention_fmp(x, Fs, N, H)
eps = np.finfo(float).eps
Y = np.abs(X) ** 2
Y_LF, F_coef_pitch = compute_spec_log_freq(Y, Fs, N)
C = compute_chromagram(Y_LF)

fig = plt.figure(figsize=(8,3))
plt.imshow(10 * np.log10(eps + Y_LF), origin='lower', aspect='auto', cmap='gray_r', 
           extent=[T_coef[0], T_coef[-1], 0, 128])
plt.clim([-10, 50])
plt.ylim([55, 92])
plt.xlabel('Time (seconds)')
plt.ylabel('Frequency (pitch)')
cbar = plt.colorbar()
cbar.set_label('Magnitude (dB)')
plt.tight_layout()

fig = plt.figure(figsize=(8, 2.5))
plt.imshow(10 * np.log10(eps + C), origin='lower', aspect='auto', cmap='gray_r', 
           extent=[T_coef[0], T_coef[-1], 0, 12])
plt.clim([0, 50])
plt.xlabel('Time (seconds)')
plt.ylabel('Chroma')
cbar = plt.colorbar()
cbar.set_label('Magnitude (dB)')
plt.yticks(np.arange(12) + 0.5, chroma_label)
plt.tight_layout()

libfmp Implementation¶

The basic functions for computing a log-frequency spectrogram and a chromagram have been included into libfmp. In the following code cell, we call these libfmp-functions. Prior to using these functions, one needs to compute a spectrogram, see the FMP notebook on the STFT and the FMP notebook on STFT conventions. Furthermore, we use visualization functions introduced in the FMP notebook on Python visualization).

import libfmp.c3

fn_wav = os.path.join('..', 'data', 'C3', 'FMP_C3_F05.wav')
x, Fs = librosa.load(fn_wav)

N, H = 4096, 512
X, T_coef, F_coef = libfmp.c2.stft_convention_fmp(x, Fs, N, H)
Y = np.abs(X) ** 2
Y_LF, F_coef_pitch = libfmp.c3.compute_spec_log_freq(Y, Fs, N)
C = libfmp.c3.compute_chromagram(Y_LF)

fig, ax = plt.subplots(2, 2, gridspec_kw={'width_ratios': [1, 0.02], 
                                          'height_ratios': [3, 2]}, figsize=(8, 5))  

libfmp.b.plot_matrix(10 * np.log10(eps + Y_LF), Fs=Fs/H, ax=[ax[0,0], ax[0,1]], 
        ylim=[55,92], clim=[0, 50], title='Log-frequency spectrogram', 
        ylabel='Frequency (pitch)', colorbar=True, cbar_label='Magnitude (dB)');

libfmp.b.plot_chromagram(10 * np.log10(eps + C), Fs=Fs/H, ax=[ax[1,0], ax[1,1]],  
        chroma_yticks = [0,4,7,11], clim=[10, 50], title='Chromagram', 
        ylabel='Chroma', colorbar=True, cbar_label='Magnitude (dB)');

plt.tight_layout()

Further Notes¶

Obviously, there is music that is not based on the equal temperament and the tuning may deviate from the standard tuning. For details on how to compensate for tuning deviations, we refer to the FMP notebook on tuning and transposition.
There are many variants for computing log-frequency spectrograms and chromagrams. We will encounter many of these variants in the subsequent notebooks including:
LibROSA also offers various functionalities for computing and visualizing spectrograms, chromagrams, and other feature representations. In the following code cell, we provide an example the case of a chromagram.

import librosa, librosa.display
C = librosa.feature.chroma_stft(y=x, sr=Fs, tuning=0, norm=None, hop_length=H, n_fft=N)
plt.figure(figsize=(8, 2))
librosa.display.specshow(10 * np.log10(eps + C), x_axis='time', 
                         y_axis='chroma', sr=Fs, hop_length=H)
plt.colorbar();

Acknowledgment: This notebook was created by Frank Zalkow and Meinard Müller.

Note	$p$	$F_\mathrm{pitch}(p)$	$F_\mathrm{pitch}(p-0.5)$	$F_\mathrm{pitch}(p+0.5)$	$\mathrm{BW}(p)$
C4	60	261.63	254.18	269.29	15.11
C$^\sharp$4	61	277.18	269.29	285.30	16.01
D4	62	293.66	285.30	302.27	16.97
D$^\sharp$4	63	311.13	302.27	320.24	17.97
E4	64	329.63	320.24	339.29	19.04
F4	65	349.23	339.29	359.46	20.18
F$^\sharp$4	66	369.99	359.46	380.84	21.37
G4	67	392.00	380.84	403.48	22.65
G$^\sharp$4	68	415.30	403.48	427.47	23.99
A4	69	440.00	427.47	452.89	25.42
A$^\sharp$4	70	466.16	452.89	479.82	26.93
B4	71	493.88	479.82	508.36	28.53
C5	72	523.25	508.36	538.58	30.23