FMP AudioLabs
C8

Melody Extraction and Separation


Following Section 8.2 and Section 8.2.3.3 of [Müller, FMP, Springer 2015], we cover in this notebook two related problems referred to as melody extraction and melody separation. The overall approach was suggested by Salamon and Gómez:

  • Justin Salamon and Emilia Gómez: Melody Extraction from Polyphonic Music Signals using Pitch Contour Characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20 (2012), pp. 1759–1770.
    Bibtex

Melody

When asked to describe a specific song, we are often able to sing or hum the main melody. In general terms, a melody may be defined as a linear succession of musical tones expressing a particular musical idea. Because of the special arrangement of tones, a melody is perceived as a coherent entity, which gets stuck in a listener's head as the most memorable element of a song. As the original Greek term melōidía (meaning "singing" or "chanting") implies, a melody is often performed by a human voice. Oftentimes, the melody constitutes the leading element in a composition, appearing in the foreground, while the accompaniment is in the background. Sometimes melody and accompaniment may even be played on a single instrument such as a guitar or a piano. In any case, the melody typically stands out in one way or another. For example, the melody often comprises the higher notes in a musical composition, while the accompaniment consists of the lower notes. Or the melody is played by some instrument with a characteristic timbre . In some performances, the notes of a melody may feature easily discernible time–frequency patterns such as vibrato, tremolo, or glissando (i.e., a continuous glide from one pitch to another). As with many other concepts in music processing, the notion of melody remains rather vague. In music processing, one often considers a restricted scenario. In the FMP notebook on F0 tracking, for example, we make the following simplifying assumptions:

  • First, we consider the scenario where the music is given in the form of an audio recording (and not as a symbolic music representation).
  • Second, rather than estimating a sequence of notes, we represent the melody by a sequence of F0-values that correspond to the notes' pitches.
  • Third, we restrict ourselves to music where the melody is predominant (e.g., being performed by a lead singer or a lead instrument).
  • Forth, we assume that there is only one melody line at a time, which can be associated to a single sound source.

Melody Extraction

Based on these assumptions, the melody extraction problem can be regarded as the following signal processing task: Given a recording, the objective is to automatically estimate the sequence of predominant F0-values that correspond to the notes played by the lead voice or instrument. This is exactly the problem we discussed in the FMP notebook on F0 tracking. The basic procedure can be summarized as follows:

  • First, compute a salience representation.
  • Then, derive the trajectory of predominant F0-values using suitable smoothness or score-informed constraints.
  • Also, as suggested by Salamon and Gómez, one may introduce other musical constraints, e.g., based on pitch contours.

The melody extraction problem is challenging even in the restricted case that the simplifying assumptions are valid.

  • In music with many instruments playing simultaneously, it is hard to attribute specific time–frequency patterns to notes of individual instruments.
  • Also, the presence of resonance and reverberation effects further increases the overlap of different sound sources.
  • Furthermore, even after a successful estimation of fundamental frequencies, one still has to determine which of the F0-values belong to the predominant melody and which are part of the accompaniment.

Continuing our Freischütz example, the following figure shows a spectrogram representation as well as the F0-trajectory of the melody performed by a soprano singer. Note the following:

  • The singer is not always active. This is also indicated by the activity pattern plotted below the spectrogram.

  • At some time points, the spectral bin with the highest magnitude may belong to the accompaniment. This is one reason for errors in melody estimation.

  • Some of the higher harmonics may contain more energy than the fundamental frequency, which may lead to octave errors in melody estimation.

In [1]:
import numpy as np
import os, sys, librosa
from scipy import signal
from scipy import ndimage
from scipy.interpolate import interp1d
import matplotlib
from matplotlib import pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, BoundaryNorm
from matplotlib import gridspec
import IPython.display as ipd
from numba import jit
import pandas as pd
from collections import OrderedDict

sys.path.append('..')
import LibFMP.B
import LibFMP.C3
import LibFMP.C8

%matplotlib inline

def plot_STFT_F0_activity(Y, traj, figsize=(7,4), ylim = [0, 4000], cmap='gray_r'):
    fig, ax = plt.subplots(2, 2, gridspec_kw={'width_ratios': [10, 0.2],
                                        'height_ratios': [10, 1]}, figsize= figsize)    
    
    fig_im, ax_im, im = LibFMP.B.plot_matrix(Y, Fs=Fs/H, Fs_F=N/Fs, 
                                            ax=[ax[0,0], ax[0,1]], title='', cmap=cmap);
    ax[0,0].set_ylim(ylim)
    traj_plot = traj[traj[:, 1] > 0, :]
    ax[0,0].plot(traj_plot[:, 0], traj_plot[:, 1], color='r', markersize=4, marker='.', linestyle='');
        
    # F0 activity
    activity = np.array(traj[:, 1] > 0).reshape(1, -1)
    im = ax[1,0].imshow(activity, aspect='auto', cmap='bwr', clim=[-1, 1])
    ax[1,0].set_xticks([])
    ax[1,0].set_yticks([])
    ax[1,1].axis('off')
    plt.tight_layout()
    return fig, ax

# Load audio
fn_wav = os.path.join('..', 'data', 'C8', 'FMP_C8_F10_Weber_Freischuetz-06_FreiDi-35-40.wav')
Fs = 22050
x, Fs = librosa.load(fn_wav, sr=Fs)
ipd.Audio(x, rate=Fs)

# Computation of STFT
N = 2048
H = N//4
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, pad_mode='constant')
Y = np.log(1 + np.abs(X))
Fs_feature = Fs/H
T_coef = np.arange(X.shape[1]) / Fs_feature
freq_res = Fs / N
F_coef = np.arange(X.shape[0]) * freq_res
#F_coef = LibFMP.C3.F_coef(np.arange(Y.shape[0]), Fs, N)

# Load F0 trajectory and and resample to fit to X
fn_traj = os.path.join('..', 'data', 'C8', 'FMP_C8_F10_Weber_Freischuetz-06_FreiDi-35-40_F0-user-Book.csv')
traj_df = LibFMP.B.read_csv(fn_traj)
traj = traj_df.values
Fs_traj = 1/(traj[1,0]-traj[0,0])
traj_X_values = interp1d(traj[:,0], traj[:,1], kind='nearest', fill_value='extrapolate')(T_coef)
traj_X = np.hstack((T_coef[:,None], traj_X_values[:,None]))

# Visualization
figsize=(7,4)
ylim = [0, 4000]
fig, ax = plot_STFT_F0_activity(Y, traj_X, figsize=figsize, ylim=ylim)
plt.show()

# Sonification
soni_mono = LibFMP.C8.sonify_trajectory_with_sinusoid(traj_X, len(x), Fs)
soni_stereo = np.vstack((x.reshape(1,-1), soni_mono.reshape(1,-1)))  # left: x , right: sonification
ipd.Audio(soni_stereo, rate=Fs)
Out[1]:

Melody Separation: Overall Idea

In our simplified scenario, the objective of melody extraction is to estimate the sequence of F0-values that correspond to the main melody. A related task is referred to as melody separation, which aims at decomposing a music signal into a melody component that captures the main melodic voice and an accompaniment component that captures the remaining acoustic events. One application of melody separation is the automated generation of a Karaoke version for a given song, where the main melodic voice is to be removed from the song's original recording. More technically speaking, the objective of melody separation is to decompose a given signal $x$ into a melody component $x^\mathrm{Mel}$ and an accompaniment component $x^\mathrm{Acc}$ such that

\begin{equation} x = x^\mathrm{Mel} + x^\mathrm{Acc}. \end{equation}

One general approach is to first apply a melody extraction algorithm to derive the F0-trajectory of the main melody. Based on these F0-values and their harmonics, one can construct a binary mask for the melodic component, while the binary mask of the accompaniment is defined to be the complement. Using masking techniques as described in the FMP notebook on HPS (see Section 8.1.1.2 of [Müller, FMP, Springer 2015]), one then derives the two masked STFTs $\mathcal{X}^\mathrm{Mel}$ and $\mathcal{X}^\mathrm{Acc}$. From this, one obtains the signals $x^\mathrm{Mel}$ and $x^\mathrm{Acc}$ by applying signal reconstruction techniques (see Section 8.1.2 of [Müller, FMP, Springer 2015]). We now implement this melody separation procedure and apply it to the Freischütz example.

Binary Mask

There are many ways to define a binary mask for a given F0-trajectory. Taking harmonics into account, we provide in the following code cell two variants.

  • In the first variant, one can specify a tolerance parameters tol_bin (given in frequency bins), which is used to define a frequency neighborhood of fixed size around each frequency bin.
  • In the second variant, one can specify a tolerance parameters tol_cents (given in cents), which is used to define a neighborhood of frequency-dependent size around each frequency bin. More precisely, the neighborhood size increases linearly with the center frequency (so that it remains constant when measured in cents).

In the following figure, we show the binary masks for the melody (as well as their complements for the accompaniment) for both variants.

In [2]:
def convert_trajectory_to_mask_bin(traj, F_coef, n_harmonics=1, tol_bin=0):
    """Computes binary mask from F0 trajectory
    
    Notebook: C8/C8S2_MelodyExtractSep.ipynb
    
    Args:
        traj: F0 trajectory (time in seconds in 1st column, frequency in Hz in 2nd column)
        F_coef: Frequency axis
        n_harmonics: Number of harmonics
        tol_bin: Tolerance in frequency bins
        
    Returns:
        mask: Binary mask
    """ 
    # Compute STFT bin for trajectory 
    traj_bin = np.argmin(np.abs(F_coef[:, None] - traj[:, 1][None, :]), axis=0)

    K = len(F_coef)
    N = traj.shape[0]
    max_idx_harm =