Following Section 8.2 and Section 8.2.3.3 of [Müller, FMP, Springer 2015], we cover in this notebook two related problems referred to as melody extraction and melody separation. The overall approach was suggested by Salamon and Gómez:
When asked to describe a specific song, we are often able to sing or hum the main melody. In general terms, a melody may be defined as a linear succession of musical tones expressing a particular musical idea. Because of the special arrangement of tones, a melody is perceived as a coherent entity, which gets stuck in a listener's head as the most memorable element of a song. As the original Greek term melōidía (meaning "singing" or "chanting") implies, a melody is often performed by a human voice. Oftentimes, the melody constitutes the leading element in a composition, appearing in the foreground, while the accompaniment is in the background. Sometimes melody and accompaniment may even be played on a single instrument such as a guitar or a piano. In any case, the melody typically stands out in one way or another. For example, the melody often comprises the higher notes in a musical composition, while the accompaniment consists of the lower notes. Or the melody is played by some instrument with a characteristic timbre . In some performances, the notes of a melody may feature easily discernible time–frequency patterns such as vibrato, tremolo, or glissando (i.e., a continuous glide from one pitch to another). As with many other concepts in music processing, the notion of melody remains rather vague. In music processing, one often considers a restricted scenario. In the FMP notebook on F0 tracking, for example, we make the following simplifying assumptions:
Based on these assumptions, the melody extraction problem can be regarded as the following signal processing task: Given a recording, the objective is to automatically estimate the sequence of predominant F0-values that correspond to the notes played by the lead voice or instrument. This is exactly the problem we discussed in the FMP notebook on F0 tracking. The basic procedure can be summarized as follows:
The melody extraction problem is challenging even in the restricted case that the simplifying assumptions are valid.
Continuing our Freischütz example, the following figure shows a spectrogram representation as well as the F0-trajectory of the melody performed by a soprano singer. Note the following:
The singer is not always active. This is also indicated by the activity pattern plotted below the spectrogram.
At some time points, the spectral bin with the highest magnitude may belong to the accompaniment. This is one reason for errors in melody estimation.
Some of the higher harmonics may contain more energy than the fundamental frequency, which may lead to octave errors in melody estimation.
import numpy as np
import os, sys, librosa
from scipy import signal
from scipy import ndimage
from scipy.interpolate import interp1d
import matplotlib
from matplotlib import pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, BoundaryNorm
from matplotlib import gridspec
import IPython.display as ipd
from numba import jit
import pandas as pd
from collections import OrderedDict
sys.path.append('..')
import libfmp.b
import libfmp.c3
import libfmp.c8
%matplotlib inline
def plot_STFT_F0_activity(Y, traj, figsize=(7,4), ylim = [0, 4000], cmap='gray_r'):
fig, ax = plt.subplots(2, 2, gridspec_kw={'width_ratios': [10, 0.2],
'height_ratios': [10, 1]}, figsize= figsize)
fig_im, ax_im, im = libfmp.b.plot_matrix(Y, Fs=Fs/H, Fs_F=N/Fs,
ax=[ax[0,0], ax[0,1]], title='', cmap=cmap);
ax[0,0].set_ylim(ylim)
traj_plot = traj[traj[:, 1] > 0, :]
ax[0,0].plot(traj_plot[:, 0], traj_plot[:, 1], color='r', markersize=4, marker='.', linestyle='');
# F0 activity
activity = np.array(traj[:, 1] > 0).reshape(1, -1)
im = ax[1,0].imshow(activity, aspect='auto', cmap='bwr', clim=[-1, 1])
ax[1,0].set_xticks([])
ax[1,0].set_yticks([])
ax[1,1].axis('off')
plt.tight_layout()
return fig, ax
# Load audio
fn_wav = os.path.join('..', 'data', 'C8', 'FMP_C8_F10_Weber_Freischuetz-06_FreiDi-35-40.wav')
Fs = 22050
x, Fs = librosa.load(fn_wav, sr=Fs)
ipd.Audio(x, rate=Fs)
# Computation of STFT
N = 2048
H = N//4
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, pad_mode='constant')
Y = np.log(1 + np.abs(X))
Fs_feature = Fs/H
T_coef = np.arange(X.shape[1]) / Fs_feature
freq_res = Fs / N
F_coef = np.arange(X.shape[0]) * freq_res
#F_coef = libfmp.c3.F_coef(np.arange(Y.shape[0]), Fs, N)
# Load F0 trajectory and and resample to fit to X
fn_traj = os.path.join('..', 'data', 'C8', 'FMP_C8_F10_Weber_Freischuetz-06_FreiDi-35-40_F0-user-Book.csv')
traj_df = libfmp.b.read_csv(fn_traj)
traj = traj_df.values
Fs_traj = 1/(traj[1,0]-traj[0,0])
traj_X_values = interp1d(traj[:,0], traj[:,1], kind='nearest', fill_value='extrapolate')(T_coef)
traj_X = np.hstack((T_coef[:,None], traj_X_values[:,None]))
# Visualization
figsize=(7,4)
ylim = [0, 4000]
fig, ax = plot_STFT_F0_activity(Y, traj_X, figsize=figsize, ylim=ylim)
plt.show()
# Sonification
soni_mono = libfmp.c8.sonify_trajectory_with_sinusoid(traj_X, len(x), Fs)
soni_stereo = np.vstack((x.reshape(1,-1), soni_mono.reshape(1,-1))) # left: x, right: sonification
ipd.Audio(soni_stereo, rate=Fs)
In our simplified scenario, the objective of melody extraction is to estimate the sequence of F0-values that correspond to the main melody. A related task is referred to as melody separation, which aims at decomposing a music signal into a melody component that captures the main melodic voice and an accompaniment component that captures the remaining acoustic events. One application of melody separation is the automated generation of a Karaoke version for a given song, where the main melodic voice is to be removed from the song's original recording. More technically speaking, the objective of melody separation is to decompose a given signal $x$ into a melody component $x^\mathrm{Mel}$ and an accompaniment component $x^\mathrm{Acc}$ such that
\begin{equation} x = x^\mathrm{Mel} + x^\mathrm{Acc}. \end{equation}One general approach is to first apply a melody extraction algorithm to derive the F0-trajectory of the main melody. Based on these F0-values and their harmonics, one can construct a binary mask for the melodic component, while the binary mask of the accompaniment is defined to be the complement. Using masking techniques as described in the FMP notebook on HPS (see Section 8.1.1.2 of [Müller, FMP, Springer 2015]), one then derives the two masked STFTs $\mathcal{X}^\mathrm{Mel}$ and $\mathcal{X}^\mathrm{Acc}$. From this, one obtains the signals $x^\mathrm{Mel}$ and $x^\mathrm{Acc}$ by applying signal reconstruction techniques (see Section 8.1.2 of [Müller, FMP, Springer 2015]). We now implement this melody separation procedure and apply it to the Freischütz example.
There are many ways to define a binary mask for a given F0-trajectory. Taking harmonics into account, we provide in the following code cell two variants.
tol_bin
(given in frequency bins), which is used to define a frequency neighborhood of fixed size around each frequency bin.tol_cents
(given in cents), which is used to define a neighborhood of frequency-dependent size around each frequency bin. More precisely, the neighborhood size increases linearly with the center frequency (so that it remains constant when measured in cents).In the following figure, we show the binary masks for the melody (as well as their complements for the accompaniment) for both variants.
def convert_trajectory_to_mask_bin(traj, F_coef, n_harmonics=1, tol_bin=0):
"""Computes binary mask from F0 trajectory
Notebook: C8/C8S2_MelodyExtractSep.ipynb
Args:
traj (np.ndarray): F0 trajectory (time in seconds in 1st column, frequency in Hz in 2nd column)
F_coef (np.ndarray): Frequency axis
n_harmonics (int): Number of harmonics (Default value = 1)
tol_bin (int): Tolerance in frequency bins (Default value = 0)
Returns: