FMP AudioLabs
C8

NMF-Based Audio Decomposition


Following Section 8.3.3 of [Müller, FMP, Springer 2015], we describe a score-informed NMF procedure for decomposing an audio signal into note-based sound events. This procedure was originally described by Ewert and Müller. As shown by Driedger et al., such a decomposition can be used for audio editing applications.

  • Jonathan Driedger, Harald Grohganz, Thomas Prätzlich, Sebastian Ewert, Meinard Müller: Score-Informed Audio Decomposition and Applications. Proceedings of the ACM International Conference on Multimedia, Barcelona, Spain, 541–544, 2013.
    Bibtex Video
  • Sebastian Ewert, Bryan Pardo, Meinard Müller, Mark Plumbley: Score-Informed Source Separation for Musical Audio Recordings. IEEE Signal Processing Magazine, 31(3): 116–124, 2014.
    Bibtex

Introduction

In the FMP notebook on NMF-based spectrogram factorization, we used score annotations to constrain the template matrix (using pitches) and the activation matrix (using note start positions and note durations). This resulted in a decomposition of the magnitude spectrum $V$ into musically meaningful template matrix $W$ and activation matrix $H$. In particular, the notewise activation constraints yield a mapping between nonzero coefficients in $H$ and the score's note events. We now describe how this spectrogram decomposition can be employed to separate audio components that correspond to specific note groups. These note groups, which may encode a melody line, a certain motif, or the accompaniment, may be specified by means of a suitable labeling of the score representation.

As an illustrative scenario, we continue with the Chopin example, the beginning of the Prélude Op. 28, No. 4. Our goal is to decompose the recording into two components, where one component corresponds to the notes of the lower staff (the accompaniment played by the left hand of the pianist, indicated by the red box) and the other to the notes of the upper staff (the melody played by the right hand, indicated by yellow box).

FMP_C8_F26a


Score-Informed Spectrogram Factorization

As input to the NMF algorithm, we use a (logarithmically compressed) magnitude spectrogram. Furthermore, we use an onset model with score-informed template and activation constraints as described in Section 8.3.2.3 of [Müller, FMP, Springer 2015]. In the following code cell, we show the resulting score-informed spectrogram factorization.

In [1]:
import os
import sys
import scipy
import numpy as np
import pandas as pd
import librosa
import matplotlib.pyplot as plt
import IPython.display as ipd
sys.path.append('..')
import LibFMP.B
import LibFMP.C1
import LibFMP.C8
%matplotlib inline

Fs = 22050
fn_wav = os.path.join('..', 'data', 'C8', 'FMP_C8_Audio_Chopin_Op028-04_SMD_beginning.wav')
#fn_wav = os.path.join('..', 'data', 'C8', 'FMP_C8_Audio_Chopin_Op028-04_SMD.wav')

fn_ann = os.path.join('..', 'data', 'C8', 'FMP_C8_Audio_Chopin_Op028-04_SMD_beginning.csv')
#fn_ann = os.path.join('..', 'data', 'C8', 'FMP_C8_Audio_Chopin_Op028-04_SMD.csv')
annotation = LibFMP.C1.csv_to_list(fn_ann)

# Input Spectrogram
x, Fs = librosa.load(fn_wav, sr=Fs)
N_fft = 2048
H_fft = 1024
X = librosa.stft(x, n_fft=N_fft, hop_length=H_fft)
V = np.log(1 + np.abs(X))
#V = np.abs(X)
freq_res = Fs / N_fft
freq_max = 2000
frame_res = H_fft/Fs

#LibFMP.B.plot_matrix(V, Fs=Fs/H_fft, Fs_F=N_fft/Fs)
#plt.ylim([0, freq_max]);

# NMF decomposition using onset model
K = V.shape[0]
N = V.shape[1]
pitch_set = LibFMP.C8.pitch_from_annotation(annotation)
R = pitch_set.shape[0]

H_init, pitch_set, label_pitch = LibFMP.C8.init_NMF_activation_score_onset(N, annotation, frame_res, 
                                                    tol_note=[0.1, 0.5], tol_onset=[0.2, 0.1])
W_init = LibFMP.C8.init_NMF_template_pitch_onset(K, pitch_set, freq_res, tol_pitch=0.05)

W, H, V_approx, V_approx_err, H_W_error = LibFMP.C8.NMF(V, 2*R, W=W_init, H=H_init, L=100, norm=True)
#LibFMP.C8.plot_NMF_factors(W_init, H_init, W_init.dot(H_init), Fs, N_fft, H_fft, freq_max, label_pitch=label_pitch)
LibFMP.C8.plot_NMF_factors(W, H, W.dot(H), Fs, N_fft, H_fft, freq_max, label_pitch=label_pitch, 
                           title_W=r'$W$', title_H=r'$H$', title_V=r'$WH$')

Score-Informed Split of Activation Matrix

Using the mapping between the activation constraints in $H$ and the score's notes, we can split up $H$ into two matrices $H^\mathrm{L}$ and $H^\mathrm{R}$, which contain the activations for the left and for the right hand, respectively. Multiplying these two matrices with the template matrix $W$, we obtain two matrices $WH^\mathrm{L}$ and $WH^\mathrm{R}$. Intuitively, these two matrices can be regarded as the estimated magnitude STFTs for the desired components for the left and right hand, respectively. In the following code cell, we provide an implementation of this score-informed split of the activation matrix.

In [2]:
def split_annotation_lh_rh(ann):
    """Splitting of the annotation data in left and right hand
    
    Notebook: C8/C8S3_NMFAudioDecomp.ipynb
    
    Args: 
        ann: Annotation data
    
    Returns: 
        ann_lh, ann_rh: Annotation data for left and right hand
    """    
    ann = annotation
    ann_lh = []
    ann_rh = []
    for a in ann:
        if a[4] == 'lh':
            ann_lh = ann_lh + [a]
        if a[4] == 'rh':
            ann_rh = ann_rh + [a]    
    return ann_lh, ann_rh

annotation_lh, annotation_rh = split_annotation_lh_rh(annotation)

H_init_L, pitch_set_L, label_pitch_L = LibFMP.C8.init_NMF_activation_score_onset(N, annotation_lh, frame_res, 
                                            tol_note=[0.1, 0.5], tol_onset=[0.2, 0.1], pitch_set=pitch_set)
H_init_R, pitch_set_R, label_pitch_R = LibFMP.C8.init_NMF_activation_score_onset(N, annotation_rh, frame_res, 
                                            tol_note=[0.1, 0.5], tol_onset=[0.2, 0.1], pitch_set=pitch_set)

H_L = H * H_init_L
H_R = H * H_init_R

LibFMP.C8.plot_NMF_factors(W, H_R, W.dot(H_R), Fs, N_fft, H_fft, freq_max, label_pitch=label_pitch, 
                           title_W=r'$W$', title_H=r'$H^R$', title_V=r'$WH^R$')
LibFMP.C8.plot_NMF_factors(W, H_L, W.dot(H_L), Fs, N_fft, H_fft, freq_max, label_pitch=label_pitch, 
                           title_W=r'$W$', title_H=r'$H^L$', title_V=r'$WH^L$')

Spectral Masking

To obtain time-domain signals, a first idea is to simply use the phase information of the original STFT $\mathcal{X}$ and to invert the resulting modified STFTs by applying the signal reconstruction method as described in Section 8.1.2.2 of [Müller, FMP, Springer 2015]. However, NMF-based models typically yield only a rough approximation of the original magnitude spectrogram (or even a logarithmically compressed version thereof), where spectral nuances or the actual magnitudes may not be captured well. Therefore, the audio components reconstructed in this way may contain a number of audible artifacts. Some of these artifacts may be removed or attenuated by using masking techniques as discussed in Section 8.1.1.2 of [Müller, FMP, Springer 2015]. Instead of directly using $WH^\mathrm{L}$ and $WH^\mathrm{R}$ for the reconstruction, the idea is to use these matrices to first define the two soft masks

\begin{eqnarray} M^\mathrm{L} &:=& (W H^\mathrm{L})\oslash (W H + \varepsilon), \\ M^\mathrm{R} &:=& (W H^\mathrm{R})\oslash (W H + \varepsilon), \end{eqnarray}

where the operator $\oslash$ denotes pointwise division and the small positive value $\varepsilon>0$ is added to avoid division by zero. The two soft masks can then be applied to the original STFT $\mathcal{X}$ (as described in the FMP notebook on harmonic‐percussive separation). Employing an inverse STFT, one obtains the two desired time-domain component signals for the left and the right hand notes, respectively. By using the masking-based approach, many of the spectral details of the original recording are preserved, even if they are not directly captured by the factors of the NMF decomposition. This often yields more acoustically appealing results. On the downside, by filtering the original audio data, masking may also retain more nontarget spectral components compared with a direct reconstruction from $WH^\mathrm{L}$ and $WH^\mathrm{R}$.

In [3]:
eps_machine = np.finfo(np.float32).eps
M_L = W.dot(H_L) / (W.dot(H) + eps_machine)
X_L = X * M_L
x_L = librosa.istft(X_L, hop_length=H_fft, win_length=N_fft, length=x.size)

M_R = W.dot(H_R) / (W.dot(H) + eps_machine)
X_R = X * M_R
x_R = librosa.istft(X_R, hop_length=H_fft, win_length=N_fft, length=x.size)

plt.figure(figsize=(9,3))
ax = plt.subplot(1,2,1)
LibFMP.B.plot_matrix(M_L, Fs=Fs/H_fft, Fs_F=N_fft/Fs, ax=[ax], title=r'$M^L$')
ax.set_ylim([0, freq_max]);

ax = plt.subplot(1,2,2)
LibFMP.B.plot_matrix(M_R, Fs=Fs/H_fft, Fs_F=N_fft/Fs, ax=[ax], title=r'$M^R$')
ax.set_ylim([0, freq_max]);
plt.tight_layout()
plt.show()

print('Original signal:')
ipd.display( ipd.Audio(x, rate=Fs) )
print('Reconstructed signal for left hand:')
ipd.display( ipd.Audio(x_L, rate=Fs) )
print('Reconstructed signal for right hand:')
ipd.display( ipd.Audio(x_R, rate=Fs) )
Original signal: