{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ " International Audio Laboratories Erlangen \n", "\n", " \n", "\n", "\n", "# Short-Time Fourier Transform and Chroma Features\n", "\n", "Authors: Meinard Müller, Stefan Balke, Frank Zalkow\n", "\n", "References:
\n", "[Mueller2015] Meinard Müller. Fundamentals of Music Processing. Springer Verlag, 2015." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Abstract" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Fourier transform, which is used to convert a time-dependent signal\n", "to a frequency-dependent signal, is one of the most important mathematical tools\n", "in audio signal processing. Applying the Fourier transform to local sections\n", "of an audio signal, one obtains the short-time Fourier transform (STFT).\n", "In this lab course, we study a discrete version of the STFT.\n", "To work with the discrete STFT in practice, one needs to correctly\n", "interpret the discrete time and frequency parameters.\n", "Using Python, we compute a discrete STFT and visualize its magnitude in form of a\n", "spectrogram representation. Then, we derive from the STFT various audio features\n", "that are useful for analyzing music signals.\n", "In particular, we develop a log-frequency spectrogram, where the\n", "frequency axis is converted into an axis corresponding to musical pitches.\n", "From this, we derive a chroma representation, which is a useful tool\n", "for capturing harmonic information of music." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Audio signals can be complex mixtures consisting of a multitude of different sound components.\n", "A first step in better understanding a given signal is to decompose it into building\n", "blocks that are more accessible for the subsequent processing steps.\n", "In the case that these building blocks consist of complex-valued sinusoidal\n", "functions, such a process is also called Fourier analysis.\n", "The Fourier transform maps a time-dependent signal\n", "to a frequency-dependent function which reveals the spectrum of\n", "frequency components that compose the original signal.\n", "Loosely speaking, a signal and its Fourier transform are two sides\n", "of the same coin.\n", "On the one side, the signal displays the time information and hides the\n", "information about frequencies.\n", "On the other side, the Fourier transform reveals information about\n", "frequencies and hides the time information.\n", "\n", "To recover the hidden time information, Dennis Gabor introduced\n", "in 1946 a modified Fourier transform, now known as the\n", "*short-time Fourier transform* or simply STFT.\n", "This transform is a compromise between a time- and a frequency-based\n", "representation by determining the sinusoidal magnitude\n", "and phase content of local sections of a signal as it changes over time.\n", "In this way, the STFT does not only tell which frequencies\n", "are \"contained\" in the signal but also at which points in time or,\n", "to be more precise, in which time intervals these frequencies appear.\n", "\n", " \n", "\n", "The figure shows various representations for a piano recording of the chromatic scale ranging from A0 ($p=21$) to C8 ($p=108$).\n", "**(a)** Piano keys representing the chromatic scale.\n", "**(b)** Spectrogram representation.\n", "**(c)** Pitch-based log-frequency spectrogram.\n", "**(d)** 
Chromagram representation.\n", "For visualization purposes the values are color-coded using a logarithmic scale.\n", "The C3 ($p=48$) played at time $t=30~{\\mathrm{sec}}$ has been highlighted\n", "by the rectangular frames.\n", "\n", "The main objective of this lab course is to acquire a good understanding\n", "of the STFT. To this end, we study a discrete version of the STFT\n", "using the discrete Fourier transform (DFT), which can be efficiently\n", "computed using the fast Fourier transform (FFT).\n", "The discrete STFT yields a discrete set of Fourier coefficients\n", "that are indexed by time and frequency parameters.\n", "The correct physical interpretation of these parameters\n", "in terms of units such as seconds and Hertz\n", "depends on the sampling rate, the window size, and the\n", "hop size used in the STFT computation.\n", "In this lab course, we will compute a discrete STFT using Python\n", "and then visualize its magnitude by a spectrogram representation,\n", "see [the STFT-section](#STFT).\n", "By applying the STFT to different audio examples and\n", "by modifying the various parameters, one should get a\n", "better understanding on how the STFT works in practice.\n", "\n", "To make music data comparable and algorithmically accessible,\n", "the first step in basically all music processing tasks is to extract\n", "suitable *features* that capture relevant aspects\n", "while suppressing irrelevant details.\n", "In the second part of this lab course, we study audio features and\n", "mid-level representations that are particularly useful for\n", "capturing pitch information of music signals.\n", "Assuming that we are dealing with music that is based on the equal-tempered scale\n", "(the scale that corresponds to the keys of a piano keyboard),\n", "we will convert an audio recording into a feature representation\n", "that reveals the distribution of the signal's energy across the different\n", "pitches, see [the 
Log-Frequency-Spectrogram-section](#LFS).\n", "Technically, these features are obtained from a spectrogram by converting\n", "the linear frequency axis (measured in Hertz) into a logarithmic axis\n", "(measured in pitches).\n", "From this log-frequency spectrogram, we then derive a time-chroma representation\n", "by suitably combining pitch bands that correspond to the same chroma,\n", "see [the Chroma-Features-section](#Chroma).\n", "The resulting chroma features show a high degree of robustness to\n", "variations in timbre and instrumentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STFT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Fourier transform and in particular the discrete STFT serve as\n", "*front-end transform*, the first computing step,\n", "for deriving a large number of different musically relevant audio features.\n", "We now recall the definition of the discrete STFT while fixing some notation.\n", "Let $x:[0:L-1]:=\\{0,1,\\ldots,L-1\\}\\to{\\mathbb R}$ be a real-valued discrete-time signal of length $L$ obtained by\n", "equidistant sampling with respect to a fixed sampling rate $F_\\mathrm{s}$ given in Hertz ($\\mathrm{Hz}$).\n", "Furthermore, let $w:[0:N-1]:=\\{0,1,\\ldots,N-1\\}\\to{\\mathbb R}$ be a discrete-time window\n", "of length $N\\in{\\mathbb N}$ (usually a power of two) and let $H\\in{\\mathbb N}$ be a hop size parameter.\n", "With regards to these parameters, the discrete STFT ${\\mathcal X}$ of the signal $x$ is given by\n", "\n", "\\begin{eqnarray}\n", " {\\mathcal X}(m,k):= \\sum_{n=0}^{N-1} x(n+mH)w(n)\\exp(-2\\pi ikn/N)\n", "\\end{eqnarray}\n", "\n", "with $m\\in[0:\\lfloor \\frac{L-N}{H} \\rfloor]$ and $k\\in[0:K]$. 
The complex number ${\\mathcal X}(m,k)$ denotes\n", "the $k^{\\mathrm{th}}$ Fourier coefficient for the $m^{\\mathrm{th}}$ time frame,\n", "where $K=N/2$ is the frequency index corresponding to the Nyquist frequency.\n", "Each Fourier coefficient ${\\mathcal X}(m,k)$ is associated with the physical time position\n", "(using the start position of the window as reference point)\n", "\n", "\\begin{equation}\n", " {T_{\\mathrm{coef}}(m)} := \\frac{m\\cdot H}{F_\\mathrm{s}}\n", " \\end{equation}\n", " \n", " given in seconds (${\\mathrm{sec}}$) and with the physical frequency\n", " \n", " \\begin{equation}\n", " F_{\\mathrm{coef}}(k) := \\frac{k\\cdot F_\\mathrm{s}}{N}\n", "\\end{equation}\n", "\n", "given in Hertz ($\\mathrm{Hz}$).\n", "For example, using $F_\\mathrm{s}=44100~\\mathrm{Hz}$ as for a CD recording,\n", "a window length of $N=4096$, and a hop size of $H=N/2$,\n", "we obtain a time resolution of $H/F_\\mathrm{s}\\approx 46.4~\\mathrm{ms}$\n", "and frequency resolution of $F_\\mathrm{s}/N\\approx 10.8~\\mathrm{Hz}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
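As a minimal sketch of the definitions above (not part of the original lab material), the discrete STFT can be evaluated directly from the sum using only the Python standard library. The names stft_naive, frame_time and bin_frequency, the Hann window formula, and the synthetic 1000 Hz test tone are assumptions chosen for illustration; in practice one would use an FFT-based routine instead of this slow double loop.

```python
import cmath
import math


def hann(N):
    # Periodic Hann window of length N (an assumed window choice)
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]


def stft_naive(x, w, H):
    # Direct evaluation of X(m, k) = sum_n x(n + m*H) * w(n) * exp(-2*pi*i*k*n/N)
    # for m in [0 : floor((L - N) / H)] and k in [0 : N/2]
    L, N = len(x), len(w)
    K = N // 2
    X = []
    for m in range((L - N) // H + 1):
        frame = [x[n + m * H] * w[n] for n in range(N)]
        X.append([
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(K + 1)
        ])
    return X


def frame_time(m, H, Fs):
    # Physical time position (seconds) of frame m, window start as reference
    return m * H / Fs


def bin_frequency(k, N, Fs):
    # Physical frequency (Hertz) of frequency index k
    return k * Fs / N


# Synthetic test signal: 1000 Hz sinusoid sampled at 8000 Hz (assumed values)
Fs, N, H = 8000, 64, 32
x = [math.sin(2 * math.pi * 1000 * n / Fs) for n in range(256)]
X = stft_naive(x, hann(N), H)

# Strongest frequency index in the first frame
k_max = max(range(N // 2 + 1), key=lambda k: abs(X[0][k]))
print(len(X), k_max, bin_frequency(k_max, N, Fs), frame_time(1, H, Fs))
```

With the parameters from the text, frame_time(1, 2048, 44100) is about 0.0464 seconds and bin_frequency(1, 4096, 44100) is about 10.77 Hz, matching the resolutions stated above.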
\n", "Homework Exercise 1
\n", "Compute the time and frequency resolution of the resulting STFT when using the following parameters of $F_\\mathrm{s}$, $N$ and $H$. What are the Nyquist frequencies?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# write the functions T_coef and F_coef...\n", "\n", "Fs, N, H = 22050, 1024, 512\n", "print('Fs = %5d, N = %d, H = %4d: Tcoef = %6.2f msec, Fcoef = %5.2f Hz, Nyquist = %.2f Hz' % (Fs, N, H, T_coef(1, H, Fs)*1000, F_coef(1, N, Fs), Fs/2))\n", "\n", "Fs, N, H = 48000, 1024, 256\n", "print('Fs = %5d, N = %d, H = %4d: Tcoef = %6.2f msec, Fcoef = %5.2f Hz, Nyquist = %.2f Hz' % (Fs, N, H, T_coef(1, H, Fs)*1000, F_coef(1, N, Fs), Fs/2))\n", "\n", "Fs, N, H = 4000, 4096, 1024\n", "print('Fs = %5d, N = %d, H = %4d: Tcoef = %6.2f msec, Fcoef = %5.2f Hz, Nyquist = %.2f Hz' % (Fs, N, H, T_coef(1, H, Fs)*1000, F_coef(1, N, Fs), Fs/2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Using $F_\\mathrm{s}=44100$, $N=2048$ and $H=1024$, what is the physical meaning of the Fourier coefficients\n", " ${\\mathcal X}(1000,1000)$, ${\\mathcal X}(17,0)$, and ${\\mathcal X}(56,1024)$?\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# write the function ex1_2...\n", "\n", "Fs, N, H = 44100, 2048, 1024\n", "\n", "m, k = 1000, 1000\n", "ex1_2(Fs, N, H, k, m)\n", "\n", "m, k = 17, 0\n", "ex1_2(Fs, N, H, k, m)\n", "\n", "m, k = 56, 1024\n", "ex1_2(Fs, N, H, k, m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The STFT is often visualized by means of a *spectrogram*,\n", "which is a two-dimensional representation of the squared magnitude:\n", "\n", "\\begin{equation}\n", " {\\mathcal Y}(m,k) = |{\\mathcal X}(m,k)|^2.\n", "\\end{equation}\n", "\n", "When generating an image of a spectrogram, the horizontal axis represents time,\n", "the vertical axis is frequency, and the dimension indicating the spectrogram value\n", "of a particular frequency at a particular time is represented by the intensity or\n", "color in the image." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Lab Experiment 1
\n", "Use the function sf.read to read the file Sound_TwoSineTwoImpulse.wav.\n", " This defines a signal $x$ as well as the sampling rate $F_\\mathrm{s}$.\n", " In the case that the signal is stereo, only use the first channel.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import soundfile as sf\n", "from IPython.display import Audio\n", "\n", "# your code here...\n", " \n", "Audio(x, rate=Fs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Initialize a length parameter $N=4096$ and a hop size parameter $H=2048$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Define a Hann window function $w$ of length $N$ (using scipy.signal.get_window).\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import signal\n", "\n", "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute ${\\mathcal X}$ using the function librosa.stft.\n", " The resulting matrix contains the complex-valued Fourier coefficients ${\\mathcal X}(m,k)$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import librosa\n", "\n", "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute the spectrogram ${\\mathcal Y}(m,k)$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute the vector t containing the physical time positions (in seconds) of the time indices.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here, compute t..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute the vector f containing the frequency values (in Hertz) of the frequency indices.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here, compute f..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Visualize the spectrogram in various ways\n", " with the axes given in the form of indices.\n", " Use an appropriate figure size with the figsize keyword of plt.figure.\n", " For visualizing $\\mathcal{Y}$, use the function plt.imshow.\n", " Explore its parameters\n", " aspect, origin, cmap.\n", " Furthermore, use the functions plt.colorbar(), plt.xlabel() and plt.ylabel().\n", " In doing so, also get familiar with the various visualization parameters\n", " and tools offered by Python.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "%matplotlib inline\n", "\n", "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Plot the spectrogram with the axes given in seconds and Hertz.\n", " This should be done with the extent keyword, using t and f.\n", " Furthermore, visualize only the lowest 2 kHz by using plt.ylim.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Next, use a logarithmic decibel-scale for visualizing the values ${\\mathcal Y}(m,k)$. (Recall that, given a value\n", " $v \\in {\\mathbb R}$, the decibel value is $10 \\log_{10}(v)$.)\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute spectrograms using different window sizes\n", " (for example, $N\\in\\{256,1024,4096,8192\\}$) and\n", " different hop sizes (for example, $H\\in\\{1,N/4,N/2\\}$).\n", " Visualize only the lowest 2 kHz.\n", " Discuss the trade-off between time resolution and frequency resolution.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Try out other audio files.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The human sensation of the intensity of a sound is logarithmic in nature.\n", "In practice, sounds that have an extremely small intensity may still be\n", "relevant for human listeners.\n", "Therefore, one often uses a decibel scale, which is a logarithmic unit expressing\n", "the ratio between two values.\n", "As an alternative to using a decibel scale, one often applies in audio processing\n", "a step also referred to as *logarithmic compression*,\n", "which works as follows. Let $\\gamma\\in{\\mathbb R}_{>0}$ be a positive constant\n", "and $\\Gamma_\\gamma:{\\mathbb R}_{>0} \\to {\\mathbb R}_{>0}$ a function defined by\n", "\n", "\\begin{equation}\n", " \\Gamma_\\gamma(v):=\\log(1+ \\gamma \\cdot v)\n", "\\end{equation}\n", "\n", "for $v\\in{\\mathbb R}_{>0}$, where we use the natural logarithm.\n", "Note that the function $\\Gamma_\\gamma$ yields a positive\n", "value $\\Gamma_\\gamma(v)$ for any positive value $v\\in{\\mathbb R}_{>0}$.\n", "Now, for a representation with positive values such as a spectrogram,\n", "one obtains a compressed version by applying the function $\\Gamma_\\gamma$\n", "to each of the values:\n", "\n", "\\begin{equation}\n", " (\\Gamma_\\gamma\\circ {\\mathcal Y})(m,k):=\\log(1+ \\gamma \\cdot {\\mathcal Y}(m,k)).\n", "\\end{equation}\n", "\n", "Why is this operation called *compression* and what is the role of\n", "the constant $\\gamma$? The problem with representations such as a spectrogram\n", "is that their values possess a large dynamic range. 
As a result,\n", "small, but still relevant values may be dominated by large values.\n", "Therefore, the idea of compression is to balance out this discrepancy\n", "by reducing the difference between large and small values with\n", "the effect of enhancing the small values.\n", "This is exactly what the function $\\Gamma_\\gamma$ does, where the degree of compression\n", "can be adjusted by the constant $\\gamma$. The larger $\\gamma$,\n", "the larger the resulting compression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
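A small numerical sketch may help to see the effect of the constant. The helper name log_compress and the two test values are assumptions made for illustration; the function simply evaluates log(1 + gamma * v) with the natural logarithm, as in the definition of logarithmic compression.

```python
import math


def log_compress(v, gamma):
    # Gamma_gamma(v) = log(1 + gamma * v), natural logarithm
    return math.log(1.0 + gamma * v)


# Two assumed spectrogram values that differ by a factor of 10000
small, large = 0.01, 100.0

for gamma in [1, 10, 100]:
    # Ratio between the compressed large value and the compressed small value
    ratio = log_compress(large, gamma) / log_compress(small, gamma)
    print(gamma, round(ratio, 1))
```

Before compression the two values differ by a factor of 10000. After compression the ratio shrinks to roughly 464, 72, and 13 for the three constants, so a larger constant brings small and large values closer together.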
\n", "Homework Exercise 2
\n", "Plot the function $\\Gamma_\\gamma$ for the parameters $\\gamma\\in\\{1,10,100\\}$.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Lab Experiment 2
\n", "Use the file Tone_C4_Piano.wav to define a signal $x$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute the STFT and the spectrogram ${\\mathcal Y}$ as above using a Hann window\n", " of size $N=4096$ and a hop size $H=2048$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute the compressed version $\\Gamma_\\gamma\\circ {\\mathcal Y}$ of the spectrogram using different constants $\\gamma\\in\\{1,10,100,1000,10000\\}$.\n", "Visualize the original spectrogram and its compressed versions.\n", "What do you see? Discuss the results.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Try out other audio files.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Log-Frequency Spectrogram" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now derive some audio features from the STFT by converting the\n", "frequency axis (given in Hertz) into an axis that corresponds to musical pitches.\n", "In Western music, the *equal-tempered scale* is most often used,\n", "where the pitches of the scale correspond to the keys of a piano keyboard.\n", "In this scale, each octave (the interval between two tones whose fundamental frequencies differ by a factor of two) is split up into twelve logarithmically spaced units.\n", "In MIDI notation, one considers $128$ pitches, which are serially\n", "numbered starting with $0$ and ending with $127$.\n", "The MIDI pitch $p=69$ corresponds to the pitch $\\mathrm{A4}$\n", "(having a center frequency of $440~\\mathrm{Hz}$), which is often used as the standard\n", "for tuning musical instruments.\n", "In general, the center frequency $F_{\\mathrm{pitch}}(p)$ of a pitch $p\\in[0:127]$ is\n", "given by the formula\n", "\n", "\\begin{equation}\n", "F_{\\mathrm{pitch}}(p) = 2^{(p-69)/12} \\cdot 440.\n", "\\end{equation}\n", "\n", "The logarithmic perception of frequency motivates the use of a time-frequency\n", "representation with a logarithmic frequency axis labeled by the pitches of\n", "the equal-tempered scale.\n", "To derive such a representation from a given spectrogram representation,\n", "the basic idea is to assign each spectral coefficient ${\\mathcal X}(m,k)$ to the pitch\n", "whose center frequency is closest to the frequency $F_{\\mathrm{coef}}(k)$.\n", "More precisely, we define for each pitch $p\\in[0:127]$ the set\n", "\n", "\\begin{equation}\n", " P(p) := \\{k\\in[0:K]:F_{\\mathrm{pitch}}(p - 0.5) \\leq F_{\\mathrm{coef}}(k) < F_{\\mathrm{pitch}}(p + 0.5)\\}.\n", 
"\\end{equation}\n", "\n", "From this, we obtain a log-frequency spectrogram\n", "${\\mathcal Y}_\\mathrm{LF}:{\\mathbb Z}\\times [0:127]\\to{\\mathbb R}_{\\geq 0}$ defined by\n", "\n", "\\begin{equation}\n", " {\\mathcal Y}_\\mathrm{LF}(m,p) := \\sum_{k \\in P(p)}{|{\\mathcal X}(m,k)|^2}.\n", "\\end{equation}\n", "\n", "By this definition, the frequency axis is partitioned logarithmically and\n", "labeled linearly according to MIDI pitches." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Homework Exercise 3
\n", "Compute the center frequencies $F_{\\mathrm{pitch}}(p)$ for $p=68$, $p=69$, and $p=70$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here...\n", "\n", "print('Fpitch(%d) = %.2f Hz' % (68, F_pitch(68)))\n", "print('Fpitch(%d) = %.2f Hz' % (69, F_pitch(69)))\n", "print('Fpitch(%d) = %.2f Hz' % (70, F_pitch(70)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute the cutoff frequencies $F_{\\mathrm{pitch}}(p - 0.5)$ and $F_{\\mathrm{pitch}}(p + 0.5)$\n", " of the frequency band corresponding to pitch $p=69$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Using $F_\\mathrm{s}=22050$ and $N=4096$, determine the set $P(p) \\subseteq [0:K]$ for $p=69$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here...\n", " \n", "Fs, N = 22050, 4096\n", "print('P(%d) = %s' % (69, P(69, Fs, N)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Also compute $P(p)$ for $p=57$, $p=45$, and $p=33$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Explain in your own words why the definition of\n", " ${\\mathcal Y}_\\mathrm{LF}(m,p)$ may be\n", " problematic for small values of the pitch $p$. How is the size of the\n", " set $P(p)$ influenced? Support your answer with a brief example.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Lab Experiment 3
\n", "Use the file Scale_Cmajor_Piano.wav to define a signal $x$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here...\n", "\n", "Audio(x, rate=Fs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute the STFT and the spectrogram as above using a Hann window\n", " of size $N=4096$ and a hop size $H=2048$.\n", " In the following you also need the information contained in the frequency vector F.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Compute the log-frequency spectrogram ${\\mathcal Y}_\\mathrm{LF}$.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Visualize the log-frequency spectrogram with the axes given in seconds and\n", " MIDI pitch, respectively.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Use log-compression to enhance\n", " the visualization.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Play around with different parameter settings for $N$ and $H$.\n", " Also, try out some other audio files.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chroma-Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The human perception of pitch is periodic in the sense that two pitches are\n", "perceived as similar in *color* (playing a similar harmonic role) if they\n", "differ by one or several octaves (where, in our scale, an octave is defined as the distance of $12$ pitches).\n", "For example, the pitches $p=60$ and $p=72$ are one octave apart, and the pitches\n", "$p=57$ and $p=81$ are two octaves apart.\n", "A pitch can be separated into two components,\n", "which are referred to as *tone height* and *chroma*.\n", "The tone height refers to the octave number\n", "and the chroma to the respective pitch spelling attribute.\n", "In Western music notation, the $12$ pitch attributes are given by the\n", "set $\\{\\mathrm{C},\\mathrm{C}^{\\sharp},\\mathrm{D},\\ldots,\\mathrm{B}\\}$.\n", "Enumerating the chroma values, we identify this set with $[0:11]$\n", "where $c=0$ refers to chroma $\\mathrm{C}$, $c=1$ to $\\mathrm{C}^{\\sharp}$, and so on.\n", "A *pitch class* is defined as the set of all pitches that\n", "share the same chroma. 
For example, the pitch class that corresponds\n", "to the chroma $c=0$ ($\\mathrm{C}$) consists of the set\n", "$\\{0,12,24,36,48,60,72,84,96,108,120\\}$\n", "(which are the musical notes\n", "$\\{\\ldots,\\,\\mathrm{C}\\mathrm{0},\\mathrm{C}\\mathrm{1},\\mathrm{C}\\mathrm{2},\\mathrm{C}\\mathrm{3},\\ldots\\}$).\n", "\n", "The main idea of *chroma features* is to aggregate all\n", "spectral information that relates to a given pitch class into a single coefficient.\n", "Given a pitch-based log-frequency spectrogram\n", "${\\mathcal Y}_\\mathrm{LF}:{\\mathbb Z}\\times[0:127]\\to {\\mathbb R}_{\\geq 0}$,\n", "a chroma representation or *chromagram*\n", "${\\mathcal C}:{\\mathbb Z}\\times[0:11]\\to {\\mathbb R}_{\\geq 0}$ can be derived\n", "by summing up all pitch coefficients that belong to the same chroma:\n", "\n", "\\begin{equation}\n", " {\\mathcal C}(m,c) := \\sum_{\\{p \\in [0:127]\\,|\\,p\\,\\mathrm{mod}\\,12 = c\\}}{{\\mathcal Y}_\\mathrm{LF}(m,p)}\n", "\\end{equation}\n", "for $c\\in[0:11]$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
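As a minimal sketch of this summation (with the hypothetical helper name chroma_frame and a toy input, neither of which is part of the lab material), the 128 pitch bands of one frame can be folded into 12 chroma bands as follows.

```python
def chroma_frame(Y_LF_frame):
    # C(m, c) = sum of Y_LF(m, p) over all p in [0 : 127] with p mod 12 == c
    C = [0.0] * 12
    for p, value in enumerate(Y_LF_frame):
        C[p % 12] += value
    return C


# Toy log-frequency frame: energy at C4 (p = 60), C5 (p = 72) and A4 (p = 69)
Y_LF_frame = [0.0] * 128
Y_LF_frame[60] = 1.0
Y_LF_frame[72] = 0.5
Y_LF_frame[69] = 2.0

C = chroma_frame(Y_LF_frame)
print(C[0], C[9])
```

The two C notes, which are one octave apart, end up in the same chroma band c = 0, while the A lands in c = 9. The total energy of the frame is preserved by the folding.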
\n", "Lab Experiment 4
\n", "Derive the chroma representation ${\\mathcal C}$ from the log-frequency spectrogram\n", " as computed in the last exercise.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Visualize the chroma representation with the axes given in seconds and\n", " chroma indices, respectively. Try to explain what you see in the chroma visualization.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Also play around with different parameter settings for $N$ and $H$\n", " and try out some other audio files.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Acknowledgment:** The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS. \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "   " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }