{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chord Recognition Evaluation\n", "\n", "Following Section 5.2.2 of [Müller, FMP, Springer 2015], we introduce in this notebook strategies to evaluate the performance of chord recognition algorithms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "In order to evaluate the quality of a chord recognition procedure, a general approach is to compare the computed result against a **reference annotation**. However, such an evaluation often gives rise to various questions:\n", "\n", "* How should the agreement between the computed result and the reference annotation be quantified?\n", "* Is the reference annotation well defined? Is it reliable?\n", "* Are the model assumptions in the formalization of the chord recognition task appropriate?\n", "* To what extent do violations of these assumptions influence the final result?\n", "\n", "There are no definite answers to these questions, and evaluation results need to be interpreted with care. Still, quantitative evaluations are useful indicators. They generally reflect the overall performance of automated procedures and give valuable insights into the characteristics of the underlying music data.\n", "\n", "In this notebook, we introduce a simple evaluation approach. To this end, we assume that there exists a **reference annotation**, which is also often called **ground truth**. Typically, such a reference annotation is generated by music experts, often based on a score representation. In general, however, no well-defined ground truth exists, and even music experts may disagree on how to annotate a given piece of music. Furthermore, the annotations may depend on the employed temporal granularity (e.g., note, beat, measure, or phrase level). Finally, one needs to adapt the manual annotations to make them comparable with the computed results. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accuracy, Precision, Recall, F-Measure\n", "\n", "Keeping these issues in mind, we now introduce a simple evaluation measure. Recall that, given a **chroma sequence** $X=(x_1,x_2,\\ldots,x_N)$ and a set\n", "\n", "$$\n", " \\Lambda := \\{\\mathbf{C},\\mathbf{C}^\\sharp,\\ldots,\\mathbf{B},\\mathbf{Cm},\\mathbf{C}^\\sharp\\mathbf{m},\\ldots,\\mathbf{Bm}\\}\n", "$$\n", "\n", "of possible **chord labels**, the output of our [template-based chord recognition](../C5/C5S2_ChordRec_Templates.html) procedure consists of a chord label $\\lambda_{n}\\in\\Lambda$ for each frame $n\\in[1:N]$. Furthermore, we assume that we are given a reference annotation that is also specified in a frame-wise fashion, with $\\lambda^\\mathrm{Ref}_{n}\\in\\Lambda$ for $n\\in[1:N]$.\n", "\n", "However, we do not want to assume that every frame needs to be annotated. For example, if the recording starts with silence or ends with applause, there is no meaningful chord annotation for these sections, and the corresponding frames should be left unconsidered in the evaluation. To model this, we extend our label set by an additional symbol $\\mathbf{N}$, which stands for the **non-chord label**. Thus, our new label set has $25$ elements and is defined as:\n", "\n", "\\begin{equation}\n", " \\Lambda':= \\Lambda \\cup \\{\\mathbf{N}\\}\n", "\\end{equation}\n", "\n", "In the following, the reference annotation is given by labels $\\lambda^\\mathrm{Ref}_{n}\\in\\Lambda'$ for $n\\in[1:N]$. Furthermore, we assume $\\lambda_{n}\\in\\Lambda'$ for $n\\in[1:N]$, i.e., our chord recognizer may also output the non-chord label (e.g., in the case that all similarity values between a chroma frame and the templates are below a certain threshold). 
The evaluation can then be performed frame-wise by comparing the computed labels $\\lambda_{n}$ with the reference labels $\\lambda^\\mathrm{Ref}_{n}$. For a given frame $n\\in[1:N]$, we say that a label predicted by the chord recognition approach is **correct** if $\\lambda_{n}=\\lambda^\\mathrm{Ref}_{n}$; otherwise, it is considered **incorrect**. The **accuracy** $\\mathrm{A}$ is then defined as the proportion of correctly predicted labels:\n", "\n", "$$\n", " \\mathrm{A} = \\frac{\\big|\\{n\\in[1:N]: \\lambda_{n}=\\lambda^\\mathrm{Ref}_{n} \\} \\big|}{N}.\n", "$$\n", "\n", "In the context of [music structure analysis](../C4/C4.html), we introduced in the [FMP notebook on evaluation](../C4/C4S5_Evaluation.html) evaluation measures that are used in general **information retrieval**. As an alternative to our accuracy measure, we now adapt these notions to our chord recognition scenario. First, we define the set of items to be $\\mathcal{I}=[1:N]\\times \\Lambda$. In particular, the non-chord label $\\mathbf{N}$ is left unconsidered. Then,\n", "\n", "$$\n", "\\mathcal{I}^\\mathrm{Ref}_+:=\\big\\{(n,\\lambda^\\mathrm{Ref}_{n})\\in\\mathcal{I} : n\\in[1:N]\\big\\}\n", "$$\n", "\n", "are the **positive** (or **relevant**) items and\n", "\n", "$$\n", "\\mathcal{I}^\\mathrm{Est}_+:=\\big\\{(n,\\lambda_{n})\\in\\mathcal{I} : n\\in[1:N]\\big\\}\n", "$$\n", "\n", "are the items **estimated as positive** (or **retrieved** items). With these notions, an item $(n,\\lambda_{n})$ is called a **true positive** (TP) if the label is correct (i.e., $\\lambda_{n}=\\lambda^\\mathrm{Ref}_{n}$). Otherwise, $(n,\\lambda_{n})$ is called a **false positive** (FP) and $(n,\\lambda^\\mathrm{Ref}_{n})$ a **false negative** (FN). All other items in $\\mathcal{I}$ are called **true negatives** (TN). Based on these notions, one can define the standard **precision** (P), **recall** (R), and **F-measure** (F):\n", "\n", "$$\n", " \\mathrm{P} = \\frac{\\#\\mathrm{TP}}{\\#\\mathrm{TP}+\\#\\mathrm{FP}}, \\quad\n", " \\mathrm{R} = \\frac{\\#\\mathrm{TP}}{\\#\\mathrm{TP}+\\#\\mathrm{FN}}, \\quad\n", " \\mathrm{F} = \\frac{2\\cdot \\mathrm{P}\\cdot \\mathrm{R}}{\\mathrm{P} + \\mathrm{R}}.\n", "$$\n", "\n", "For further explanations, we refer to the [FMP notebook on evaluation](../C4/C4S5_Evaluation.html). In the way we have formulated our chord recognition problem so far, we have exactly one label per frame in the reference as well as in the estimation. From this, it follows that $\\#\\mathrm{FP}=\\#\\mathrm{FN}$ and that the definition of accuracy coincides with precision, recall, and F-measure. However, when modeling the problem in a different way (e.g., allowing for frames that are not annotated), this does not hold any longer. We will discuss such a case later when considering 'non-chord' labels. 
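" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make these definitions concrete, the following code cell gives a minimal sketch of a frame-wise evaluation. The label sequences `lambda_est` and `lambda_ref` are made-up toy examples (not derived from any recording), and the function `evaluate_chord_labels` is only an illustration: accuracy is computed over all $N$ frames, whereas items carrying the non-chord label are left unconsidered when counting TP, FP, and FN, following the item set $\\mathcal{I}=[1:N]\\times \\Lambda$.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extended label set Lambda' (12 major triads, 12 minor triads, non-chord label 'N')\n", "chroma_labels = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']\n", "chord_labels_ext = chroma_labels + [s + 'm' for s in chroma_labels] + ['N']\n", "print('Number of labels in extended label set:', len(chord_labels_ext))\n", "\n", "# Made-up frame-wise label sequences (N = 10 frames), used only for illustration\n", "lambda_est = ['N', 'C', 'C', 'G', 'G', 'Am', 'Am', 'F', 'F', 'C']\n", "lambda_ref = ['N', 'C', 'C', 'G', 'Em', 'Am', 'Am', 'F', 'C', 'C']\n", "\n", "def evaluate_chord_labels(lambda_est, lambda_ref, nonchord='N'):\n", "    # Accuracy: proportion of frames with identical labels (all N frames count)\n", "    N = len(lambda_ref)\n", "    assert len(lambda_est) == N, 'Label sequences must have the same length'\n", "    A = sum(e == r for e, r in zip(lambda_est, lambda_ref)) / N\n", "    # Items carrying the non-chord label are excluded from the item set I = [1:N] x Lambda\n", "    TP = sum(e == r and e != nonchord for e, r in zip(lambda_est, lambda_ref))\n", "    FP = sum(e != nonchord and e != r for e, r in zip(lambda_est, lambda_ref))\n", "    FN = sum(r != nonchord and e != r for e, r in zip(lambda_est, lambda_ref))\n", "    P = TP / (TP + FP) if (TP + FP) > 0 else 0.0\n", "    R = TP / (TP + FN) if (TP + FN) > 0 else 0.0\n", "    F = 2 * P * R / (P + R) if (P + R) > 0 else 0.0\n", "    return A, P, R, F\n", "\n", "A, P, R, F = evaluate_chord_labels(lambda_est, lambda_ref)\n", "print('A = %.3f, P = %.3f, R = %.3f, F = %.3f' % (A, P, R, F))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that in this toy example both sequences contain the non-chord label $\\mathbf{N}$; since such items are excluded from $\\mathcal{I}$, the resulting precision, recall, and F-measure may deviate slightly from the accuracy $\\mathrm{A}$.\n", "\n", "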
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Beatles Example\n", "\n", "To illustrate the definitions, we again use the first measures of the Beatles song \"Let It Be\" and apply the [template-based chord recognition](../C5/C5S2_ChordRec_Templates.html) procedure. The following figure shows a score representation along with two different chord annotations provided by music experts. The first annotation is specified on the half-measure level (every two quarter notes) and uses only the $24$ major and minor triads. The second annotation is not only specified on a finer temporal level, but also allows for more chord types, including [seventh chords](../C5/C5S1_Chords.html). In the following, we use the first, simpler annotation as the reference, which is temporally adjusted to match the music recording.\n", "\n", "
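" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, a reference annotation is typically specified as a list of labeled segments (given in seconds) rather than frame-wise. The following cell sketches how such a segment-level annotation may be converted into a frame-wise label sequence that can be compared with the recognizer's output. The segment boundaries, labels, and the feature rate `Fs_X` used below are made-up values for illustration; they do not correspond to the actual annotation of the recording.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: convert a segment-based annotation into frame-wise labels\n", "# (segment values and feature rate are made up for illustration)\n", "ann_segments = [(0.0, 1.6, 'N'), (1.6, 3.2, 'C'), (3.2, 4.8, 'G'),\n", "                (4.8, 6.4, 'Am'), (6.4, 8.0, 'F')]\n", "Fs_X = 2  # assumed feature (frame) rate in Hz\n", "N = int(ann_segments[-1][1] * Fs_X)  # number of frames covered by the annotation\n", "\n", "lambda_ref_frames = ['N'] * N\n", "for start, end, label in ann_segments:\n", "    for n in range(N):\n", "        # Assign the segment's label to all frames whose center lies in [start, end)\n", "        t_center = (n + 0.5) / Fs_X\n", "        if start <= t_center < end:\n", "            lambda_ref_frames[n] = label\n", "\n", "print(lambda_ref_frames)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "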