CTC-Based Learning of Chroma Features for Score–Audio Music Retrieval

This is the accompanying website for the following paper:

  1. Frank Zalkow and Meinard Müller
    CTC-Based Learning of Chroma Features for Score–Audio Music Retrieval
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 2957–2971, 2021.
    @article{ZalkowMueller21_ChromaCTC_TASLP,
    author      = {Frank Zalkow and Meinard M{\"u}ller},
    title       = {{CTC}-Based Learning of Chroma Features for Score--Audio Music Retrieval},
    journal     = {{IEEE}/{ACM} Transactions on Audio, Speech, and Language Processing},
    volume      = {29},
    pages       = {2957--2971},
    year        = {2021},
    doi         = {10.1109/TASLP.2021.3110137},
    url-details = {https://www.audiolabs-erlangen.de/resources/MIR/2021_TASLP-ctc-chroma},
    url-pdf     = {https://ieeexplore.ieee.org/document/9531521},
    }

Abstract

This paper deals with a score–audio music retrieval task where the aim is to find relevant audio recordings of Western classical music, given a short monophonic musical theme in symbolic notation as a query. Strategies for comparing score and audio data are often based on a common mid-level representation, such as chroma features, which capture melodic and harmonic properties. Recent studies demonstrated the effectiveness of deep neural networks that learn task-specific mid-level representations. Usually, such supervised learning approaches require score–audio pairs where individual note events of the score are aligned to the corresponding time positions of the audio excerpt. However, in practice, it is tedious to generate such strongly aligned training pairs. As one contribution of this paper, we show how to apply the Connectionist Temporal Classification (CTC) loss in the training procedure, which only uses weakly aligned training pairs. In such a pair, only the time positions of the beginning and end of a theme occurrence are annotated in an audio recording, rather than requiring local alignment annotations. We evaluate the resulting features in our theme retrieval scenario and show that they improve the state of the art for this task. As a main result, we demonstrate that with the CTC-based training procedure using weakly annotated data, we can achieve results almost as good as with strongly annotated data. Furthermore, we assess our chroma features in depth by inspecting their temporal smoothness or granularity as an important property and by analyzing the impact of different degrees of musical complexity on the features.
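
To make the training idea concrete, here is a minimal sketch of CTC-based training in PyTorch. It is an illustration under assumptions, not the architecture from the paper: the network (a small bidirectional GRU), the input dimensions, and the example theme are all placeholders. The essential point is that the training target is just the theme's pitch-class sequence, so no frame-level alignment is needed.

    import torch
    import torch.nn as nn

    # Hypothetical frame-wise network mapping spectrogram frames to 13 classes
    # (12 pitch classes plus one CTC blank). Architecture and sizes are
    # illustrative, not the model from the paper.
    class ChromaNet(nn.Module):
        def __init__(self, n_bins=216, n_classes=13):
            super().__init__()
            self.rnn = nn.GRU(n_bins, 128, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * 128, n_classes)

        def forward(self, x):                        # x: (batch, frames, n_bins)
            h, _ = self.rnn(x)
            return self.fc(h)                        # logits: (batch, frames, 13)

    model = ChromaNet()
    ctc_loss = nn.CTCLoss(blank=12)                  # class index 12 is the blank

    # One weakly aligned pair: audio frames of a theme occurrence, and the
    # theme's pitch-class sequence from the score (no local alignment given).
    audio_features = torch.randn(1, 500, 216)        # placeholder input frames
    pitch_classes = torch.tensor([[0, 4, 7, 4, 0]])  # e.g. C E G E C

    log_probs = model(audio_features).log_softmax(dim=-1)
    log_probs = log_probs.permute(1, 0, 2)           # CTCLoss expects (T, batch, C)
    loss = ctc_loss(log_probs, pitch_classes,
                    input_lengths=torch.tensor([500]),
                    target_lengths=torch.tensor([5]))
    loss.backward()

At inference time, the twelve pitch-class activations (with the blank discarded) can then be read as a learned chroma-like feature sequence.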

Material

To make the results of our paper transparent and accessible, we provide various tools and interfaces. First, we make all details of our retrieval results available on an interactive web interface. Second, we provide pre-trained models and code to apply them. Third, our training data is publicly accessible.
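
As a usage illustration, the sketch below shows the kind of matching step such features feed into: a query theme, converted to a binary chroma sequence, is compared against a recording's chroma sequence via subsequence DTW. It uses librosa's standard chroma features as a stand-in for the learned model; the file name and note values are placeholders, so this is a sketch rather than our released code.

    import numpy as np
    import librosa

    # Stand-in for a pre-trained model: librosa's baseline chroma features.
    def audio_to_chroma(path, hop_length=2048):
        y, sr = librosa.load(path, sr=22050)
        return librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)

    # Query: binary chroma sequence from the symbolic theme (one column per
    # note, a single 1 at the note's pitch class). Values are illustrative.
    query = np.zeros((12, 5))
    for i, pc in enumerate([0, 4, 7, 4, 0]):       # e.g. C E G E C
        query[pc, i] = 1.0

    db_chroma = audio_to_chroma('recording.wav')   # placeholder file name

    # Subsequence DTW: locate the best-matching excerpt of the recording.
    D, wp = librosa.sequence.dtw(X=query, Y=db_chroma,
                                 subseq=True, metric='cosine')
    cost = D[-1, :].min() / query.shape[1]         # length-normalized cost
    print(f'Matching cost: {cost:.3f}')

In a retrieval setting, the database recordings would be ranked by this matching cost, with lower costs indicating more relevant recordings.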

Video

Our journal article significantly expands on our previous ISMIR conference paper [6]. The following video was presented at ISMIR 2020 for that conference paper.


Acknowledgements

Frank Zalkow and Meinard Müller are supported by the German Research Foundation (DFG-MU 2686/11-1, MU 2686/12-1). We thank Christof Weiß for proof-reading the manuscript. The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS. The authors gratefully acknowledge the compute resources and support provided by the Erlangen Regional Computing Center (RRZE).

References

  1. Harold Barlow and Sam Morgenstern
    A Dictionary of Musical Themes
    Crown Publishers, Inc., 1975.
    @book{BarlowM75_MusicalThemes_BOOK,
    Author    = {Harold Barlow and Sam Morgenstern},
    Edition   = {Revised edition, third printing},
    Publisher = {Crown Publishers, Inc.},
    Title     = {A Dictionary of Musical Themes},
    Year      = {1975}
    }
  2. Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber
    Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
    In Proceedings of the 23rd International Conference on Machine Learning (ICML): 369–376, 2006.
    @inproceedings{GravesFGS06_CTCLoss_ICML,
    author    = {Alex Graves and Santiago Fern{\'{a}}ndez and Faustino J. Gomez and J{\"{u}}rgen Schmidhuber},
    title     = {Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks},
    booktitle = {Proceedings of the 23rd International Conference on Machine Learning ({ICML})},
    pages     = {369--376},
    address   = {Pittsburgh, Pennsylvania, USA},
    year      = {2006}
    }
  3. Stefan Balke, Vlora Arifi-Müller, Lukas Lamprecht, and Meinard Müller
    Retrieving Audio Recordings Using Musical Themes
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 281–285, 2016.
    @inproceedings{BalkeALM16_BarlowRetrieval_ICASSP,
    author    = {Stefan Balke and Vlora Arifi-M{\"u}ller and Lukas Lamprecht and Meinard M{\"u}ller},
    title     = {Retrieving Audio Recordings Using Musical Themes},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    address   = {Shanghai, China},
    year      = {2016},
    pages     = {281--285},
    }
  4. Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan P. Bello
    Deep Salience Representations for F0 Estimation in Polyphonic Music
    In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR): 63–70, 2017.
    @inproceedings{BittnerMSLB17_DeepSalience_ISMIR,
    author    = {Rachel M. Bittner and Brian McFee and Justin Salamon and Peter Li and Juan P. Bello},
    title     = {Deep Salience Representations for {F0} Estimation in Polyphonic Music},
    booktitle = {Proceedings of the International Society for Music Information Retrieval Conference ({ISMIR})},
    pages     = {63--70},
    year      = {2017},
    address   = {Suzhou, China},
    }
  5. Frank Zalkow, Stefan Balke, and Meinard Müller
    Evaluating Salience Representations for Cross-Modal Retrieval of Western Classical Music Recordings
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 331–335, 2019.
    @inproceedings{ZalkowBM19_SalienceRetrieval_ICASSP,
    author      = {Frank Zalkow and Stefan Balke and Meinard M{\"u}ller},
    title       = {Evaluating Salience Representations for Cross-Modal Retrieval of {W}estern Classical Music Recordings},
    booktitle   = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    address     = {Brighton, United Kingdom},
    year        = {2019},
    pages       = {331--335},
    }
  6. Frank Zalkow and Meinard Müller
    Using Weakly Aligned Score–Audio Pairs to Train Deep Chroma Models for Cross-Modal Music Retrieval
    In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR): 184–191, 2020.
    @inproceedings{ZalkowM20_BarlowCTC_ISMIR,
    author      = {Frank Zalkow and Meinard M{\"u}ller},
    title       = {Using Weakly Aligned Score--Audio Pairs to Train Deep Chroma Models for Cross-Modal Music Retrieval},
    booktitle   = {Proceedings of the International Society for Music Information Retrieval Conference ({ISMIR})},
    address     = {Montr{\'e}al, Canada},
    year        = {2020},
    pages       = {184--191}
    }
  7. Frank Zalkow, Stefan Balke, Vlora Arifi-Müller, and Meinard Müller
    MTD: A Multimodal Dataset of Musical Themes for MIR Research
    Transactions of the International Society for Music Information Retrieval (TISMIR), 3(1): 180–192, 2020.
    @article{ZalkowBAM20_MTD_TISMIR,
    title       = {{MTD}: A Multimodal Dataset of Musical Themes for {MIR} Research},
    author      = {Frank Zalkow and Stefan Balke and Vlora Arifi-M{\"{u}}ller and Meinard M{\"{u}}ller},
    journal     = {Transactions of the International Society for Music Information Retrieval ({TISMIR})},
    volume      = {3},
    number      = {1},
    year        = {2020},
    pages       = {180--192},
    doi         = {10.5334/tismir.68},
    url-details = {https://transactions.ismir.net/articles/10.5334/tismir.68/}
    }