Using Weakly Aligned Score–Audio Pairs to Train Deep Chroma Models for Cross-Modal Music Retrieval

This is the accompanying website for the following paper:

  1. Frank Zalkow and Meinard Müller
    Using Weakly Aligned Score–Audio Pairs to Train Deep Chroma Models for Cross-Modal Music Retrieval
    In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR): 184–191, 2020. PDF Details
    @inproceedings{ZalkowM20_WeaklyAlignedTrain_ISMIR,
    author      = {Frank Zalkow and Meinard M{\"u}ller},
    title       = {Using Weakly Aligned Score--Audio Pairs to Train Deep Chroma Models for Cross-Modal Music Retrieval},
    booktitle   = {Proceedings of the International Society for Music Information Retrieval Conference ({ISMIR})},
    address     = {Montréal, Canada},
    year        = {2020},
    pages       = {184--191},
    url-details = {https://www.audiolabs-erlangen.de/resources/MIR/2020-ISMIR-chroma-ctc},
    url-pdf     = {2020_ZalkowM_CTC_ISMIR.pdf}
    }

Abstract

Many music information retrieval tasks involve the comparison of a symbolic score representation with an audio recording. A typical strategy is to compare score–audio pairs based on a common mid-level representation, such as chroma features. Several recent studies demonstrated the effectiveness of deep learning models that learn task-specific mid-level representations from temporally aligned training pairs. However, in practice, there is often a lack of strongly aligned training data, in particular for real-world scenarios. In our study, we use weakly aligned score–audio pairs for training, where only the beginning and end of a score excerpt is annotated in an audio recording, without aligned correspondences in between. To exploit such weakly aligned data, we employ the Connectionist Temporal Classification (CTC) loss to train a deep learning model for computing an enhanced chroma representation. We then apply this model to a cross-modal retrieval task, where we aim at finding relevant audio recordings of Western classical music, given a short monophonic musical theme in symbolic notation as a query. We present systematic experiments that show the effectiveness of the CTC-based model for this theme-based retrieval task.
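To illustrate the kind of training objective described above, here is a minimal, hypothetical sketch of applying the CTC loss to frame-wise pitch-class predictions, using PyTorch's standard `nn.CTCLoss`. All shapes and names are illustrative assumptions (12 pitch classes plus a blank symbol, random data in place of a real network and real weakly aligned score–audio pairs); this is not the paper's exact training setup.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: T audio frames, batch size N,
# C = 12 pitch classes + 1 blank symbol (index 0).
T, N, C = 100, 4, 13

# Stand-in for frame-wise network outputs on audio excerpts.
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

# Weak alignment: each target is just the pitch-class sequence of a
# score excerpt (values 1..12; 0 is reserved for blank), with no
# frame-level correspondences to the audio.
targets = torch.randint(1, 13, (N, 20))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

# CTC marginalizes over all monotonic alignments between the frame
# sequence and the label sequence, so only start/end correspondence
# is needed.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients flow back into the network outputs
```

In a real setup, `logits` would come from the deep chroma model applied to audio features, and `targets` from the symbolically encoded score excerpt.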

Repository

Pre-trained models and code to apply them are available at:

Audio Excerpts and Jupyter Notebooks

The repository also contains two public-domain audio excerpts, listed in the following table.

ID | Composer  | Work                           | Performer                            | Description
1  | Beethoven | Symphony no. 5, op. 67         | Davis High School Symphony Orchestra | First movement, first theme
2  | Beethoven | Piano Sonata no. 2, op. 2 no. 2 | Paul Pitman                         | First movement, second theme

Furthermore, the repository contains a Jupyter notebook that demonstrates how to apply the model described in the paper. The following table links to HTML exports of this notebook for the two audio excerpts and all model variants in the repository. The model variants correspond to different training and validation splits.

Audio Excerpt | Model Variant   | Link
1             | train123valid4  | [link]
1             | train234valid5  | [link]
1             | train345valid1  | [link]
1             | train451valid2  | [link]
1             | train512valid3  | [link]
1             | train1234valid5 | [link]
1             | train2345valid1 | [link]
1             | train3451valid2 | [link]
1             | train4512valid3 | [link]
1             | train5123valid4 | [link]
2             | train123valid4  | [link]
2             | train234valid5  | [link]
2             | train345valid1  | [link]
2             | train451valid2  | [link]
2             | train512valid3  | [link]
2             | train1234valid5 | [link]
2             | train2345valid1 | [link]
2             | train3451valid2 | [link]
2             | train4512valid3 | [link]
2             | train5123valid4 | [link]
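The notebooks above apply the model to obtain enhanced chroma features; to give a rough feel for how such features can then be used for the theme-based retrieval task described in the abstract, here is a simplified sketch. It assumes L2-normalized 12-dimensional chroma frames and plain diagonal matching (mean frame-wise cosine similarity at each shift); the actual retrieval procedure in the paper may differ.

```python
import numpy as np

def diagonal_matching(query, database):
    """Slide a query chroma sequence over a database chroma sequence
    and return, for each shift, the mean frame-wise cosine similarity.

    query: (M, 12) array, database: (N, 12) array, rows L2-normalized.
    """
    M, N = len(query), len(database)
    return np.array([
        np.mean(np.sum(query * database[s:s + M], axis=1))
        for s in range(N - M + 1)
    ])

def normalize(X):
    """L2-normalize each chroma frame (row)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Toy data: a random "recording" with the query planted at frame 50.
rng = np.random.default_rng(0)
db = normalize(rng.random((200, 12)))
query = db[50:70].copy()

scores = diagonal_matching(query, db)
best = int(np.argmax(scores))  # best shift recovers position 50
```

In the retrieval scenario of the paper, the query chroma sequence would be derived from a monophonic theme in symbolic notation, and the database sequences from audio recordings processed by the trained model.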

Acknowledgements

Frank Zalkow and Meinard Müller are supported by the German Research Foundation (DFG-MU 2686/11-1, MU 2686/12-1). We thank Daniel Stoller for fruitful discussions on the CTC loss, and Michael Krause for proof-reading the manuscript. We also thank Stefan Balke and Vlora Arifi-Müller as well as all students involved in the annotation work, especially Lena Krauß and Quirin Seilbeck. The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS. The authors gratefully acknowledge the compute resources and support provided by the Erlangen Regional Computing Center (RRZE).

References

  1. Harold Barlow and Sam Morgenstern
    A Dictionary of Musical Themes
    Crown Publishers, Inc., 1975.
    @book{BarlowM75_MusicalThemes_BOOK,
    Author    = {Harold Barlow and Sam Morgenstern},
    Edition   = {Revised edition},
    Publisher = {Crown Publishers, Inc.},
    Title     = {A Dictionary of Musical Themes},
    Year      = {1975}
    }
  2. Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber
    Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
    In Proceedings of the International Conference on Machine Learning (ICML): 369–376, 2006.
    @inproceedings{GravesFGS15_CTC_ICML,
    author    = {Alex Graves and Santiago Fern{\'{a}}ndez and Faustino J. Gomez and J{\"{u}}rgen Schmidhuber},
    title     = {Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks},
    booktitle = {Proceedings of the International Conference on Machine Learning ({ICML})},
    pages     = {369--376},
    year      = {2006},
    address   = {Pittsburgh, Pennsylvania, USA}
    }
  3. Stefan Balke, Vlora Arifi-Müller, Lukas Lamprecht, and Meinard Müller
    Retrieving Audio Recordings Using Musical Themes
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 281–285, 2016.
    @inproceedings{BalkeALM16_BarlowRetrieval_ICASSP,
    author    = {Stefan Balke and Vlora Arifi-M{\"u}ller and Lukas Lamprecht and Meinard M{\"u}ller},
    title     = {Retrieving Audio Recordings Using Musical Themes},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    address   = {Shanghai, China},
    year      = {2016},
    pages     = {281--285},
    }
  4. Frank Zalkow, Stefan Balke, and Meinard Müller
    Evaluating Salience Representations for Cross-Modal Retrieval of Western Classical Music Recordings
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 311–335, 2019. Details
    @inproceedings{ZalkowBM19_SalienceRetrieval_ICASSP,
    author      = {Frank Zalkow and Stefan Balke and Meinard M{\"u}ller},
    title       = {Evaluating Salience Representations for Cross-Modal Retrieval of Western Classical Music Recordings},
    booktitle   = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    address     = {Brighton, United Kingdom},
    year        = {2019},
    pages       = {311--335},
    url-details = {https://www.audiolabs-erlangen.de/resources/MIR/2019-ICASSP-BarlowMorgenstern/},
    }
  5. Frank Zalkow and Meinard Müller
    Using Weakly Aligned Score–Audio Pairs to Train Deep Chroma Models for Cross-Modal Music Retrieval
    In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR): 184–191, 2020. PDF Details
    @inproceedings{ZalkowM20_WeaklyAlignedTrain_ISMIR,
    author      = {Frank Zalkow and Meinard M{\"u}ller},
    title       = {Using Weakly Aligned Score--Audio Pairs to Train Deep Chroma Models for Cross-Modal Music Retrieval},
    booktitle   = {Proceedings of the International Society for Music Information Retrieval Conference ({ISMIR})},
    address     = {Montréal, Canada},
    year        = {2020},
    pages       = {184--191},
    url-details = {https://www.audiolabs-erlangen.de/resources/MIR/2020-ISMIR-chroma-ctc},
    url-pdf     = {2020_ZalkowM_CTC_ISMIR.pdf}
    }