This is the accompanying website for the ISMIR paper "Tuning Matters: Analyzing Musical Tuning Bias in Neural Vocoders" by Hans-Ulrich Berendes, Ben Maman, and Meinard Müller.
Vocoders, which reconstruct time-domain waveforms from spectral representations such as mel-spectrograms, are essential in modern music and speech synthesis. Traditional signal-processing techniques like the Griffin-Lim algorithm have largely been replaced by neural vocoders, which leverage generative models to achieve superior audio quality. However, these models can introduce artifacts and biases, potentially affecting their output in unforeseen ways. In this study, we examine how different musical tunings affect neural mel-to-audio vocoders within the context of Western music, where performances do not necessarily adhere to the modern 440 Hz standard tuning. As a key contribution, we evaluate several recent neural vocoders on datasets containing piano, violin, and singing voice recordings. Our results reveal that different vocoders exhibit distinct biases, causing deviations in tuning and affecting waveform reconstruction quality in the case of non-standard tuning. Our work underscores the need for improved vocoder robustness in music synthesis and provides insights for refining future models.
In the following, we illustrate how the quality of the vocoder output can deteriorate for non-standard tunings. To this end, we synthesize a piano sonata with a physical piano model, which allows us to specify the tuning explicitly. We synthesize five different tunings and then process each of the five renditions with each tested vocoder model. As you will hear, audible artifacts appear for some of the models as we move away from standard tuning.
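To make the notion of a tuning concrete: a tuning can be described as a deviation in cents from the 440 Hz standard, which fixes the frequency of A4 and, under twelve-tone equal temperament, of every other pitch. The following is a minimal sketch of that relationship only. The function names and the example offsets are assumptions chosen for illustration; they are not the five tunings used for the renditions, and the code is unrelated to the physical piano model.

```python
def reference_frequency(deviation_cents, standard_hz=440.0):
    """Frequency of A4 (in Hz) for a tuning that deviates from the
    standard by `deviation_cents` (100 cents = one semitone)."""
    return standard_hz * 2.0 ** (deviation_cents / 1200.0)


def midi_to_hz(midi_pitch, deviation_cents=0.0):
    """Equal-tempered frequency of a MIDI pitch (A4 = 69) under the given tuning."""
    a4 = reference_frequency(deviation_cents)
    return a4 * 2.0 ** ((midi_pitch - 69) / 12.0)


# Hypothetical offsets, for illustration only.
for cents in (-50, -25, 0, 25, 50):
    print(f"{cents:+4d} cents -> A4 = {reference_frequency(cents):.2f} Hz")
```

A deviation of 100 cents shifts every pitch by exactly one semitone, so tuning deviations are usually reported within a range of one semitone around the standard.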
We provide scatter plots of input vs. output tuning for all vocoders and datasets, computed with two tuning estimators: TempMatch (libfmp) and FreqHist (librosa).
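For readers unfamiliar with tuning estimation, the sketch below illustrates the general idea behind a frequency-histogram approach: each frequency estimate is mapped to its deviation from the nearest equal-tempered pitch, and the most frequent deviation is taken as the tuning. This is a simplified illustration, not the libfmp or librosa implementation. The function name, the 1-cent resolution, and the assumption that frequency estimates are already available as a list are all choices made for this example.

```python
import math
from collections import Counter


def estimate_tuning_cents(frequencies_hz, standard_hz=440.0):
    """Return the most common deviation (in whole cents, within [-50, 50))
    of the given frequencies from the equal-tempered grid based on `standard_hz`."""
    histogram = Counter()
    for f in frequencies_hz:
        if f <= 0:
            continue  # skip unvoiced or invalid estimates
        semitones = 12.0 * math.log2(f / standard_hz)
        # Fractional distance to the nearest semitone, wrapped to [-0.5, 0.5).
        residual = (semitones + 0.5) % 1.0 - 0.5
        cents = round(residual * 100.0)
        if cents == 50:  # keep the half-open range [-50, 50)
            cents = -50
        histogram[cents] += 1
    if not histogram:
        raise ValueError("no positive frequencies given")
    # The histogram peak is the tuning estimate.
    return histogram.most_common(1)[0][0]
```

Applying such an estimator to a vocoder's input and to its output gives one point in an input vs. output scatter plot; points off the diagonal indicate that the vocoder changed the tuning.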
Here we provide an additional evaluation of the listening test results; for the main results, please see the paper. In the following plot, we aggregate the decided preference votes of our AB listening test into the groups "Preference Pitch-Shifted" and "Preference Non-Pitch-Shifted" for each vocoder. For all neural models, listeners indicate a slight preference for the non-shifted items. This suggests that pitch-shifting has a slight negative influence on the output quality of the neural vocoder models, irrespective of tuning.
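The aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for the paper: the function name and the vote labels `"shifted"`, `"non_shifted"`, and `"undecided"` are hypothetical, and we assume that each vote is available as a (vocoder, choice) pair.

```python
from collections import Counter, defaultdict


def aggregate_preferences(votes):
    """Count decided AB votes per vocoder.

    `votes` is an iterable of (vocoder, choice) pairs, where choice is one of
    "shifted", "non_shifted", or "undecided" (hypothetical labels).
    Undecided votes are dropped, so only decided preferences are counted.
    """
    totals = defaultdict(Counter)
    for vocoder, choice in votes:
        if choice == "undecided":
            continue
        totals[vocoder][choice] += 1
    return {
        vocoder: {
            "shifted": counts["shifted"],
            "non_shifted": counts["non_shifted"],
        }
        for vocoder, counts in totals.items()
    }
```

The two counts per vocoder correspond to the groups "Preference Pitch-Shifted" and "Preference Non-Pitch-Shifted" in the plot.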
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No. 350953655 (MU 2686/11-2) and Grant No. 500643750 (MU 2686/15-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS.