Fundamental frequency (F0) estimation is a critical task in audio, speech, and music processing applications, such as speech analysis and melody extraction. F0 estimation algorithms generally fall into two paradigms: classical signal processing-based methods and neural network-based approaches. Classical methods, like YIN and SWIPE, rely on explicit signal models, offering interpretability and computational efficiency, but their non-differentiable components hinder integration into deep learning pipelines. Neural network-based methods, such as CREPE, are fully differentiable and flexible but often lack interpretability and require substantial computational resources. In this paper, we propose dYIN and dSWIPE, differentiable variants of these two classical algorithms that combine the strengths of both paradigms. These variants enable gradient-based optimization while preserving the efficiency and interpretability of the original methods. Through several case studies, we demonstrate their potential: First, we use gradient descent to reverse-engineer audio signals, showing that dYIN and dSWIPE produce smoother gradients than CREPE. Second, we design a two-stage vocal melody extraction pipeline that integrates music source separation with a differentiable F0 estimator, providing an interpretable intermediate representation. Finally, we optimize dSWIPE's spectral templates for timbre-specific F0 estimation on violin recordings, demonstrating its enhanced adaptability over SWIPE. These case studies highlight that dYIN and dSWIPE successfully combine the flexibility of neural network-based methods with the interpretability and efficiency of classical algorithms, making them valuable tools for building end-to-end trainable and transparent systems.
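To give a rough picture of the first case study, the PyTorch sketch below backpropagates through a toy, soft-argmin version of the YIN difference function in order to fit a sinusoid's frequency to a target F0. This is not the paper's dYIN implementation; the estimator, its parameters, and the optimization settings are illustrative assumptions chosen only to show the pattern of gradient descent through a differentiable F0 estimator.

```python
import math
import torch

SR = 16000                  # sample rate in Hz (illustrative)
N = 1024                    # analysis frame length in samples
TAU_MIN, TAU_MAX = 40, 80   # lag search range, roughly 200-400 Hz at this SR

def soft_yin_f0(x, temperature=50.0):
    """Differentiable F0 estimate via a soft argmin over the YIN-style
    difference function d(tau) = sum_n (x[n] - x[n + tau])^2."""
    d = torch.stack([
        ((x[:N - TAU_MAX] - x[tau:tau + N - TAU_MAX]) ** 2).sum()
        for tau in range(TAU_MIN, TAU_MAX)
    ])
    taus = torch.arange(TAU_MIN, TAU_MAX, dtype=x.dtype)
    weights = torch.softmax(-temperature * d / d.mean(), dim=0)  # soft argmin
    tau_hat = (weights * taus).sum()                             # expected lag
    return SR / tau_hat

# "Reverse engineering" by gradient descent: adjust a sinusoid's frequency so
# that the differentiable estimator reports a target F0 of 220 Hz.
t = torch.arange(N) / SR
freq = torch.tensor(300.0, requires_grad=True)   # initial guess in Hz
target_f0 = 220.0
opt = torch.optim.Adam([freq], lr=1.0)

for _ in range(300):
    signal = torch.sin(2 * math.pi * freq * t)
    loss = (soft_yin_f0(signal) - target_f0) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"frequency after optimization: {freq.item():.1f} Hz")
```

Because the soft argmin is differentiable everywhere, the optimizer sees a smooth loss surface, which is the kind of property the gradient comparison in the first case study concerns.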
Annotations related to musical events such as chord labels, measure numbers, or structural descriptions are typically provided in textual format within or alongside a score-based representation of a piece. However, following these annotations while listening to a recording can be challenging without additional visual or auditory display. In this paper, we introduce an approach for enriching the listening experience by mixing music recordings with synthesized text annotations. Our approach aligns text annotations from a score-based timeline to the timeline of a specific recording and then uses text-to-speech synthesis to acoustically superimpose them onto the recording. We describe a processing pipeline for implementing this approach, allowing users to customize settings such as speaking language, speed, speech positioning, and loudness. Case studies include synthesizing text comments on measure positions in Schubert songs, chord annotations for Beatles songs, structural elements of Beethoven piano sonatas, and leitmotif occurrences in Wagner operas. Beyond these specific examples, our aim is to highlight the broader potential of speech-based auditory display. This approach offers valuable tools for researchers seeking a deeper understanding of datasets and their annotations, for evaluating music information retrieval algorithms, or for educational purposes in instrumental training, music-making, and aural training.
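The core of such a pipeline can be pictured along the following lines. This is a minimal sketch under several assumptions (measure-level text annotations, a precomputed measure-to-time alignment for the recording, pyttsx3/librosa/soundfile as stand-in components, and illustrative file names), not the authors' implementation.

```python
import numpy as np
import librosa
import soundfile as sf
import pyttsx3

SR = 22050  # working sample rate

# Score-based annotations: measure number -> text comment (illustrative values).
annotations = {9: "verse one", 17: "chorus", 25: "bridge"}

# Measure-to-time alignment for this particular recording, in seconds;
# in practice this comes from score-to-audio synchronization.
measure_to_time = {9: 12.4, 17: 24.8, 25: 37.1}

def synthesize(text, path, rate=170):
    """Render one text annotation to an audio file with an offline TTS engine."""
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)    # speaking speed in words per minute
    engine.save_to_file(text, path)
    engine.runAndWait()

# Load the recording (assumed filename) and superimpose each spoken annotation
# at its aligned position on the recording's timeline.
recording, _ = librosa.load("recording.wav", sr=SR, mono=True)
mix = recording.copy()
speech_gain = 0.8                       # loudness of the speech overlay

for measure, text in annotations.items():
    synthesize(text, "annotation_tmp.wav")
    speech, _ = librosa.load("annotation_tmp.wav", sr=SR, mono=True)
    start = int(measure_to_time[measure] * SR)
    end = min(start + len(speech), len(mix))
    mix[start:end] += speech_gain * speech[: end - start]

sf.write("recording_with_annotations.wav", np.clip(mix, -1.0, 1.0), SR)
```

In this sketch, temporal positioning and loudness correspond to the alignment map and speech_gain, while speaking language and speed would be controlled through the TTS engine's voice and rate settings.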