On this website you find an overview of various demos of my research group.
Music with its many representations can be seen as a multimedia scenario: There exist a number of different media objects (e.g., video recordings, lyrics, or sheet music) beside the actual music recording, which describe the music in different ways. In the course of digitization efforts, many of these media objects are nowadays publically available on the Internet. However, the media objects are ususally accessed individually without using their musical relationships. Using these relationships could open up new ways of navigating and interacting with the music. In this work, we model these relationships by taking the opera Die Walküre by Richard Wagner as a case study. As a first step, we describe the opera as a multimedia scenario and introduce the considered media objects. By using manual annotations, we establish mutual relationships between the media objects. These relationships are then modelled in a database schema. Finally, we preset a web-based demonstrator which offers several ways of navigation within the opera recordings and allows for accessing the media objects in a user-friendly way.
To learn an instrument, many people acquire the necessary sensorimotor and musical skills by imitating their teachers. In the case of studying jazz improvisation, the student needs to learn fundamental harmonic principles. In this work, we indicate the potential of incorporating computer-assisted methods in jazz piano lessons. In particular, we present a web-based tool offers an easy interaction with the offered multimedia content. This tool enables the student to revise the lesson's content with the help of recorded and annotated examples in an individual tempo.
Harmonic-percussive separation is a technique that splits music recordings into harmonic and percussive components—it can be used as a preprocessing step to facilitate further tasks like key detection (harmonic component) or drum transcription (percussive component). In this demo, we propose a cascaded harmonic-residual-percussive (HRP) procedure yielding a mid-level feature to analyze musical phenomena like percussive event density, timbral changes, and homogeneous structural segments.
In this work, we focus on transcribing walking bass lines, which provide clues for revealing the actual played chords in jazz recordings. Our transcription method is based on a deep neural network (DNN) that learns a mapping from a mixture spectrogram to a salience representation that emphasizes the bass line. Furthermore, using beat positions, we apply a late-fusion approach to obtain beat-wise pitch estimates of the bass line. First, our results show that this DNN-based transcription approach outperforms state-of-the-art transcription methods for the given task. Second, we found that an augmentation of the training set using pitch shifting improves the model performance. Finally, we present a semi-supervised learning approach where additional training data is generated from predictions on unlabeled datasets.
Retrieving short monophonic queries in music recordings is a challenging research problem in Music Information Retrieval (MIR). In jazz music, given a solo transcription, one retrieval task is to find the corresponding (potentially polyphonic) recording in a music collection. Many conventional systems approach such retrieval tasks by first extracting the predominant F0-trajectory from the recording, then quantizing the extracted trajectory to musical pitches and finally comparing the resulting pitch sequence to the monophonic query. In this paper, we introduce a data-driven approach that avoids the hard decisions involved in conventional approaches: Given pairs of time-frequency (TF) representations of full music recordings and TF representations of solo transcriptions, we use a DNN-based approach to learn a mapping for transforming a "polyphonic" TF representation into a "monophonic" TF representation. This transform can be considered as a kind of solo voice enhancement. We evaluate our approach within a jazz solo retrieval scenario and compare it to a state-of-the-art method for predominant melody extraction.
In view of applying Music Information Retrieval (MIR) techniques for music production, our goal is to extract high-quality component signals from drum solo recordings (so-called breakbeats). Specifically, we employ audio source separation techniques to recover sound events from the drum sound mixture that correspond to the individual drum strokes. Our separation approach is based on an informed variant of Non- Negative Matrix Factor Deconvolution (NMFD) that has been proposed and applied to drum transcription and separation in earlier works. In this article, we systematically study the suitability of NMFD and the impact of audio- and score-based side information in the context of drum separation. In the case of imperfect decompositions, we observe different cross-talk artifacts appearing during the attack and the decay segment of the extracted drum sounds. Based on these findings, we propose and evaluate two extensions to the core technique. The first extension is based on applying a cascaded NMFD decomposition while retaining selected side information. The second extension is a time-frequency selective restoration approach using a dictionary of single note drum sounds. For all our experiments, we use a publicly available data set consisting of multi-track drum recordings and corresponding annotations that allows us to evaluate the source separation quality. Using this test set, we show that our proposed methods can lead to an improved quality of the component signals.
The automated analysis of vibrato in complex music signals is a highly challenging task. A common strategy is to proceed in a two-step fashion. First, a fundamental frequency (F0) trajectory for the musical voice that is likely to exhibit vibrato is estimated. In a second step, the trajectory is then analyzed with respect to periodic frequency modulations. As a major drawback, however, such a method cannot recover from errors made in the inherently difficult first step, which severely limits the performance during the second step. In this work, we present a novel vibrato analysis approach that avoids the first error-prone F0-estimation step. Our core idea is to perform the analysis directly on a signal's spectrogram representation where vibrato is evident in the form of characteristic spectro-temporal patterns. We detect and parameterize these patterns by locally comparing the spectrogram with a predefined set of vibrato templates. Our systematic experiments indicate that this approach is more robust than F0-based strategies.
Harmonic–percussive–residual (HPR) sound separation is a useful preprocessing tool for applications such as pitched instrument transcription or rhythm extraction. In this demo, we show results from a novel method that uses the structure tensor—a mathemathmatical tool known from image processing—to calculate predominant orientation angles in the magnitude spectrogram. This orientation information can be used to distinguish between harmonic, percussive, and residual signal components, even in the case of frequency modulated signals.
Given a music recording, the objective of music structure analysis is to identify important structural elements and to temporally segment the recording according to these elements. As an important technical tool, the concept of self-similarity matrices is of fundamental importance in computational music structure. In this demo, you find many examples of such matrices for recordings of the "Winterreise" (Winter Journey). This song cycle, which consistis of 24 songs for single voice (usually sung by a tenor or baritone) accompanied by a piano, was composed by Franz Schubert in 1827 (D 911, op. 89).
The Single Microphone Switcher is a demo for exploring the recordings of the individual microphones used in the Freischütz Multitrack Dataset recordings. It sketches how the microphones were positioned in the room. The interface provides the possibility to listen to the individual microphone recordings. Furthermore, the instrument activation matrix, provides a visualization that shows which instruments are currently active (black) or inactive (white) at the current playback position.
When recording a live musical performance, the different voices, such as the instrument groups or soloists of an orchestra, are typically recorded in the same room simultaneously, with at least one microphone assigned to each voice. However, it is difficult to acoustically shield the microphones. In practice, each one contains interference from every other voice. In this paper, we aim to reduce these interferences in multi-channel recordings to recover only the isolated voices. Following the recently proposed Kernel Additive Modeling framework, we present a method that iteratively estimates both the power spectral density of each voice and the corresponding strength in each microphone signal. With this information, we build an optimal Wiener filter, strongly reducing interferences. The trade-off between distortion and separation can be controlled by the user through the number of iterations of the algorithm. Furthermore, we present a computationally efficient approximation of the iterative procedure. Listening tests demonstrate the effectiveness of the method.
A swarm of bees buzzing “Let it be” by the Beatles or the wind gently howling the romantic “Gute Nacht” by Schubert – these are examples of audio mosaics as we want to create them. Given a target and a source recording, the goal of audio mosaicing is to generate a mosaic recording that conveys musical aspects (like melody and rhythm) of the target, using sound components taken from the source. In this work, we propose a novel approach for automatically generating audio mosaics with the objective to preserve the source’s timbre in the mosaic. Inspired by algorithms for non-negative matrix factorization (NMF), our idea is to use update rules to learn an activation matrix that, when multiplied with the spectrogram of the source recording, resembles the spectrogram of the target recording. However, when applying the original NMF procedure, the resulting mosaic does not adequately reflect the source’s timbre. As our main technical contribution, we propose an extended set of update rules for the iterative learning procedure that supports the development of sparse diagonal structures in the activation matrix. We show how these structures better retain the source’s timbral characteristics in the resulting mosaic.
The problem of extracting singing voice from music recordings has received increasing research interest in recent years. Many proposed decomposition techniques are based on one of the following two strategies. The first approach is to directly decompose a given music recording into one component for the singing voice and one for the accompaniment by exploiting knowledge about specific characteristics of singing voice. Procedures following the second approach disassemble the recording into a large set of fine-grained components, which are classified and reassembled afterwards to yield the desired source estimates. In this paper, we propose a novel approach that combines the strengths of both strategies. We first apply different audio decomposition techniques in a cascaded fashion to disassemble the music recording into a set of mid-level components. This decomposition is fine enough to model various characteristics of singing voice, but coarse enough to keep an explicit semantic meaning of the components. These properties allow us to directly reassemble the singing voice and the accompaniment from the components. Our objective and subjective evaluations show that this strategy can compete with state-of-the-art singing voice separation algorithms and yields perceptually appealing results.
In recent years, methods to decompose an audio signal into a harmonic and a percussive component have received a lot of interest and are frequently applied as a processing step in a variety of scenarios. One problem is that the computed components are often not of purely harmonic or percussive nature but also contain noise-like sounds that are neither clearly harmonic nor percussive. Furthermore, depending on the parameter settings, one often can observe a leakage of harmonic sounds into the percussive component and vice versa. In this paper we present two extensions to a state-of-the-art harmonic-percussive separation procedure to target these problems. First, we introduce a separation factor parameter into the decomposition process that allows for tightening separation results and for enforcing the components to be clearly harmonic or percussive. As second contribution, inspired by the classical sines+transients+noise (STN) audio model, this novel concept is exploited to add a third residual component to the decomposition which captures the sounds that lie in between the clearly harmonic and percussive sounds of the audio signal.
A major problem in time-scale modification (TSM) of music signals is that percussive transients are often perceptually degraded. To prevent this degradation, some TSM approaches try to explicitly identify transients in the input signal and to handle them in a special way. However, such approaches are problematic for two reasons. First, errors in the transient detection have an immediate influence on the final TSM result and, second, a perceptual transparent preservation of transients is by far not a trivial task. In this paper we present a TSM approach that handles transients implicitly by first separating the signal into a harmonic component as well as a percussive component which typically contains the transients. While the harmonic component is modified with a phase vocoder approach using a large frame size, the noise-like percussive component ismodified with a simple time-domain overlap-add technique using a short frame size, which preserves the transients to a high degree without any explicit transient detection.
The separation of different sound sources from polyphonic music recordings constitutes a complex task since one has to account for different musical and acoustical aspects. In the last years, various score-informed procedures have been suggested where musical cues such as pitch, timing, and track information are used to support the source separation process. In this paper, we discuss a framework for decomposing a given music recording into notewise audio events which serve as elementary building blocks. In particular, we introduce an interface that employs the additional score information to provide a natural way for a user to interact with these audio events. By simply selecting arbitrary note groups within the score a user can access, modify, or analyze corresponding events in a given audio recording. In this way, our framework not only opens up new ways for audio editing applications, but also serves as a valuable tool for evaluating and better understanding the results of source separation algorithms.