Lecture: Selected Topics in Deep Learning for Audio, Speech, and Music Processing (Summer Term 2021)



  • Instructors: Prof. Dr. ir. Emanuël Habets, Prof. Dr. Meinard Müller
  • Credits: 2.5 ECTS
  • Time (Lecture): Summer Term 2021, Monday, 16:00–17:45 (first lecture: 19.04.2021, via ZOOM)
    Link and access information for our ZOOM meetings can be found at StudOn (see below).
  • Exam (graded): Oral examination at the end of term
  • Dates (Lecture): Mo 19.04.2021, Mo 26.04.2021, Mo 03.05.2021, Mo 10.05.2021, Mo 17.05.2021, Mo 31.05.2021, Mo 07.06.2021, Mo 14.06.2021, Mo 21.06.2021, Mo 28.06.2021, Mo 05.07.2021, Mo 12.07.2021
  • Examination Dates (Room 3R4.03): To be announced
Important Notes:
  • Due to the COVID-19 pandemic, the lecture Selected Topics in Deep Learning for Audio, Speech, and Music Processing will be offered as a fully virtual course (via ZOOM).
  • Participation in the ZOOM session is only possible for FAU students. The ZOOM access information for this course will be made available via StudOn. Therefore, you must register via StudOn prior to the first lecture.
  • This course will be based on articles from the research literature. It is strongly advised that students prepare for the lecture by reading these articles. The lecture time will be used for introducing the respective problem, deepening important technical aspects, and holding a question-and-answer dialogue with the participants.
  • As a technical requirement, all participants must have access to a computer capable of running the ZOOM video conferencing software (as provided by FAU), including audio and video transmission as well as screen sharing.
  • To ensure privacy, the ZOOM sessions will not be recorded. Also, participants are not permitted to record the ZOOM sessions. Furthermore, ZOOM links may not be distributed.

Content

Many recent advances in audio, speech, and music processing have been driven by techniques based on deep learning (DL). For example, DL-based techniques have led to significant improvements in speaker separation, speech synthesis, acoustic scene analysis, audio retrieval, chord recognition, melody estimation, and beat tracking. Considering specific audio, speech, and music processing tasks, we study various DL-based approaches and their capability to extract complex features and make predictions based on hidden structures and relations. Rather than giving a comprehensive overview, we will study selected and generally applicable DL-based techniques. Furthermore, in the context of challenging application scenarios, we will critically review the potential and limitations of recent deep learning techniques. As one main general objective of the lecture, we want to discuss how one can integrate domain knowledge into neural network architectures to obtain explainable models that are less vulnerable to data biases and confounding factors.

The course consists of two overview-like lectures, where we introduce current research problems in audio, speech, and music processing. We will then continue with 6 to 8 lectures on selected audio processing topics and DL-based techniques. Since these lectures are based on articles from the research literature, we will provide detailed explanations in mathematical depth; we may also try to attract some of the original authors to serve as guest lecturers. Finally, we round off the course with a concluding lecture covering practical aspects (e.g., hardware, software, version control, reproducibility, datasets) that are relevant when working with DL-based techniques.

Course Requirements

In this course, we require a good knowledge of deep learning techniques, machine learning, and pattern recognition as well as a strong mathematical background. Furthermore, we require a solid background in general digital signal processing and some experience with audio, image, or video processing.

It is recommended to complete the following modules (or to have equivalent knowledge) before starting this module:

Links

Examination

There will be oral examinations (30 minutes) either in July or October. In the exam, you should be able to summarize the lectures' content and to answer general questions as listed below. Additionally, you need to pick one of the lectures (Lecture 3 to Lecture 9) as your in-depth topic, for which you should be able to answer detailed technical questions about the specified papers. For further details and appointments, please check StudOn.

Lecture: Topics, Material, Instructions

The course consists of two overview-like lectures, where we introduce current research problems in audio, speech, and music processing. We will then continue with 6 to 8 lectures which are based on articles from the research literature. The lecture material includes handouts of slides, links to the original articles, and possibly links to demonstrators and further online resources. In the following list, you find links to the material. If you have any questions regarding the lecture, please contact Prof. Dr. ir. Emanuël Habets and Prof. Dr. Meinard Müller.

The following tentative schedule gives an overview:

Lecture 1: Introduction to Audio and Speech Processing

  • Date: Monday, 19.04.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: Emanuël Habets
  • Slides: Available at StudOn for registered students
  • Questions:
    • What are the building blocks of a human-machine interface?
    • What is the goal of speech enhancement? Why is this a challenging task?
    • What are the common processing steps for speech enhancement and speaker extraction?
    • What are learning strategies to obtain a mask that can be used to enhance a noisy signal? (See the masking sketch after this list.)
    • Why are data-driven approaches so effective in solving speech and audio processing tasks?
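The masking idea underlying several of these questions can be illustrated with a minimal NumPy sketch (all signals and shapes here are toy assumptions, not taken from the lecture): an oracle ideal ratio mask is computed from clean and noise spectrograms and applied to the noisy one; in a DL-based system, a network would predict such a mask from the noisy input alone.

```python
import numpy as np

def ideal_ratio_mask(S_clean, S_noise, eps=1e-8):
    """Oracle ratio mask computed from clean-speech and noise magnitude spectrograms."""
    return np.abs(S_clean) / (np.abs(S_clean) + np.abs(S_noise) + eps)

def apply_mask(S_noisy, mask):
    """Enhance a noisy spectrogram by element-wise masking."""
    return mask * S_noisy

# Toy "spectrograms" of shape (frequency bins, frames)
rng = np.random.default_rng(0)
S_clean = rng.random((513, 100))
S_noise = rng.random((513, 100))
S_noisy = S_clean + S_noise              # additive mixture (magnitude-domain toy example)

M = ideal_ratio_mask(S_clean, S_noise)   # training target in mask-based learning strategies
S_enhanced = apply_mask(S_noisy, M)      # at test time, M would be predicted by a network
```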

Lecture 2: Introduction to Music Processing

  • Date: Monday, 26.04.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: Meinard Müller
  • Slides (PDF), Handouts (6 slides per page) (PDF)
  • Questions:
    • What is a piano roll representation? What does it represent?
    • What is the objective of music synchronization?
    • What is the objective of tempo estimation and beat tracking?
    • What are the steps for computing the spectral flux? What is it good for? (See the sketch after this list.)
    • How can one simulate the various steps using deep learning operations?
    • What is the goal of music transcription?
    • What is the goal of music source separation? Why is it challenging?
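As a pointer for the spectral-flux question above, the following minimal NumPy sketch follows one common definition (log-compressed magnitude STFT, half-wave rectified frame-wise differences, summed over frequency); the exact variant and parameters used in the lecture may differ.

```python
import numpy as np

def spectral_flux(x, n_fft=2048, hop=512, gamma=10.0):
    """Spectral-flux novelty curve: STFT -> log compression -> positive differences -> sum over frequency."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = [window * x[i * hop:i * hop + n_fft] for i in range(n_frames)]
    S = np.abs(np.fft.rfft(frames, axis=1)).T        # magnitude spectrogram, shape (frequency, time)
    Y = np.log(1.0 + gamma * S)                      # logarithmic compression
    diff = np.diff(Y, axis=1)                        # frame-wise differences
    return np.maximum(diff, 0.0).sum(axis=0)         # keep only increases in energy

# Toy usage: a tone switched on at 0.5 s yields a peak in the novelty curve
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) * (t > 0.5)
novelty = spectral_flux(x)
```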
  1. Meinard Müller
    Fundamentals of Music Processing
    Springer Verlag, 2015.
    @book{Mueller15_FMP_SPRINGER,
    author    = {Meinard M{\"u}ller},
    title     = {Fundamentals of Music Processing},
    type      = {Monograph},
    year      = {2015},
    publisher = {Springer Verlag}
    }
  2. Meinard Müller and Frank Zalkow
    FMP Notebooks: Educational Material for Teaching and Learning Fundamentals of Music Processing
    In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR): 573–580, 2019. DOI
    @inproceedings{MuellerZ19_FMP_ISMIR,
    author    = {Meinard M{\"u}ller and Frank Zalkow},
    title     = {{FMP} {N}otebooks: {E}ducational Material for Teaching and Learning Fundamentals of Music Processing},
    booktitle = {Proceedings of the International Society for Music Information Retrieval Conference ({ISMIR})},
    address   = {Delft, The Netherlands},
    pages     = {573--580},
    year      = {2019},
    doi       = {10.5281/zenodo.3527872}
    }

Lecture 3: Permutation Invariant Training Techniques for Speech Separation

  • Date: Monday, 03.05.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: Wolfgang Mack, Emanuël Habets
  • Slides: Available at StudOn for registered students
  • Exam: Paper 1 (Hershey et al.) and Paper 3 (Kolbaek et al.)
  • Questions:
    • Why is deep learning suited for audio source separation?
    • How do you determine the number of possible permutations in permutation-invariant training for speaker separation?
    • In permutation-invariant training, how do you define the loss used to update the neural network? (See the sketch after this list.)
    • Why is the concept of W-orthogonality important for deep clustering?
    • In deep clustering, should embeddings of time-frequency bins that are dominated by different speakers be orthogonal?
    • In deep attractor networks, how are the attractors and embeddings processed to obtain ratio masks for speaker separation?
    • What are possible methods (name at least three) to address the permutation problem in speaker-speaker separation?
    • In deep clustering, how are binary masks obtained from embeddings?
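To make the permutation-invariant training (PIT) loss referenced above concrete, here is a minimal PyTorch sketch (tensor shapes and the MSE criterion are illustrative assumptions): the training loss for each utterance is the minimum, over all speaker permutations, of the mean pairwise error between estimated and reference sources.

```python
import itertools
import torch

def pit_mse_loss(est, ref):
    """Utterance-level permutation-invariant MSE loss.

    est, ref: (batch, num_speakers, num_samples). For each utterance, the loss is the
    minimum over all num_speakers! permutations of the mean pairwise MSE.
    """
    n_spk = est.shape[1]
    per_perm = []
    for perm in itertools.permutations(range(n_spk)):
        perm_est = est[:, list(perm), :]                           # reorder the estimated sources
        per_perm.append(((perm_est - ref) ** 2).mean(dim=(1, 2)))  # per-utterance MSE for this permutation
    per_perm = torch.stack(per_perm, dim=1)                        # (batch, number of permutations)
    return per_perm.min(dim=1).values.mean()                       # best permutation per utterance

# Toy usage with two speakers (2! = 2 permutations)
est = torch.randn(4, 2, 16000, requires_grad=True)
ref = torch.randn(4, 2, 16000)
loss = pit_mse_loss(est, ref)
loss.backward()
```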
  1. John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe
    Deep clustering: Discriminative embeddings for segmentation and separation
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 31–35, 2016. DOI
    @inproceedings{HersheyCRW16_DeepClustering_ICASSP,
    author    = {John R. Hershey and Zhuo Chen and Jonathan Le Roux and Shinji Watanabe},
    title     = {Deep clustering: Discriminative embeddings for segmentation and separation},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    pages     = {31--35},
    year      = {2016},
    doi       = {10.1109/ICASSP.2016.7471631}
    }
  2. Zhuo Chen, Yi Luo, and Nima Mesgarani
    Deep attractor network for single-microphone speaker separation
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 246–250, 2017. DOI
    @inproceedings{ChenLM17_DeepAttractor,
    author    = {Zhuo Chen and Yi Luo and Nima Mesgarani},
    title     = {Deep attractor network for single-microphone speaker separation},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    pages     = {246--250},
    year      = {2017},
    doi       = {10.1109/ICASSP.2017.7952155}
    }
  3. Morten Kolbaek, Dong Yu, Zheng-Hua Tan, and Jesper Jensen
    Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(10): 1901–1913, 2017. DOI
    @article{KolbaekYTJ17_SpeechSep_TASLP,
    author    = {Morten Kolbaek and Dong Yu and Zheng-Hua Tan and Jesper Jensen},
    title     = {Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks},
    journal   = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing},
    volume    = {25},
    number    = {10},
    pages     = {1901--1913},
    year      = {2017},
    url       = {https://doi.org/10.1109/TASLP.2017.2726762},
    doi       = {10.1109/TASLP.2017.2726762}
    }
  4. Ilya Kavalerov, Scott Wisdom, Hakan Erdogan, Brian Patton, Kevin W. Wilson, Jonathan Le Roux, and John R. Hershey
    Universal Sound Separation
    In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA): 175–179, 2019. DOI
    @inproceedings{KavalerovWEPWRH19_UniversalSoundSep_WASPAA,
    author    = {Ilya Kavalerov and Scott Wisdom and Hakan Erdogan and Brian Patton and Kevin W. Wilson and Jonathan Le Roux and John R. Hershey},
    title     = {Universal Sound Separation},
    booktitle = {Proceedings of the {IEEE} Workshop on Applications of Signal Processing to Audio and Acoustics ({WASPAA})},
    pages     = {175--179},
    year      = {2019},
    url       = {https://doi.org/10.1109/WASPAA.2019.8937253},
    doi       = {10.1109/WASPAA.2019.8937253}
    }

Lecture 4: Deep Clustering for Single-Channel Ego-Noise Suppression

  • Date: Monday, 10.05.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: Annika Briegleb
  • Slides: Available at StudOn for registered students
  • Exam: Paper 1 (Briegleb et al.) and Paper 2 (Hershey et al.)
  • Questions:
    • What is the goal of ego-noise suppression?
    • What is an ideal binary mask? What is an ideal ratio mask?
    • What is the idea of deep clustering? What is the input? What is the embedding space? What are the targets?
    • How is the cost function for deep clustering defined (affinity loss)? (See the sketch after this list.)
    • How is deep clustering applied at the test stage?
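For the affinity-loss question above, the following PyTorch sketch writes the deep clustering objective ||V V^T - Y Y^T||_F^2 in its expanded form so that no (TF x TF) matrix has to be formed; the embedding dimension, number of sources, and batch size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def deep_clustering_loss(V, Y):
    """Affinity loss ||V V^T - Y Y^T||_F^2, expanded as
    ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2 to avoid forming (TF x TF) matrices.

    V: (batch, TF, D)  embeddings of time-frequency bins (typically unit norm)
    Y: (batch, TF, C)  one-hot indicators of the dominating source per bin
    """
    def frob2(A, B):
        # squared Frobenius norm of A^T B, summed over the batch
        return (torch.bmm(A.transpose(1, 2), B) ** 2).sum()
    return frob2(V, V) - 2.0 * frob2(V, Y) + frob2(Y, Y)

# Toy usage: 100 time-frequency bins, 20-dimensional embeddings, 2 sources
V = F.normalize(torch.randn(3, 100, 20), dim=-1)
Y = F.one_hot(torch.randint(0, 2, (3, 100)), num_classes=2).float()
loss = deep_clustering_loss(V, Y)
```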
  1. Annika Briegleb, Alexander Schmidt, and Walter Kellermann
    Deep Clustering for Single-Channel Ego-Noise Suppression
    In Proceedings of the International Congress on Acoustics (ICA): 2813–2820, 2019. PDF DOI
    @inproceedings{BrieglebSK01_EgeNoise_ICA,
    author    = {Annika Briegleb and Alexander Schmidt and Walter Kellermann},
    title     = {Deep Clustering for Single-Channel Ego-Noise Suppression},
    booktitle = {Proceedings of the International Congress on Acoustics ({ICA})},
    pages     = {2813--2820},
    year      = {2019},
    doi       = {10.18154/RWTH-CONV-239374},
    url-pdf    = {https://pub.dega-akustik.de/ICA2019/data/articles/000705.pdf}
    }
  2. John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe
    Deep clustering: Discriminative embeddings for segmentation and separation
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 31–35, 2016. DOI
    @inproceedings{HersheyCRW16_DeepClustering_ICASSP,
    author    = {John R. Hershey and Zhuo Chen and Jonathan Le Roux and Shinji Watanabe},
    title     = {Deep clustering: Discriminative embeddings for segmentation and separation},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    pages     = {31--35},
    year      = {2016},
    doi       = {10.1109/ICASSP.2016.7471631}
    }
  3. Zhuo Chen, Yi Luo, and Nima Mesgarani
    Deep attractor network for single-microphone speaker separation
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 246–250, 2017. DOI
    @inproceedings{ChenLM17_DeepAttractor,
    author    = {Zhuo Chen and Yi Luo and Nima Mesgarani},
    title     = {Deep attractor network for single-microphone speaker separation},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    pages     = {246--250},
    year      = {2017},
    doi       = {10.1109/ICASSP.2017.7952155}
    }
  4. Yi Luo, Zhuo Chen, and Nima Mesgarani
    Speaker-Independent Speech Separation With Deep Attractor Network
    IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(4): 787–796, 2018. DOI
    @article{LuoCM18_DeepAttractorNetwork_TASLP,
    author    = {Yi Luo and Zhuo Chen and Nima Mesgarani},
    title     = {Speaker-Independent Speech Separation With Deep Attractor Network},
    journal   = {{IEEE/ACM} Transactions on Audio, Speech, and Language Processing},
    volume    = {26},
    number    = {4},
    pages     = {787--796},
    year      = {2018},
    url       = {https://doi.org/10.1109/TASLP.2018.2795749},
    doi       = {10.1109/TASLP.2018.2795749}
    }

Lecture 5: Music Source Separation

  • Date: Monday, 17.05.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: Fabian-Robert Stöter
  • Slides: Available at StudOn for registered students
  • Exam: Paper 1 (Jansson et al.) and Paper 2 (Stöter et al.)
  • Questions:
    • What are applications for music source separation?
    • How can one formalize the task of music source separation?
    • What is the basic architecture of a masking-based source separation approach? What can be used as input representation? What is the output of such an approach? How does one obtain the separated signals?
    • What is the main idea of a U-net architecture? What is the role of downsampling and upsampling? What is the idea of transposed convolution? (See the sketch after this list.)
    • What is the role of skip connections?
    • What are issues when using magnitude spectrograms in source separation?
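To illustrate the U-Net-related questions above, here is a deliberately tiny PyTorch sketch with a single downsampling/upsampling level, a transposed convolution for upsampling, a skip connection by concatenation, and a sigmoid mask applied to the input magnitude spectrogram; it is a structural toy, far smaller than the architectures described in the papers.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style mask estimator for magnitude spectrograms (one level deep)."""
    def __init__(self, ch=16):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(1, ch, 4, stride=2, padding=1), nn.ReLU())         # encoder: downsampling
        self.up = nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU())  # decoder: transposed conv
        self.out = nn.Conv2d(ch + 1, 1, 3, padding=1)   # "+1": the skip connection concatenates the input

    def forward(self, x):                     # x: (batch, 1, freq, time), magnitude spectrogram
        u = self.up(self.down(x))
        u = torch.cat([u, x], dim=1)          # skip connection preserves fine spectral detail
        mask = torch.sigmoid(self.out(u))     # soft mask in [0, 1]
        return mask * x                       # separated magnitude; phase is taken from the mixture

# Toy usage
spec = torch.rand(2, 1, 512, 128)
est = TinyUNet()(spec)
```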
  1. Andreas Jansson, Eric J. Humphrey, Nicola Montecchio, Rachel M. Bittner, Aparna Kumar, and Tillman Weyde
    Singing Voice Separation with Deep U-Net Convolutional Networks
    In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR): 745–751, 2017. DOI
    @inproceedings{JanssonHMBKW17_SingingSep_ISMIR,
    author    = {Andreas Jansson and Eric J. Humphrey and Nicola Montecchio and Rachel M. Bittner and Aparna Kumar and Tillman Weyde},
    title     = {Singing Voice Separation with Deep {U}-{N}et Convolutional Networks},
    booktitle = {Proceedings of the International Society for Music Information Retrieval Conference ({ISMIR})},
    pages     = {745--751},
    year      = {2017},
    url       = {https://ismir2017.smcnus.org/wp-content/uploads/2017/10/171\_Paper.pdf},
    doi       = {10.5281/zenodo.1414934}
    }
  2. Fabian-Robert Stöter, Stefan Uhlich, Antoine Liutkus, and Yuki Mitsufuji
    Open-Unmix — A Reference Implementation for Music Source Separation
    Journal of Open Source Software (JOSS), 4(41): 1667, 2019. DOI
    @article{StoterULM19_Unmix_JOSS,
    author    = {Fabian{-}Robert St{\"{o}}ter and Stefan Uhlich and Antoine Liutkus and Yuki Mitsufuji},
    title     = {{Open-Unmix} -- {A} Reference Implementation for Music Source Separation},
    journal   = {Journal of Open Source Software ({JOSS})},
    volume    = {4},
    number    = {41},
    pages     = {1667},
    year      = {2019},
    url       = {https://doi.org/10.21105/joss.01667},
    doi       = {10.21105/joss.01667}
    }
  3. Olaf Ronneberger, Philipp Fischer, and Thomas Brox
    U-Net: Convolutional Networks for Biomedical Image Segmentation
    In Proceedings of Medical Image Computing and Computer-Assisted Intervention (MICCAI): 234–241, 2015. DOI
    @inproceedings{RonnebergerFB15_UNet_LNCS,
    author    = {Olaf Ronneberger and Philipp Fischer and Thomas Brox},
    editor    = {Nassir Navab and
    Joachim Hornegger and William M. Wells III and Alejandro F. Frangi},
    title     = {{U}-{N}et: {C}onvolutional Networks for Biomedical Image Segmentation},
    booktitle = {Proceedings of Medical Image Computing and Computer-Assisted Intervention ({MICCAI})},
    series    = {Lecture Notes in Computer Science},
    volume    = {9351},
    pages     = {234--241},
    publisher = {Springer},
    year      = {2015},
    url       = {https://doi.org/10.1007/978-3-319-24574-4_28},
    doi       = {10.1007/978-3-319-24574-4_28}
    }
  4. Augustus Odena, Vincent Dumoulin, and Chris Olah
    Deconvolution and Checkerboard Artifacts
    Distill, 1(10), 2016. DOI
    @article{OdenaDO16_deconvolution_Destill,
    author = {Augustus Odena and Vincent Dumoulin and Chris Olah},
    title = {Deconvolution and Checkerboard Artifacts},
    journal = {Distill},
    year = {2016},
    volume    = {1},
    number    = {10},
    url = {http://distill.pub/2016/deconv-checkerboard},
    doi = {10.23915/distill.00003}
    }

Lecture 6: Nonnegative Autoencoders with Applications to Music Audio Decomposition

  • Date: Monday, 31.05.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: Meinard Müller, Yigitcan Özer
  • Slides (PDF), Handouts (6 slides per page) (PDF)
  • Exam: Paper 1 (Smaragdis et al.) and Paper 2 (Ewert and Sandler)
  • Questions:
    • What is the general pipeline for score-informed audio decomposition?
    • What is NMF? How can NMF be formulated as an optimization problem?
    • How can NMF be used for spectrogram decomposition? What is the interpretation?
    • How does one obtain multiplicative from additive update rules? (See the NMF sketch after this list.)
    • How can one integrate template and activation constraints into NMF?
    • What is the general idea of an autoencoder?
    • How can one simulate NMF as an autoencoder?
    • How can one enforce nonnegativity in autoencoders?
    • How can one integrate score-informed constraints into autoencoders?
    • What is the idea of projected gradient descent?
    • What is the general idea of structured dropout?
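As a reference point for the NMF questions above (see the update-rule question in this list), the following NumPy sketch implements the classical Lee–Seung multiplicative updates for the Euclidean cost ||V - WH||_F^2; the nonnegative-autoencoder view discussed in the papers replaces such updates by gradient-based training with explicit nonnegativity constraints. Matrix sizes and the number of iterations are arbitrary choices.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    """NMF via Lee-Seung multiplicative updates for the Euclidean cost ||V - WH||_F^2."""
    rng = np.random.default_rng(0)
    K, N = V.shape
    W = rng.random((K, rank))      # templates (e.g., spectral patterns)
    H = rng.random((rank, N))      # activations (when each template is active)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative updates keep H and W nonnegative
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy usage: decompose a nonnegative "spectrogram" into 3 components
V = np.abs(np.random.default_rng(1).standard_normal((1025, 200)))
W, H = nmf(V, rank=3)
approx_error = np.linalg.norm(V - W @ H)
```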
  1. Paris Smaragdis and Shrikant Venkataramani
    A neural network alternative to non-negative audio models
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 86–90, 2017. DOI
    @inproceedings{SmaragdisV17_NMFAutoencoder_ICASSP,
    author    = {Paris Smaragdis and Shrikant Venkataramani},
    title     = {A neural network alternative to non-negative audio models},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    address   = {New Orleans, Louisiana, USA},
    pages     = {86--90},
    year      = {2017},
    url       = {https://doi.org/10.1109/ICASSP.2017.7952123},
    doi       = {10.1109/ICASSP.2017.7952123}
    }
  2. Sebastian Ewert and Mark B. Sandler
    Structured Dropout for Weak Label and Multi-Instance Learning and Its Application to Score-Informed Source Separation
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 2277–2281, 2017. DOI
    @inproceedings{EwertS17_StructuredDropout_ICASSP,
    author    = {Sebastian Ewert and Mark B. Sandler},
    title     = {Structured Dropout for Weak Label and Multi-Instance Learning and Its Application to Score-Informed Source Separation},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    address   = {New Orleans, Louisiana, USA},
    pages     = {2277--2281},
    year      = {2017},
    url       = {https://doi.org/10.1109/ICASSP.2017.7952562},
    doi       = {10.1109/ICASSP.2017.7952562}
    }
  3. Sebastian Ewert and Meinard Müller
    Using Score-Informed Constraints for NMF-based Source Separation
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 129–132, 2012. Details
    @inproceedings{EwertM12_ScoreInformedNMF_ICASSP,
    author    = {Sebastian Ewert and Meinard M{\"u}ller},
    title     = {Using Score-Informed Constraints for {NMF}-based Source Separation},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    address   = {Kyoto, Japan},
    year      = {2012},
    pages   = {129--132},
    month     = {March},
    url-details = {http://resources.mpi-inf.mpg.de/MIR/ICASSP2012-ScoreInformedNMF/}
    }

Lecture 7: Attention in Sound Source Localization and Speaker Extraction

  • Date: Monday, 07.06.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: Mohamed Elminshawi, Wolfgang Mack, Emanuël Habets
  • Slides: Available at StudOn for registered students
  • Exam: Paper 1 (Zmoliková et al.) and Paper 2 (Mack et al.)
  • Questions:
    • Which kind of auxiliary information may be exploited for extracting a target speaker?
    • How is speaker information extracted from auxiliary information?
    • If the auxiliary information is given in the form of an audio signal, what is the minimum duration for it to be effective?
    • When are time-variant embeddings advantageous over time-invariant embeddings?
    • What additional step is necessary if the auxiliary information (e.g., video) has a different sampling rate compared to the mixture signal? Where is this additional step applied?
    • What is a suitable general structure of an extraction system that uses auxiliary information?
    • Why is the phase an important feature for DOA estimation?
    • Why can noise sources be used to simulate training data?
    • How can spectral information be used in addition to spatial information (phase) to enable signal-aware DOA estimation? (See the attention sketch after this list.)
    • What is the minimum number of microphones you need to perform DOA estimation?
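Several questions in this list refer to attention mechanisms (see the signal-aware DOA question above). The following PyTorch sketch shows generic scaled dot-product attention, where a single query vector (for instance, an embedding derived from auxiliary or spectral information) weights a sequence of feature frames; the shapes and their interpretation are illustrative assumptions, not taken from the cited papers.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Generic attention: the query selects which keys (e.g., time or time-frequency frames) to emphasize."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])   # similarity between query and keys
    weights = torch.softmax(scores, dim=-1)                     # attention weights sum to 1
    return weights @ V, weights

# Toy usage: one query vector (e.g., a target-speaker or spectral embedding)
# attends over 100 frames of 64-dimensional features
Q = torch.randn(1, 1, 64)
K = V = torch.randn(1, 100, 64)
context, weights = scaled_dot_product_attention(Q, K, V)        # context: (1, 1, 64)
```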
  1. Katerina Zmoliková, Marc Delcroix, Keisuke Kinoshita, Tsubasa Ochiai, Tomohiro Nakatani, Lukás Burget, and Jan Cernocky
    SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures
    IEEE Journal on Selected Topics in Signal Processing, 13(4): 800–814, 2019. DOI
    @article{ZmolikovaDKONBC19_SpeakerBeam_JSTSP,
    author    = {Katerina Zmolikov{\'{a}} and Marc Delcroix and Keisuke Kinoshita and Tsubasa Ochiai and Tomohiro Nakatani and Luk{\'{a}}s Burget and Jan Cernocky},
    title     = {{SpeakerBeam}: {S}peaker Aware Neural Network for Target Speaker Extraction
    in Speech Mixtures},
    journal   = {{IEEE} Journal on Selected Topics in Signal Processing},
    volume    = {13},
    number    = {4},
    pages     = {800--814},
    year      = {2019},
    url       = {https://doi.org/10.1109/JSTSP.2019.2922820},
    doi       = {10.1109/JSTSP.2019.2922820}
    }
  2. Wolfgang Mack, Ullas Bharadwaj, Soumitro Chakrabarty, and Emanuël A. P. Habets
    Signal-Aware Broadband DOA Estimation Using Attention Mechanisms
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 4930–4934, 2020. DOI
    @inproceedings{MackBCH20_DOA_ICASSP,
    author    = {Wolfgang Mack and Ullas Bharadwaj and Soumitro Chakrabarty and Emanu{\"{e}}l A. P. Habets},
    title     = {Signal-Aware Broadband {DOA} Estimation Using Attention Mechanisms},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    pages     = {4930--4934},
    publisher = {{IEEE}},
    year      = {2020},
    url       = {https://doi.org/10.1109/ICASSP40776.2020.9053658},
    doi       = {10.1109/ICASSP40776.2020.9053658}
    }
  3. Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Atsunori Ogawa, and Tomohiro Nakatani
    Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues
    In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech): 2718–2722, 2019. DOI
    @inproceedings{OchiaiDKON19_SpeakerBeam_Interspeech,
    author    = {Tsubasa Ochiai and Marc Delcroix and Keisuke Kinoshita and Atsunori Ogawa and Tomohiro Nakatani},
    title     = {Multimodal {SpeakerBeam}: {S}ingle Channel Target Speech Extraction with
    Audio-Visual Speaker Clues},
    booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech)},
    pages     = {2718--2722},
    publisher = {{ISCA}},
    year      = {2019},
    url       = {https://doi.org/10.21437/Interspeech.2019-1513},
    doi       = {10.21437/Interspeech.2019-1513}
    }
  4. Ziteng Wang, Junfeng Li, and Yonghong Yan
    Target Speaker Localization Based on the Complex Watson Mixture Model and Time-Frequency Selection Neural Network
    Applied Sciences, 8(11), 2018. PDF
    @article{WangLY18_SpeakerLoc_AppliedSciences,
    author = {Ziteng Wang and Junfeng Li and Yonghong Yan},
    title = {Target Speaker Localization Based on the Complex {W}atson Mixture Model and Time-Frequency Selection Neural Network},
    journal = {Applied Sciences},
    volume = {8},
    year = {2018},
    number = {11},
    url-pdf = {https://www.mdpi.com/2076-3417/8/11/2326}
    }

Lecture 8: Recurrent and Generative Adversarial Network Architectures for Text-to-Speech

  • Date: Monday, 14.06.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: Nicola Pia, Christian Dittmar
  • Slides: Available at StudOn for registered students
  • Exam: Paper 3 (Mustafa et al.) and Paper 4 (Wang et al.)
  • Questions:
    • What is the task of text-to-speech (TTS) synthesis? Why is TTS challenging?
    • What are traditional and recent speech synthesis methods?
    • What is the typical processing pipeline of a state-of-the-art TTS system?
    • What is the role of the acoustic model? What is the role of the neural vocoder?
    • What is the advantage of using phoneme embeddings compared to one-hot encodings?
    • Which types of sequence-to-sequence models exist?
    • How can one map sequences of different lengths?
    • What is the goal of duration prediction?
    • What is the main idea of a GAN? How is this idea applied in TTS synthesis? (See the sketch after this list.)
    • What is the input of the StyleMelGAN approach? What is its output?
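To illustrate the GAN question above, here is a heavily simplified PyTorch sketch of adversarial vocoder training: a toy generator maps mel-spectrogram frames to short waveform chunks, and a discriminator is trained to distinguish real from generated chunks. The networks, losses, and shapes are placeholder assumptions and much simpler than the actual StyleMelGAN architecture.

```python
import torch
import torch.nn as nn

# Simplified stand-ins for a vocoder generator and a discriminator
G = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))   # mel frame -> 256 waveform samples
D = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))    # waveform chunk -> real/fake score
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

mel = torch.randn(8, 80)          # toy conditioning input (8 mel frames)
real = torch.randn(8, 256)        # toy "real" waveform chunks

# Discriminator step: push real chunks toward 1, generated chunks toward 0
opt_d.zero_grad()
fake = G(mel).detach()
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
d_loss.backward()
opt_d.step()

# Generator step: try to make the discriminator predict 1 for generated chunks
opt_g.zero_grad()
g_loss = bce(D(G(mel)), torch.ones(8, 1))
g_loss.backward()
opt_g.step()
```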
  1. Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu
    Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 4779–4783, 2018. DOI
    @inproceedings{ShenPWSJYCZWRSA18_TTS_ICASSP,
    author    = {Jonathan Shen and Ruoming Pang and Ron J. Weiss and Mike Schuster and Navdeep Jaitly and Zongheng Yang and Zhifeng Chen and Yu Zhang and Yuxuan Wang and RJ Skerry-Ryan and Rif A. Saurous and Yannis Agiomyrgiannakis and Yonghui Wu},
    title     = {Natural {TTS} Synthesis by Conditioning Wavenet on {MEL} Spectrogram Predictions},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    pages     = {4779--4783},
    year      = {2018},
    url       = {https://doi.org/10.1109/ICASSP.2018.8461368},
    doi       = {10.1109/ICASSP.2018.8461368}
    }
  2. Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu
    FastSpeech: Fast, Robust and Controllable Text to Speech
    In Proceedings of the Annual Conference on Neural Information Processing Systems: 3165–3174, 2019. PDF
    @inproceedings{RenRTQZZL19_FastSpeech_NeurIPS,
    author    = {Yi Ren and Yangjun Ruan and Xu Tan and Tao Qin and Sheng Zhao and Zhou Zhao and Tie-Yan Liu},
    title     = {{FastSpeech}: {F}ast, Robust and Controllable Text to Speech},
    booktitle = {Proceedings of the Annual Conference on Neural Information Processing Systems},
    pages     = {3165--3174},
    year      = {2019},
    url-pdf   = {https://proceedings.neurips.cc/paper/2019/file/f63f65b503e22cb970527f23c9ad7db1-Paper.pdf},
    }
  3. Ahmed Mustafa, Nicola Pia, and Guillaume Fuchs
    StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization
    In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): 6034–6038, 2021. PDF DOI
    @inproceedings{MustafaPF21_StyleMelGAN_ICASSP,
    author={Ahmed Mustafa and Nicola Pia and Guillaume Fuchs},
    booktitle={Proceedings of the {IEEE} International Conference on Acoustics, Speech and Signal Processing ({ICASSP})},
    title={{StyleMelGAN}: {A}n Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization},
    year={2021},
    volume={},
    number={},
    pages={6034--6038},
    doi={10.1109/ICASSP39728.2021.9413605},
    url-pdf   = {https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9413605},
    }
  4. Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous
    Tacotron: Towards End-to-End Speech Synthesis
    In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech): 4006–4010, 2017. PDF
    @inproceedings{WangSSWWJYXCBLA17_Tacotron_Interspeech,
    author    = {Yuxuan Wang and R. J. Skerry-Ryan and Daisy Stanton and Yonghui Wu and Ron J. Weiss and Navdeep Jaitly and Zongheng Yang and Ying Xiao and Zhifeng Chen and Samy Bengio and Quoc V. Le and Yannis Agiomyrgiannakis and Rob Clark and Rif A. Saurous},
    title     = {{Tacotron}: {T}owards End-to-End Speech Synthesis},
    booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech)},
    pages     = {4006--4010},
    publisher = {{ISCA}},
    year      = {2017},
    url-pdf   = {https://www.isca-speech.org/archive/Interspeech_2017/pdfs/1452.PDF}
    }

Lecture 9: Connectionist Temporal Classification (CTC) Loss with Applications to Theme-Based Music Retrieval

  • Date: Monday, 21.06.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: Frank Zalkow, Meinard Müller
  • Slides (PDF), Handouts (6 slides per page) (PDF)
  • Exam: Paper 1 (Zalkow et al.) and Paper 3 (Graves et al.)
  • Questions:
    • In the speech recognition application, what are the data sequences to be aligned?
    • What is the difference between a strong alignment and a weak alignment?
    • How can an alignment be represented?
    • What is the CTC loss good for? What is its role in training a speech recognition system?
    • Given a label sequence Y=(b,c,d), what are the valid alignments A=(a1,a2,a3,a4) of length 4? (See the sketch after this list.)
    • How is the reduction function (kappa) defined?
    • Given a feature sequence X and an alignment A, how is the probability p(A|X) defined?
    • What is the set of all valid alignments (for a given label sequence Y)?
    • How is the CTC loss defined?
    • Why is the CTC loss differentiable?
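For the alignment questions above (in particular the one on valid alignments of length 4), the following Python sketch enumerates all alignments over the label alphabet extended by a blank symbol that reduce to a given label sequence, assuming the standard CTC reduction kappa (merge repeated symbols, then remove blanks).

```python
import itertools

BLANK = "-"

def kappa(alignment):
    """CTC reduction: merge repeated symbols, then remove blanks."""
    merged = [s for i, s in enumerate(alignment) if i == 0 or s != alignment[i - 1]]
    return tuple(s for s in merged if s != BLANK)

Y = ("b", "c", "d")
alphabet = Y + (BLANK,)

# All alignments A = (a1, a2, a3, a4) with kappa(A) = Y
valid = [A for A in itertools.product(alphabet, repeat=4) if kappa(A) == Y]
print(len(valid), valid)   # one blank inserted at any of 4 positions, or one of the 3 labels repeated
```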
  1. Frank Zalkow and Meinard Müller
    Using Weakly Aligned Score—Audio Pairs to Train Deep Chroma Models for Cross-Modal Music Retrieval
    In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR): 184–191, 2020. DOI
    @inproceedings{ZalkowM20_WeaklyAlignedTrain_ISMIR,
    author    = {Frank Zalkow and Meinard M{\"u}ller},
    title     = {Using Weakly Aligned Score--Audio Pairs to Train Deep Chroma Models for Cross-Modal Music Retrieval},
    booktitle = {Proceedings of the International Society for Music Information Retrieval Conference ({ISMIR})},
    address   = {Montr{\'{e}}al, Canada},
    pages     = {184--191},
    year      = {2020},
    doi       = {10.5281/zenodo.4245400}
    }
  2. Daniel Stoller, Simon Durand, and Sebastian Ewert
    End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-To-Character Recognition Model
    In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 181–185, 2019. DOI
    @inproceedings{StollerDE19_LyricsAlignment_ICASSP,
    author    = {Daniel Stoller and Simon Durand and Sebastian Ewert},
    title     = {End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-To-Character Recognition Model},
    booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})},
    pages     = {181--185},
    address   = {Brighton, {UK}},
    year      = {2019},
    doi       = {10.1109/ICASSP.2019.8683470}
    }
  3. Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber
    Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
    In Proceedings of the International Conference on Machine Learning (ICML): 369–376, 2006. DOI
    @inproceedings{GravesFGS06_CTCLoss_ICML,
    author    = {Alex Graves and Santiago Fern{\'{a}}ndez and Faustino J. Gomez and J{\"{u}}rgen Schmidhuber},
    title     = {Connectionist Temporal Classification: {L}abelling Unsegmented Sequence Data with Recurrent Neural Networks},
    booktitle = {Proceedings of the International Conference on Machine Learning ({ICML})},
    pages     = {369--376},
    address   = {Pittsburgh, Pennsylvania, USA},
    year      = {2006},
    doi       = {10.1145/1143844.1143891}
    }

Lecture 10: From Theory to Practice

  • Date: Monday, 28.06.2021, Start: 16:00 (zoom opens 15:50)
  • Lecturer: TBA