Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation

Fabian-Robert Stöter, Soumitro Chakrabarty, Bernd Edler and Emanuël A. P. Habets

arXiv Preprint


The task of estimating the maximum number of concurrent speakers from single channel mixtures is important for various audio-based applications, such as blind source separation, speaker diarisation, audio surveillance or auditory scene classification. Building upon powerful machine learning methodology, we develop a Deep Neural Network (DNN) that estimates a speaker count. While DNNs efficiently map input representations to output targets, it remains unclear how to best handle the network output to infer integer source count estimates, as a discrete count estimate can be tackled either as a regression or as a classification problem. In this paper, we investigate this important design decision and also address complementary parameter choices such as the input representation. We evaluate a state-of-the-art DNN audio model based on a Bi-directional Long Short-Term Memory network architecture for speaker count estimation. Through experimental evaluations aimed at identifying the best overall strategy for the task, we show results for five-second speech segments in mixtures of up to ten speakers.



In this work we consider the task of estimating the maximum number of concurrent speakers, :: k \in \mathbb{Z}^{+}_{0}::, in a single channel audio mixture ::\mathbf{x}::. Our proposed system utilizes a deep neural network architecture to learn a mapping from ::\mathbf{x}:: to ::k::. Note that our task differs from estimating the total number of speakers ::L:: (indexed ::l = 1, \dots, L::). Naturally, not all speakers are active at every time instance. Our proposed task of estimating ::k\leq L:: is more closely related to source separation, whereas estimating ::L:: itself is more useful for tasks such as speaker diarisation, where speakers rarely overlap. We assume that no prior information is available except the maximum number of speakers, ::L::, which represents an upper limit for the estimation. We illustrate our setup in a “cocktail-party” scenario featuring ::L=3:: speakers in the figure next to this paragraph.
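The classification-vs-regression design decision for the network output can be illustrated with a minimal sketch (the outputs below are hypothetical and not produced by the paper's actual model): a classification head yields a score per count class 0..::L:: and the estimate is the most likely class, while a regression head yields a single real value that must be rounded and clipped to obtain an integer count.

```python
import numpy as np

L = 10  # assumed maximum number of speakers

def count_from_classification(logits):
    """Treat the network output as scores over the L + 1 count
    classes 0..L and pick the most likely class."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()              # softmax over count classes
    return int(np.argmax(probs))

def count_from_regression(y):
    """Treat the network output as a real-valued count estimate;
    round to the nearest integer and clip to the valid range."""
    return int(np.clip(np.round(y), 0, L))

# hypothetical network outputs for a mixture with 3 concurrent speakers
logits = np.array([0.1, 0.3, 1.2, 4.0, 1.1, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0])
print(count_from_classification(logits))  # -> 3
print(count_from_regression(2.7))         # -> 3
```

Note that the two heads differ in their error structure: classification treats all confusions equally, while regression penalizes estimates by their distance from the true count.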


In this video we present the results for eleven random samples of five seconds duration from our test data, where each sample consists of a unique set of speakers. We also use a sliding window scheme, so that some samples contain two sets of speakers, e.g. one set in the first part and a different set in the second. In the evaluation in our paper, however, we only evaluate samples with a homogeneous set of speakers, as indicated by the large dots in the animated plot. Further results, details and parameters associated with the model and implementation are provided in the paper.

Pre-Trained Estimator

LibriCount: Dataset


The dataset contains a simulated cocktail party environment of [0..10] speakers, mixed at 0 dB SNR from random utterances of different speakers from the LibriSpeech CleanTest dataset.

For each recording we provide the ground truth number of speakers in the file name, where k in k_uniquefile.wav is the maximum number of concurrent speakers within the 5 seconds of the recording.
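Given this naming scheme, the ground truth count can be read directly from the file name. A minimal sketch (the example file name below is made up for illustration):

```python
from pathlib import Path

def speaker_count_from_filename(path):
    """The count k is encoded as the leading, underscore-separated
    token of the file name (k_uniquefile.wav)."""
    return int(Path(path).name.split("_", 1)[0])

# hypothetical file name following the k_uniquefile.wav scheme
print(speaker_count_from_filename("7_00234.wav"))  # -> 7
```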

All recordings are 5 s in duration. For each unique recording, we provide the audio as a WAV file (16 bit, 16 kHz, mono) and a JSON annotation file with the same name as the recording.

              Speakers         Number of Samples   Download Size
Test Data     0-10 Speakers    5720 Samples
Sample Data   1-10 Speakers    100 Samples         20 MB

Creative Commons License
The data is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. If you want to use this dataset in your academic research, please cite our paper.


In the annotation file we provide information about each speaker's sex, their unique speaker_id, and their vocal activity within the mixture recording, given in samples. Note that the activity annotations were automatically generated using a voice activity detection method.

In the following example, the ground truth speaker count is 3.

        "sex": "F",
        "activity": [[0, 51076], [51396, 55400], [56681, 80000]], "speaker_id": 1221
        "sex": "F",
        "activity": [[0, 51877], [56201, 80000]],
        "speaker_id": 3570
        "sex": "M",
        "activity": [[0, 15681], [16161, 68213], [73498, 80000]], "speaker_id": 5105