Improved Normalizing Flow-Based Speech Enhancement using an All-pole Gammatone Filterbank for Conditional Input Representation

Martin Strauss, Matteo Torcoli and Bernd Edler

accepted to SLT, 2022

Examples

Listening test items. The test data are taken from the dataset published by Valentini et al. [1]. The methods used for comparison are MetricGAN+ [2] and CDiffuSE [3].

ID        Gender  Noise  Test input SNR [dB]  Website
p232_003  male    Bus    7.5                  Link
p232_219  male    Cafe   7.5                  Link
p232_005  male    Bus    2.5                  Link
p232_032  male    Cafe   2.5                  Link
p257_186  female  Bus    7.5                  Link
p257_049  female  Cafe   7.5                  Link
p257_367  female  Bus    2.5                  Link
p257_251  female  Cafe   2.5                  Link

References

[1] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks,” in Proceedings Interspeech Conference, 2016, pp. 352–356.

[2] S.-W. Fu, C. Yu, T.-A. Hsieh, P. Plantinga, M. Ravanelli, X. Lu, and Y. Tsao, “MetricGAN+: An improved version of MetricGAN for speech enhancement,” in Proceedings Interspeech Conference, 2021, pp. 201–205.

[3] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional diffusion probabilistic model for speech enhancement,” in Proceedings ICASSP, 2022, pp. 7402–7406.