Improved Normalizing Flow-Based Speech Enhancement using an All-pole Gammatone Filterbank for Conditional Input Representation

Martin Strauss, Matteo Torcoli and Bernd Edler

presented at SLT, 2022


Listening test items. The test data are taken from the dataset published by Valentini-Botinhao et al. [1]. The methods compared against are MetricGAN+ [2] and CDiffuSE [3].

ID        Gender  Noise  Test input SNR [dB]  Website
p232_003  male    Bus    7.5                  Link
p232_219  male    Cafe   7.5                  Link
p232_005  male    Bus    2.5                  Link
p232_032  male    Cafe   2.5                  Link
p257_186  female  Bus    7.5                  Link
p257_049  female  Cafe   7.5                  Link
p257_367  female  Bus    2.5                  Link
p257_251  female  Cafe   2.5                  Link


[1] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks,” in Proceedings Interspeech Conference, 2016, pp. 352–356.

[2] S.-W. Fu, C. Yu, T.-A. Hsieh, P. Plantinga, M. Ravanelli, X. Lu, and Y. Tsao, “MetricGAN+: An improved version of MetricGAN for speech enhancement,” in Proceedings Interspeech Conference, 2021, pp. 201–205.

[3] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional diffusion probabilistic model for speech enhancement,” in Proceedings ICASSP, 2022, pp. 7402–7406.