Neural Audio Decorrelation Using Generative Adversarial Networks

Carlotta Anemüller, Oliver Thiergart, and Emanuël A. P. Habets

Accepted for publication at IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023

Abstract

The spatial perception of a sound image is significantly influenced by the degree of correlation between the two audio signals received by the ears. Audio signal decorrelation is therefore a commonly used tool in various spatial audio processing applications. In this paper, we propose a novel approach to audio decorrelation using generative adversarial networks. As the generator, we employ a convolutional neural network architecture that was recently proposed for audio decorrelation. In contrast to previous work, the loss function is defined solely with respect to the input audio signal; no reference output signal is required. This makes it possible to tailor the training procedure specifically to the desired output signal properties and potentially to outperform conventional decorrelation techniques. The proposed approach is compared to a state-of-the-art conventional decorrelation method by means of objective evaluations as well as a listening test, considering a variety of signal types.
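
Decorrelation aims at an output that sounds like the input while being, ideally, uncorrelated with it. As a rough objective check of this property, the broadband normalized cross-correlation and the frequency-dependent magnitude-squared coherence between input and output can be computed as sketched below. This is a minimal illustration, not the evaluation metric used in the paper; file names and analysis parameters are placeholders.

    import numpy as np
    import soundfile as sf  # assumed I/O helper, not specified by the paper
    from scipy.signal import correlate, coherence

    def normalized_xcorr_peak(x, y):
        """Peak absolute value of the normalized cross-correlation.
        Values near 1 mean the signals are highly similar; values near 0
        indicate strong decorrelation."""
        x = x - x.mean()
        y = y - y.mean()
        r = correlate(x, y, mode="full")  # cross-correlation over all lags
        return np.max(np.abs(r)) / np.sqrt(np.sum(x**2) * np.sum(y**2))

    # Hypothetical usage with placeholder file names:
    x, fs = sf.read("original.wav")          # unprocessed input signal
    y, _ = sf.read("proposed_mono.wav")      # decorrelator output
    print("xcorr peak:", normalized_xcorr_peak(x, y))

    # Frequency-dependent view: magnitude-squared coherence per bin.
    f, Cxy = coherence(x, y, fs=fs, nperseg=1024)
    print("mean coherence:", Cxy.mean())

By the Cauchy-Schwarz inequality, the normalized cross-correlation peak lies in [0, 1], so it can be read directly as a similarity score across the different signal types.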

Audio Examples

The following examples correspond to the items included in the listening test. The original audio files are taken from the EBU SQAM CD [1] and the FSD50K dataset [2].

  • Original: unprocessed single-channel input signal
  • Proposed - mono: single-channel output signal of the proposed method
  • MPEG-I - mono: single-channel output signal of the state-of-the-art conventional decorrelation method described in [3]
  • Proposed - stereo: mid/side stereo output signal of the proposed method (a common mid/side construction is sketched after this list)
  • MPEG-I - stereo: mid/side stereo output signal of the state-of-the-art conventional decorrelation method described in [3]
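
The stereo items combine the original and the decorrelated signal as a mid/side pair. The sketch below shows one common mid/side-to-left/right construction, assuming the input acts as the mid channel and the decorrelator output as the side channel; this is an illustrative assumption, not necessarily the exact downmix used for the test items, and the file names are placeholders.

    import numpy as np
    import soundfile as sf  # assumed I/O helper, not specified by the paper

    def midside_to_stereo(mid, side, gain=0.5):
        """Build a left/right pair from a mid (original) and a side
        (decorrelated) signal. With gain=0.5, left + right equals mid."""
        n = min(len(mid), len(side))     # guard against length mismatch
        left = gain * (mid[:n] + side[:n])
        right = gain * (mid[:n] - side[:n])
        return np.stack([left, right], axis=-1)  # shape (n, 2): stereo

    # Hypothetical usage with placeholder file names:
    mid, fs = sf.read("original.wav")        # mono input signal
    side, _ = sf.read("proposed_mono.wav")   # decorrelated output
    sf.write("proposed_stereo.wav", midside_to_stereo(mid, side), fs)

The gain of 0.5 keeps the mono downmix (left + right) identical to the original signal, so any timbral difference between the stereo items stems from the decorrelated side channel alone.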

Music 1¹

Music 2²

Castanets³

Violin⁴

Applause⁵

Speech⁶

Waves⁷

References

[1] European Broadcasting Union, “Sound quality assessment material recordings for subjective tests - Users’ handbook for the EBU-SQAM compact disc,” Tech. Rep. 3253-E, Apr. 1988.

[2] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An open dataset of human-labeled sound events,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 30, pp. 829–852, 2022.

[3] S. Disch, “Decorrelation for immersive audio applications and sound effects,” in Proc. DAFx-23, Copenhagen, Denmark, Sept. 2023.


1 Track 70 of the EBU SQAM CD.

2 Track 69 of the EBU SQAM CD.

3 Track 27 of the EBU SQAM CD.

4 Track 59 of the EBU SQAM CD.

5 Item 395414 of the FSD50K evaluation subset, uploader: debsound, license: http://creativecommons.org/licenses/by-nc/3.0/.

6 Track 49 of the EBU SQAM CD.

7 Item 161700 of the FSD50K evaluation subset, uploader: xserra, license: http://creativecommons.org/licenses/by/3.0/.