Carlotta Anemüller, Oliver Thiergart, and Emanuël A. P. Habets
Published in IEEE Signal Processing Letters
The degree of correlation between two audio signals entering the ears is known to have a significant impact on the spatial perception of a sound image. Audio signal decorrelation is therefore a widely used tool in various applications within the field of spatial audio processing. This paper explores for the first time the use of a data-driven approach for audio decorrelation. We propose a convolutional neural network architecture that is trained with the help of a state-of-the-art reference decorrelator. The proposed approach is evaluated using music and applause signals by means of objective evaluations as well as through a listening test. The proposed approach can serve as a proof of concept to address common limitations of existing decorrelation techniques in future work, which include introduction of temporal smearing and coloration artifacts and the production of a limited number of mutually uncorrelated output signals.
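As a point of reference for the "degree of correlation" mentioned above, the following minimal sketch computes a zero-lag normalized correlation coefficient between two signals. It is an illustration under stated assumptions, not necessarily the measure used in the paper's objective evaluation. The function name `correlation_coefficient` and the synthetic Gaussian noise signals are hypothetical and chosen only for this example.

```python
import math
import random


def correlation_coefficient(x, y):
    """Zero-lag normalized correlation between two equal-length signals.

    Returns a value in [-1, 1]; values near 0 indicate that the signals
    are (approximately) uncorrelated at lag zero.
    """
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    # Guard against silent (all-zero) inputs.
    return num / den if den > 0.0 else 0.0


# Synthetic stand-ins for audio channels (assumption: white Gaussian noise).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
n = [rng.gauss(0.0, 1.0) for _ in range(20000)]

rho_same = correlation_coefficient(x, x)   # identical channels
rho_indep = correlation_coefficient(x, n)  # independent channels
```

Identical channels yield a coefficient of 1, while independent noise yields a value close to 0. An ideal decorrelator would move its output toward the latter relative to its input.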
The following examples are a subset of the items included in the listening test. The source audio files are all taken from the evaluation subset of the FSD50K dataset [1].
Guitar¹
Piano²
Dense applause³
Sparse applause⁴
[1] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: an open dataset of human-labeled sound events,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 829–852, 2022.
[2] ISO 23090-4, “MPEG-I Immersive Audio,” WD0, 2022.
¹ Item 121007 of the FSD50K evaluation subset, uploader: Thirsk, license: https://creativecommons.org/licenses/by/3.0/.
² Item 319585 of the FSD50K evaluation subset, uploader: visual, license: https://creativecommons.org/licenses/by/3.0/.
³ Item 1923 of the FSD50K evaluation subset, uploader: RHumphries, license: https://creativecommons.org/licenses/by/3.0/.
⁴ Item 395414 of the FSD50K evaluation subset, uploader: debsound, license: http://creativecommons.org/licenses/by-nc/3.0/.