H. Nazim Bicer, Philipp Götz, Cagdas Tuna, and Emanuël A. P. Habets
Submitted to IWAENC-2022
This study presents a sonification method that provides "audible explanations" to improve the transparency of the decision-making processes of convolutional neural networks designed for acoustic scene classification (ASC). First, a deep neural network (DNN) based on the ResNet architecture is proposed. Second, Grad-CAM and guided backpropagation images are computed for a given input signal. These images are then used to produce frequency-selective filters that retain the signal components in the input that contribute to the decision of the trained DNN. The test results demonstrate that the proposed model outperforms two baseline models. The reconstructed audio waveform is interpretable by the human ear, serving as a valuable tool to examine and possibly improve ASC models.
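The filtering step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the relevance image (from Grad-CAM or guided backpropagation) has already been resampled to the time-frequency grid of a short-time Fourier transform, clips and scales it to [0, 1], and applies it as a multiplicative mask before resynthesis. The function names `stft`, `istft`, and `sonify`, the Hann window, and the frame sizes are all illustrative assumptions.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    # Periodic Hann window: overlapping copies sum to 1 at 50% overlap.
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=512, hop=256):
    """Overlap-add resynthesis matching stft() (no synthesis window)."""
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
    return out


def sonify(x, saliency, n_fft=512, hop=256):
    """Keep the time-frequency bins that a relevance map marks as important.

    `saliency` is a hypothetical relevance image with the same shape as the
    STFT of `x` (assumption: it was resampled to that grid beforehand).
    """
    spec = stft(np.asarray(x, dtype=float), n_fft, hop)
    mask = np.clip(np.asarray(saliency, dtype=float), 0.0, None)
    if mask.shape != spec.shape:
        raise ValueError("saliency must match the STFT shape")
    peak = mask.max()
    if peak > 0:
        mask = mask / peak  # scale the mask to [0, 1]
    # Scaling the complex bins changes magnitudes and keeps the input phase.
    return istft(spec * mask, n_fft, hop)
```

With a map that is 1 everywhere, the output matches the input except for the tapered first and last half frames; with a map that is nonzero only in a band of frequency bins, only that band remains audible.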
Below are five listening examples from different acoustic scenes. For each scene, the input signal and the two sonification outputs, based on Grad-CAM and guided backpropagation, are presented.