J. Wechsler, W. Mack and E. A. P. Habets
Submitted to EUSIPCO.
The direction-of-arrival (DOA) of acoustic sources is an important parameter in multichannel acoustic signal processing (ASP), used, e.g., for source extraction. To make DOA estimators signal-aware, i.e., to localize only the sources of interest (SOIs) and disregard other sources, deep learning (DL)-based time-frequency masking has been widely used. The mask is applied to feature representations of the microphone signals. DOA estimators can be either model-based or DL-based, so their combination with the DL-based mask estimator can be either hybrid or fully data-driven. Although fully data-driven systems can be trained end-to-end, existing training losses for hybrid systems such as weighted steered-response power require ground-truth (GT) microphone signals, i.e., signals containing only the SOIs. In this work, we propose a loss function that enables training hybrid DOA estimation systems end-to-end using the noisy microphone signals and the GT DOAs of the SOIs, and hence does not depend on the GT signals. We show that weighted steered-response power trained using the proposed loss performs on par with weighted steered-response power trained using an existing loss that depends on the GT microphone signals. End-to-end training yields consistent performance irrespective of the explicit application of phase transform weighting.
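To make the idea of mask-weighted steered-response power concrete, the following is a minimal NumPy sketch of a generic far-field formulation with optional phase transform weighting. It is an illustration under stated assumptions, not the authors' implementation and not the proposed loss: the function name `weighted_srp`, the helper `simulate`, the square four-microphone array, the frequency grid, and the binary mask are all hypothetical choices made for this example.

```python
import numpy as np

def weighted_srp(X, mask, mic_pos, freqs, angles, c=343.0, phat=True):
    """Mask-weighted steered-response power (generic far-field sketch).

    X: (M, F, T) complex STFTs, mask: (F, T) weights in [0, 1],
    mic_pos: (M, 2) positions in metres, freqs: (F,) in Hz,
    angles: (A,) candidate DOAs in radians. Returns an (A,) power map.
    """
    M = X.shape[0]
    # Unit vectors pointing toward each candidate far-field direction.
    u = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (A, 2)
    power = np.zeros(len(angles))
    for i in range(M):
        for j in range(i + 1, M):
            cross = X[i] * np.conj(X[j])  # (F, T) cross-spectrum
            if phat:
                # Phase transform: keep only the phase of each bin.
                cross = cross / (np.abs(cross) + 1e-12)
            # The mask decides which time-frequency bins contribute.
            csd = np.sum(mask * cross, axis=1)  # (F,)
            # Expected delay difference between mics i and j per candidate.
            tau = -((mic_pos[i] - mic_pos[j]) @ u.T) / c  # (A,)
            steer = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])  # (A, F)
            power += np.real(steer @ csd)
    return power

# Toy scene (assumed): SOI at 60 degrees, interferer at 200 degrees,
# active in disjoint time frames so an ideal binary mask separates them.
rng = np.random.default_rng(0)
mic_pos = 0.05 * np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
freqs = np.linspace(100.0, 4000.0, 64)
angles = np.deg2rad(np.arange(0, 360, 5))
F, T = len(freqs), 20

def simulate(angle):
    """Far-field plane wave from `angle` with a random source spectrum."""
    u = np.array([np.cos(angle), np.sin(angle)])
    tau = -(mic_pos @ u) / 343.0  # (M,) arrival times relative to the origin
    S = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
    return S[None] * np.exp(-2j * np.pi * freqs[None, :, None] * tau[:, None, None])

soi_angle, interferer_angle = np.deg2rad(60), np.deg2rad(200)
mask = np.zeros((F, T))
mask[:, : T // 2] = 1.0  # SOI occupies the first half of the frames
X = simulate(soi_angle) * mask + simulate(interferer_angle) * (1 - mask)

est_soi = angles[np.argmax(weighted_srp(X, mask, mic_pos, freqs, angles))]
est_other = angles[np.argmax(weighted_srp(X, 1 - mask, mic_pos, freqs, angles))]
print(np.rad2deg(est_soi), np.rad2deg(est_other))
```

In this toy scene the mask alone determines which source the power map peaks at, which is the sense in which the estimator becomes signal-aware. In a hybrid system the mask would come from a DL-based estimator rather than being given, and how that estimator is trained is the subject of the paper.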
Note that the SPS-based masking and the proposed masking are not devised for source extraction.
 Z. Wang, X. Zhang, and D. Wang, “Robust speaker localization guided by deep learning-based time-frequency masking,” IEEE/ACM Trans. Aud., Sp., Lang. Proc., vol. 27, no. 1, pp. 178–188, 2019.
 W. Mack, J. Wechsler, and E. A. P. Habets, “End-to-end signal-aware direction-of-arrival estimation using attention mechanisms,” Computer Speech & Language, vol. 75, p. 101363, 2022. [Online]. Available: https://doi.org/10.1016/j.csl.2022.101363