Dynamic Slimmable Networks for Efficient Speech Separation

M. Elminshawi, S. Chetupalli, and E. A. P. Habets

Submitted to TASLP.

Abstract

Recent progress in speech separation has mainly been driven by advances in deep neural networks. However, deploying such networks on resource-constrained devices remains challenging due to their high computational and memory requirements. A significant inefficiency of conventional systems lies in their use of static network architectures, which incur a constant computational cost on every input segment regardless of its difficulty. This is sub-optimal for simpler segments that do not require intensive processing, such as silence or non-overlapping speech. To address this limitation, we propose a dynamic slimmable network (DSN) for speech separation that adaptively adjusts its computational complexity at inference time based on the input signal. The DSN combines a slimmable network with a gating mechanism that dynamically determines the network width by analyzing the characteristics of input segments. To guide the network toward a better balance between efficiency and performance, we introduce a signal-dependent complexity loss that leverages the segmental reconstruction quality as a proxy for segment difficulty. Experiments on clean and noisy two-speaker mixtures from the WSJ0-2mix and WHAM! datasets show that the DSN achieves competitive separation performance compared to static networks while significantly reducing the computational cost.
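The core idea of pairing a slimmable layer with a per-segment width gate can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the class and function names, the width set, and the energy-based difficulty proxy (standing in for the learned gate) are all illustrative assumptions.

```python
import numpy as np

class SlimmableLinear:
    """Toy slimmable layer: at inference, only the first `width`
    fraction of output channels is computed (illustrative, not the
    paper's architecture)."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((out_dim, in_dim)) * 0.1

    def forward(self, x, width):
        # Active sub-network: first k rows of the full weight matrix;
        # the remaining channels are slimmed away (left at zero).
        k = max(1, int(round(width * self.W.shape[0])))
        y = np.zeros(self.W.shape[0])
        y[:k] = self.W[:k] @ x
        return y

def gate(segment, widths=(0.25, 0.5, 1.0), thresholds=(0.01, 0.1)):
    """Hypothetical gate: picks a width per segment. Here segment
    energy stands in for the learned difficulty estimate, so quiet
    segments (e.g. silence) get the narrowest width."""
    energy = float(np.mean(segment ** 2))
    if energy < thresholds[0]:
        return widths[0]
    if energy < thresholds[1]:
        return widths[1]
    return widths[2]

layer = SlimmableLinear(in_dim=8, out_dim=16)
silence = np.zeros(8)
speech = np.ones(8)
y = layer.forward(speech, gate(speech))       # full width for a "hard" segment
y_easy = layer.forward(silence, gate(silence))  # narrow width for silence
```

In the actual DSN the gate is itself a small learned network and the slimmable layers are part of a separation model, but the control flow is the same: the gate inspects each segment and selects the sub-network width before the forward pass.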

Audio Examples

WSJ0-2mix (100% Overlap)

[Sample 1] [Sample 2] [Sample 3]

WSJ0-2mix (50% Overlap)

[Sample 1] [Sample 2] [Sample 3]

WSJ0-2mix (25% Overlap)

[Sample 1] [Sample 2] [Sample 3]

WHAM! (100% Overlap | 100% Noise Activity)

[Sample 1] [Sample 2] [Sample 3]

WHAM! (100% Overlap | 50% Noise Activity)

[Sample 1] [Sample 2] [Sample 3]

WHAM! (25% Overlap | 100% Noise Activity)

[Sample 1] [Sample 2] [Sample 3]

WHAM! (25% Overlap | 50% Noise Activity)

[Sample 1] [Sample 2] [Sample 3]