Speech Reverberation Artifacts

Author: Gerald Schuller
Co-Author: Jürgen Herre

The goal of a high compression ratio in perceptual audio coding has historically led to the use of transforms with a large block size or filter banks with many bands (i.e. high frequency resolution). Such spectral decompositions are suitable to obtain high coding gains for the mostly stationary parts in music signals. On the other hand, due to the so-called uncertainty principle, a high frequency resolution implies a low temporal resolution of the time/frequency representation. Thus, a high frequency resolution results in a poor control over the temporal shape of the quantization noise in the decoded audio signal, which may lead to coding artifacts for less stationary audio material. This can be perceived, for instance, as reverberation or echoiness in speech signals or pre-echoes at the "attack" portions of transient signals, such as castanets. More background information on this effect can be found in the section on the Pre-Echo phenomenon.

To obtain a better control over the temporal shaping of the quantization noise, a number of techniques were proposed over time:

One of the most popular approaches involves switching the transform / filter bank to a smaller number of bands, thus temporarily increasing temporal resolution at the expense of frequency resolution [1]. For many perceptual audio coders, this leads to two modes: one with a small number of bands (e.g. 128 bands) for very non-stationary parts of the signal (such as the attacks of castanets), and one with a large number of bands (e.g. 1024 bands) for the more stationary parts of the signal. There are, however, signals that are somewhat non-stationary but for which frequent switching to the lower number of bands is too costly in terms of coding efficiency. Coding such signals with a high number of bands then leads to audible reverberation-like artifacts. Speech signals frequently fall into this category of "intermediate" signals and thus may be a challenge for perceptual audio coders.
Another approach, which allows control over the temporal characteristics of the quantization noise in the decoded audio signal even for a high number of bands, is called Temporal Noise Shaping (TNS) [2] and involves predictive filtering of the spectral coefficients across frequency. In this way, the temporal envelope of the quantization noise follows that of the signal more closely, and thus temporal unmasking is minimized.
As a time-domain counterpart of the TNS algorithm, frequency-selective companding of the input signal prior to its quantization can also be used to avoid artifacts caused by temporal unmasking [3] [4].
A later alternative, called the companding tool in AC-4 [5], utilizes QMF-domain pre- and post-processing around the transform coding system. Prior to transform-domain encoding, the dynamic range of the signal is reduced locally within a QMF time slot; it is restored again after transform-domain decoding, which naturally shapes the coding noise temporally.
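
The idea of predicting spectral coefficients across frequency can be made more concrete with a small experiment. The following is only a minimal sketch, not the TNS algorithm of [2]: it uses a plain DCT-II as a stand-in for the coder's transform and measures how well a second-order linear predictor, running along the frequency axis, can predict the coefficients. The function names `dct2` and `prediction_gain`, the block length of 64 and the predictor order are assumptions chosen for illustration. A signal whose energy is concentrated at one instant (a click) produces a spectrum that is highly predictable across frequency, whereas a signal with a flat temporal envelope (white noise) does not; a high prediction gain is therefore an indicator that filtering across frequency can help.

```python
import math
import random


def dct2(block):
    """Unnormalised DCT-II, used here as a stand-in for the coder's transform."""
    n = len(block)
    return [sum(block[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]


def prediction_gain(values, order=2):
    """Levinson-Durbin recursion on the autocorrelation of `values`.

    Returns the ratio of input energy to prediction residual energy.
    """
    r = [sum(values[i] * values[i + lag] for i in range(len(values) - lag))
         for lag in range(order + 1)]
    coeffs = [1.0]
    error = r[0]
    for m in range(1, order + 1):
        acc = sum(coeffs[j] * r[m - j] for j in range(m))
        k = -acc / error  # reflection coefficient of stage m
        coeffs = [1.0] + [coeffs[j] + k * coeffs[m - j] for j in range(1, m)] + [k]
        error *= 1.0 - k * k
    return r[0] / error


N = 64  # illustrative block length (assumption)

click = [0.0] * N
click[20] = 1.0  # all energy at a single instant

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(N)]  # flat temporal envelope

print("prediction gain across frequency, click:", prediction_gain(dct2(click)))
print("prediction gain across frequency, noise:", prediction_gain(dct2(noise)))
```

In an actual coder, the prediction residual would be quantized instead of the coefficients themselves, and the decoder's inverse filter would then shape the quantization noise in time; that part is not shown here.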

Sound examples

In order to demonstrate the perceptual quality of artifacts relating to temporal smearing of quantization noise as they may appear in perceptual audio coding, several sound excerpts were generated using a MATLAB program. The signals illustrate the effect of quantization noise which is injected into the signal's spectral coefficients for various frequency resolutions (number of filter bank bands) and overall noise levels. This resembles the behavior of a simple perceptual audio coder which is not equipped with any of the precautions for controlling the temporal shape of the coding distortion discussed previously.

The signals were generated using a sine window and the popular Modified Discrete Cosine Transform (MDCT) [6], which is used in many of today's coding schemes. Two parameters were varied across the different signal versions:

Window Size / Number of Filter Bank Channels:
As the size of the filter bank window increases, a better frequency resolution is achieved at the expense of a decreased temporal resolution.
Distortion Level:
The processed signals contain a controlled level of noise over frequency, which is indicated relative to an (arbitrarily chosen) reference level. Processed signals with lower noise levels may be used to subsequently train critical listening with more subtle versions of the artifacts.
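
The procedure described above can be sketched in a few lines. This is not the MATLAB program used for the sound examples; it is a minimal pure-Python sketch under stated assumptions: the function names `mdct_noise_demo` and `pre_echo_length` are hypothetical, the noise model (a random error proportional to each MDCT coefficient) is an assumed stand-in for the "controlled level of noise over frequency", and the test signal is a synthetic decaying tone burst rather than speech. The sketch analyses the signal with a sine-windowed MDCT, perturbs the coefficients, resynthesises by overlap-add, and then reports how many samples before the onset the coding error first becomes noticeable.

```python
import math
import random


def mdct_noise_demo(signal, num_bands, noise_gain, seed=0):
    """Sine-windowed MDCT analysis, coefficient perturbation, overlap-add synthesis."""
    n_b = num_bands
    win = [math.sin(math.pi * (n + 0.5) / (2 * n_b)) for n in range(2 * n_b)]
    # Cosine kernel shared by the forward and the inverse transform.
    kernel = [[math.cos(math.pi / n_b * (n + 0.5 + n_b / 2) * (k + 0.5))
               for n in range(2 * n_b)] for k in range(n_b)]
    rng = random.Random(seed)
    # Pad so that every input sample is covered by two overlapping windows.
    tail = (-len(signal)) % n_b
    padded = [0.0] * n_b + list(signal) + [0.0] * (tail + n_b)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_b + 1, n_b):
        block = [win[n] * padded[start + n] for n in range(2 * n_b)]
        coeffs = [sum(row[n] * block[n] for n in range(2 * n_b)) for row in kernel]
        # Assumed noise model: error proportional to each coefficient.
        noisy = [c * (1.0 + noise_gain * rng.gauss(0.0, 1.0)) for c in coeffs]
        for n in range(2 * n_b):
            y = (2.0 / n_b) * sum(noisy[k] * kernel[k][n] for k in range(n_b))
            out[start + n] += win[n] * y
    return out[n_b:n_b + len(signal)]


def pre_echo_length(original, decoded, onset, threshold=1e-6):
    """Number of samples before the onset at which the error first exceeds threshold."""
    for i in range(onset):
        if abs(decoded[i] - original[i]) > threshold:
            return onset - i
    return 0


ONSET = 300  # silence, then a decaying tone burst (synthetic test signal)
signal = [0.0] * ONSET + [math.exp(-n / 40.0) * math.cos(0.3 * n) for n in range(212)]

for bands in (8, 64):
    decoded = mdct_noise_demo(signal, bands, noise_gain=0.1)
    print(bands, "bands: error starts",
          pre_echo_length(signal, decoded, ONSET), "samples before the onset")
```

With `noise_gain=0.0` the overlap-add reconstruction returns the input, which is a useful sanity check. With noise enabled, the error can reach back at most one window length minus one sample before the onset, so the pre-echo grows with the number of bands; the exact figures depend on the onset position relative to the block grid and on the random seed.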

The signals are based on a speech recording ("German Male Speech") from the SQAM CD of the European Broadcasting Union (EBU), which has proven to be critical for many coding schemes and thus was used in many official listening tests for coder evaluation. It is recommended to listen to these samples over headphones; otherwise, the room reverberation might mask the artifacts.

Original Signal (unprocessed)

Filter Bank Configuration	Processed (0 dB)	Processed (-3 dB)	Processed (-6 dB)	Processed (-9 dB)
2048 Filter Bank Bands, Window Size: 4096 samples (92.9 ms)	Play	Play	Play	Play
1024 Filter Bank Bands, Window Size: 2048 samples (46.4 ms)	Play	Play	Play	Play
512 Filter Bank Bands, Window Size: 1024 samples (23.2 ms)	Play	Play	Play	Play
256 Filter Bank Bands, Window Size: 512 samples (11.6 ms)	Play	Play	Play	Play
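
The figures in the table follow from simple arithmetic, sketched below. Two assumptions are made that the text does not state explicitly: a sampling rate of 44.1 kHz (which reproduces the listed window durations) and an interpretation of the dB values as amplitude ratios relative to the reference level. The helper names `window_stats` and `noise_amplitude_factor` are hypothetical. For an MDCT with N bands, the window spans 2N samples and the bands are spaced fs / (2N) apart.

```python
SAMPLE_RATE_HZ = 44100  # assumption: implied by the durations in the table


def window_stats(num_bands, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (window length in samples, window length in ms, band spacing in Hz)."""
    window_samples = 2 * num_bands  # the MDCT window spans two hops
    window_ms = 1000.0 * window_samples / sample_rate_hz
    band_spacing_hz = sample_rate_hz / (2.0 * num_bands)
    return window_samples, window_ms, band_spacing_hz


def noise_amplitude_factor(level_db):
    """Amplitude scale factor for a noise level given in dB (assumed 20*log10 scale)."""
    return 10.0 ** (level_db / 20.0)


for bands in (2048, 1024, 512, 256):
    samples, ms, spacing = window_stats(bands)
    print(f"{bands:5d} bands: {samples:5d} samples, {ms:5.1f} ms, {spacing:6.2f} Hz per band")

for level_db in (0, -3, -6, -9):
    print(f"{level_db:3d} dB -> amplitude factor {noise_amplitude_factor(level_db):.3f}")
```

Halving the number of bands halves the window duration and doubles the band spacing, which is the time/frequency trade-off the sound examples make audible.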

The following effects can be noticed when listening to the sound excerpts: The temporal "smearing" of the distortion introduces a reverberant quality into the speech signal which increases significantly with the length of the filter bank window. For large window sizes, the effect is even audible at rather small distortion levels. Accordingly, proper use of additional measures for preventing temporal unmasking is of high importance for audio coders with a high frequency resolution / number of filter bank channels.

To demonstrate the character of this coding artifact with a real coder, the number of subbands in an audio coder was artificially fixed to 1024 to prevent switching to its 128-band mode. This increases the coding artifacts and makes them more audible. The test signal again consists of German male speech, a typical test signal where artifacts are easily produced and detected. The signal has been coded at a sampling rate of 32 kHz and a bit rate of 64 kb/s (for the stereo file). Again, it is recommended to listen to these samples over headphones; otherwise, the real room reverberation might mask the artifacts:

Original (German male speech)
Encoded and decoded with 1024 bands only

It should be easy to hear that the encoded and decoded signal features more "reverberation". This is because perceptual audio coders use a psycho-acoustic model to spectrally shape the quantization noise, so that the spectral shape of the noise resembles that of the audio signal. Because of the lack of temporal control in the 1024-band mode, this noise also appears just before and after the sound elements of speech, where the quantization noise sounds like room echoes.

References

[1] B. Edler: "Codierung von Audiosignalen mit überlappender Transformation und adaptiven Fensterfunktionen", Frequenz, Vol. 43, pp. 252-256, 1989
[2] J. Herre, J. D. Johnston: "Enhancing the Performance of Perceptual Audio Coders by Using Temporal Noise Shaping (TNS)", 101st AES Convention, Los Angeles 1996, Preprint 4384
[3] T. Vaupel: "Ein Beitrag zur Transformationscodierung von Audiosignalen unter Verwendung der Methode der 'Time Domain Aliasing Cancellation (TDAC)' und einer Signalkompandierung im Zeitbereich", PhD Thesis, Universität-Gesamthochschule Duisburg, Germany, 1991
[4] B. Edler, C. Faller, G. Schuller: "Perceptual Audio Coding Using a Time-Varying Linear Pre- and Post-Filter", 109th AES Convention, Los Angeles 2000, Preprint 5274
[5] A. Biswas, P. Hedelin, L. Villemoes, V. Melkote: "Temporal Noise Shaping with Companding", Interspeech 2018, Hyderabad, India, September 2-6, 2018, Proceedings, pp. 3548-3552
[6] J. Princen, A. Johnson, A. Bradley: "Subband/Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation", IEEE ICASSP 1987, pp. 2161 - 2164

Note: Some of the audio source excerpts have been taken from the SQAM CD [Cat. No. 422204-2] by kind permission of the European Broadcasting Union (EBU)