Parametric Stereo Coding

Author: Jürgen Herre

Background information: Spatial perception

Although spatial perception is far from being fully understood, it is known that directional localization of sounds depends on the evaluation of so-called spatial cues by the human auditory system (see e.g. Blauert's book on spatial hearing [1]), the most important cues being

interaural level differences (i.e. the differences in levels received by both ears)
interaural phase differences (i.e. the differences in signal phase received by both ears)
preservation of fine temporal signal envelope structure

Consequently, the fidelity of the stereo image of a coded signal depends on the coder's ability to preserve these critical cues appropriately.

Background information: Joint Stereo Coding

Joint coding techniques have proven extremely valuable for coding high-quality stereophonic (or multi-channel) audio signals at low bit rates. On the one hand, they provide mechanisms to account for binaural psychoacoustic effects; on the other hand, they can reduce the required bit rate significantly below that of coding the input channels separately.

Currently, the most common joint stereo coding techniques are Mid/Side (M/S) stereo coding [2] and Intensity Stereo coding [3] [4]. While the first method can account for binaural masking effects and achieve a certain amount of signal-dependent coding gain, the intensity stereo method offers a high potential for bit savings. Coders based on the intensity stereo principle have been described in the past for stereophonic and multi-channel coding under various names (e.g. "dynamic crosstalk" [5], "channel coupling" [6]).
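As a rough illustration of where the signal-dependent gain of M/S coding comes from, the following sketch applies the textbook sum/difference transform to two nearly identical channels. The function names and the toy signals are assumptions made for this example and are not taken from any particular codec.

```python
import numpy as np

def ms_encode(left, right):
    """Sum/difference transform: mid carries the common part, side the difference."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def ms_decode(mid, side):
    """Inverse transform; lossless as long as mid and side are not quantized."""
    return mid + side, mid - side

# Toy signals (assumption): a strong common component plus weak independent noise.
rng = np.random.default_rng(0)
common = rng.standard_normal(1024)
left = common + 0.01 * rng.standard_normal(1024)
right = common + 0.01 * rng.standard_normal(1024)

mid, side = ms_encode(left, right)
left_hat, right_hat = ms_decode(mid, side)
```

For such highly correlated channels the side signal carries very little energy and is cheap to code. For dissimilar channels the advantage shrinks, which is why the gain depends on the signal.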

Intensity stereo exploits the fact that the perception of high frequency sound components (e.g. above 4 kHz) mainly relies on the analysis of their energy-time envelopes [1] rather than the waveform itself. Thus, it is assumed sufficient to code the envelope of such a signal instead of its waveform. This is done by transmitting one common set of spectral coefficients ("carrier signal") that is shared among several audio channels instead of separate sets for each particular one. In the decoder, the carrier signal is scaled independently for each signal channel to match its original average envelope (or signal energy) for the respective coder frame. The scaling information is calculated and transmitted once for each group of spectral coefficients (scalefactor band). Effectively, the stereo image is recreated at the decoder side by a pan-pot-like operation for each spectral coder band.

Figure 1: Basic principle of intensity stereo coding
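The principle can be sketched for a single coder frame as follows. This is a minimal illustration under stated assumptions, not the procedure of any specific codec: the carrier is simply the sum of both channels, the spectral coefficients are real-valued, and `band_edges` is a hypothetical list of scalefactor band boundaries.

```python
import numpy as np

def intensity_encode(left, right, band_edges):
    """Return one shared carrier plus a (gain_left, gain_right) pair per band."""
    carrier = left + right  # assumption: plain sum; cancels if left == -right
    gains = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        e_c = np.sum(carrier[lo:hi] ** 2) + 1e-12  # guard against empty bands
        g_l = np.sqrt(np.sum(left[lo:hi] ** 2) / e_c)
        g_r = np.sqrt(np.sum(right[lo:hi] ** 2) / e_c)
        gains.append((g_l, g_r))
    return carrier, np.array(gains)

def intensity_decode(carrier, gains, band_edges):
    """Scale the carrier per band so each channel regains its original band energy."""
    left = np.zeros_like(carrier)
    right = np.zeros_like(carrier)
    bands = zip(band_edges[:-1], band_edges[1:])
    for (lo, hi), (g_l, g_r) in zip(bands, gains):
        left[lo:hi] = g_l * carrier[lo:hi]
        right[lo:hi] = g_r * carrier[lo:hi]
    return left, right
```

The decoded channels match the original energy in every band, but within a band they are scaled copies of the same carrier, which is the pan-pot-like behaviour described above.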

Some typical stereo image loss problems:

As a consequence of the intensity stereo coding / decoding process, all output signals reconstructed from a single carrier are scaled versions of each other, i.e. they have the same envelope fine structure for the duration of the coded block (e.g. 10-20 ms). This does not present a major problem for stationary signals or signals having similar envelope fine structures in the intensity stereo coded channels.

For transient signals with dissimilar envelopes in different channels, however, the original distribution of the envelope onsets between the coded channels cannot be recovered. Figures 2 and 3 show an example of this situation: in a stereophonic recording of an applauding audience, the individual envelopes will be very different in the left and right channels because the distinct clapping events happen at different times in the two channels (see Figure 2 for left and right channel envelopes).

Figure 2: High frequency envelope structures of left and right original channel signals. Excerpt from "applause" item

After the intensity stereo encoding / decoding process, the fine time structure of the signals is mostly the same in both channels, as can be seen in Figure 3. In particular, there is a structural "cross-talk" between the channels, such that perceptually important signal onsets propagate to the opposite channel (e.g. L->R at 15 ms, R->L at 47 ms, L->R at 57 ms, L->R at 70 ms).

Figure 3: High frequency envelope structures of left and right channel signals after Intensity Stereo encoding/decoding (red). Envelopes of the original signals are shown in dotted yellow lines. Excerpt from "applause" item
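The onset cross-talk can be reproduced with a toy experiment. The sketch below is a deliberate simplification: it treats one frame as a single band in the time domain, uses synthetic noise bursts in place of real claps, and all names and constants are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512  # samples in one toy coder frame

def clap(onset, decay=20.0):
    """Noise burst that starts at `onset` and decays exponentially."""
    t = np.arange(n) - onset
    envelope = np.where(t >= 0, np.exp(-np.maximum(t, 0) / decay), 0.0)
    return envelope * rng.standard_normal(n)

def first_onset(x, rel_threshold=0.1):
    """Index of the first sample whose magnitude exceeds a fraction of the peak."""
    return int(np.argmax(np.abs(x) > rel_threshold * np.max(np.abs(x))))

left = clap(onset=100)   # clap early in the frame, left channel only
right = clap(onset=300)  # clap late in the frame, right channel only

# Intensity stereo with one gain per channel for the whole frame.
carrier = left + right
e_c = np.sum(carrier ** 2)
left_hat = np.sqrt(np.sum(left ** 2) / e_c) * carrier
right_hat = np.sqrt(np.sum(right ** 2) / e_c) * carrier
```

Comparing `first_onset(right)` with `first_onset(right_hat)` shows the effect: the decoded right channel keeps its frame energy but now also contains the early onset that originally belonged to the left channel only.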

Consequently, the stereo image quality of the intensity stereo coded / decoded signal decreases significantly in such cases. The spatial impression tends to narrow and the perceived stereo image collapses towards the center position. For critical signals, like the applauding audience example, the achieved quality can no longer be considered acceptable.

Sound examples

The following sound excerpts illustrate the discussed effects:

Applause: Original stereo sound excerpt (a wide stereo image)
Applause 6k: Intensity stereo encoded/decoded (starting from 6 kHz)
Applause 4k: Intensity stereo encoded/decoded (starting from 4 kHz)
Applause 2k: Intensity stereo encoded/decoded (starting from 2 kHz)
Applause 1k: Intensity stereo encoded/decoded (starting from 1 kHz)

The provided sound files demonstrate the original applause recording as well as four sound examples with increasing deficiencies in stereo imaging quality, as would be produced by a coder without proper control of the intensity stereo coding mechanism. Note the increasing loss of people applauding in the outer left and outer right seats, as well as the overall lack of spatial impression and of distinct reproduction of the single clap events.

Parametric Stereo and Multi-Channel Coding

Parametric coding of stereo or multi-channel signals (“Spatial Audio Coding”) is a generalization of the Intensity Stereo Coding concept that emerged shortly after the year 2000. Like bandwidth extension, it contributed significantly to the state of the art by offering good-quality spatial audio even at very low bitrates.

Codecs with parametric coding of two or more channels generally reduce the original material to a mono (or stereo) downmix plus compact parametric side information that represents the most salient perceptual aspects of the spatial sound image, including Inter-Channel Level/Intensity Differences (ICLDs/ICIDs), Inter-Channel Phase/Time Differences (ICPDs/ICTDs) and Inter-Channel Coherence/Correlation (ICC). In contrast to traditional intensity stereo coding, parametric stereo/multi-channel coding is applied to the full audio bandwidth. Like other parametric coding approaches, it is not waveform-preserving, i.e. it does not attempt to reproduce the original waveforms but produces output that sounds similar to the original signals.
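To make the three cue types concrete, the following sketch estimates them for one time/frequency tile from complex subband samples of the left and right channels. Exact definitions differ between schemes; the formulas below are one common formulation, and the function name and the `eps` guard are assumptions of this example.

```python
import numpy as np

def spatial_cues(l, r, eps=1e-12):
    """Estimate ICLD (dB), ICPD (radians) and ICC for one time/frequency tile.

    l, r: complex subband samples of the left and right channel in that tile.
    """
    e_l = np.sum(np.abs(l) ** 2) + eps   # left energy
    e_r = np.sum(np.abs(r) ** 2) + eps   # right energy
    cross = np.sum(l * np.conj(r))       # complex cross-correlation at lag 0
    icld_db = 10.0 * np.log10(e_l / e_r)
    icpd = np.angle(cross)
    icc = np.abs(cross) / np.sqrt(e_l * e_r)
    return icld_db, icpd, icc
```

For example, if the right channel is a copy of the left at half the amplitude and with a phase lag of 0.3 rad, this estimator yields an ICLD of about 6 dB, an ICPD of 0.3 rad and an ICC close to 1. Two independent noise signals would instead give an ICC near 0.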

The first widely successful parametric stereo/multi-channel coding techniques were Binaural Cue Coding (BCC) [7] and Parametric Stereo (PS) [8], the latter being used in the MPEG-4 High-Efficiency AAC v2 (HE-AAC v2) codec [9]. MPEG Surround [10] provides a further generalization of this concept supporting efficient coding from stereo up to 3D audio formats, such as 7.1+4H, i.e. a 7.1 setup with 4 additional height speakers. Together with bandwidth extension, this technique allows good audio quality even at very low bitrates (e.g. 32 kbit/s for a stereo signal, and below 64 kbit/s for 5.1).

Parametric coding of stereo/multi-channel audio includes the following steps:

Extraction of the properties of the original spatial sound image (most notably as inter-channel cues in the time/frequency plane)
Compact quantization, coding and transmission of this information as additional bitstream side information
Conversion of the original signal into a mono or stereo downmix which is transmitted to the decoder (e.g. via a conventional audio codec approach)
Upmix of the transmitted downmix signal to the target channel format reinstating the original inter-channel cues that are transmitted as parametric information
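The four steps can be sketched end to end for a single frame. This is a heavily simplified illustration that transmits ICLDs only, uses a plain FFT without windowing or overlap, and sends the downmix uncoded; the 1.5 dB quantizer step, the band layout and all function names are assumptions of this example rather than values from any standard.

```python
import numpy as np

def ps_encode(left, right, band_edges, step_db=1.5):
    """Steps 1-3: extract ICLDs per band, quantize them, and form a mono downmix."""
    L, R = np.fft.rfft(left), np.fft.rfft(right)
    indices = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        e_l = np.sum(np.abs(L[lo:hi]) ** 2) + 1e-12
        e_r = np.sum(np.abs(R[lo:hi]) ** 2) + 1e-12
        icld_db = 10.0 * np.log10(e_l / e_r)
        indices.append(int(np.round(icld_db / step_db)))  # side information
    downmix = 0.5 * (left + right)  # would normally go through a core codec
    return downmix, indices

def ps_decode(downmix, indices, band_edges, step_db=1.5):
    """Step 4: upmix by re-imposing the transmitted ICLD in every band."""
    M = np.fft.rfft(downmix)
    L = np.zeros_like(M)
    R = np.zeros_like(M)
    bands = zip(band_edges[:-1], band_edges[1:])
    for (lo, hi), q in zip(bands, indices):
        ratio = 10.0 ** (q * step_db / 10.0)     # target energy ratio left/right
        g_l = np.sqrt(2.0 * ratio / (1.0 + ratio))
        g_r = np.sqrt(2.0 / (1.0 + ratio))       # g_l**2 + g_r**2 == 2
        L[lo:hi] = g_l * M[lo:hi]
        R[lo:hi] = g_r * M[lo:hi]
    n = len(downmix)
    return np.fft.irfft(L, n=n), np.fft.irfft(R, n=n)
```

Here `band_edges` must cover all `n // 2 + 1` FFT bins of the frame. Practical systems use overlapping windows or a filterbank and also transmit coherence and phase cues, which this sketch leaves out.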

Other generalizations of the concept followed later, such as Spatial Audio Object Coding (SAOC) [11] that provides efficient parametric coding of several object (rather than channel) signals.

Generally speaking, parametric multi-channel processing produces output that is perceptually convincing but not fully transparent. For very critical signals, spatial (and other) artifacts can be introduced, which are most audible when listening over headphones:

Timbral artifacts (modulation/roughness, coloration)
Different sense of acoustic space
Narrowed stereo image compared to the original
Blurred or widened stereo image compared to the original
Unstable spatial image (fluctuation of perceived position and/or width)

Examples

The examples demonstrate the following versions:

Original is the uncoded reference.
Mono Sum is the mono downmix signal used by the parametric stereo coder. It is transmitted as waveform from the parametric encoder to the decoder in the bitstream. The decoder then produces a stereo signal by resynthesizing the signal’s spatial aspects.
ICLD Coded is a resynthesis of the stereo signal by synthesizing only ICLDs per critical band. It illustrates the contribution of the ICLD synthesis for restoring the stereo image. Among all binaural cues, ICLD cues are the most important ones. Conceptually, this corresponds to an enhanced version of "intensity stereo".
ICLD Coded + ICC0 extends the stereo width of the ICLD coded version by (approximately) synthesizing ICC=0 between left and right signals (full "decorrelation"). This is provided to illustrate the effect of exaggerated ICC synthesis and results in an excessively wide stereo image. The process of decorrelation is similar to adding synthetic reverb for a wider stereo image.
ICLD+ICC Coded is a resynthesis of the stereo signal by synthesizing both ICLDs as well as ICCs to restore the signal’s original binaural cues. This corresponds to the regular operation mode of a spatially parametric coding scheme.
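The ICC synthesis mentioned in these versions can be illustrated by mixing the downmix with a decorrelated copy of itself. The sketch below uses a simple rotation-style mix, which is one well-known way to reach a target correlation and is not necessarily what the demonstrated coder does. Real decorrelators are typically all-pass or reverb-like filters; here an orthogonalized noise signal of equal energy stands in for the decorrelator output, and all names are hypothetical.

```python
import numpy as np

def synthesize_icc(mono, decorrelated, icc):
    """Mix mono and decorrelated signals so that left/right correlation equals icc.

    Assumes `decorrelated` has the same energy as `mono` and is orthogonal to it.
    """
    alpha = 0.5 * np.arccos(np.clip(icc, -1.0, 1.0))
    left = np.cos(alpha) * mono + np.sin(alpha) * decorrelated
    right = np.cos(alpha) * mono - np.sin(alpha) * decorrelated
    return left, right

def measured_icc(left, right):
    """Normalized correlation at lag zero for real-valued signals."""
    return np.sum(left * right) / np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))

def stand_in_decorrelator(mono, rng):
    """Toy stand-in: noise made orthogonal to `mono` and scaled to equal energy."""
    d = rng.standard_normal(len(mono))
    d -= mono * np.dot(d, mono) / np.dot(mono, mono)
    return d * np.sqrt(np.dot(mono, mono) / np.dot(d, d))

rng = np.random.default_rng(0)
mono = rng.standard_normal(4096)
decorrelated = stand_in_decorrelator(mono, rng)
wide_left, wide_right = synthesize_icc(mono, decorrelated, icc=0.0)
```

With `icc=1.0` both outputs equal the mono signal, while `icc=0.0` corresponds to the fully decorrelated "ICC0" case. A level difference can be applied afterwards with per-channel gains, which changes the ICLD but leaves this correlation measure unchanged.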

The first example is a stereo signal with two talkers, panned to the left and right sides, respectively. In this case, merely synthesizing ICLDs already results in a signal reconstruction that is perceptually very similar to the original. However, a residual amount of spatial instability remains, and additional sound that was not present in the original tends to be perceived in the middle of the stereo image ("phantom whisperer"). Adding ICC synthesis improves on these aspects (as demonstrated by the ICLD+ICC Coded version). However, it also tends to introduce other artifacts, as can be heard in the exaggerated ICLD Coded + ICC0 version.

Example 1: 2x Female Speech

Original, Mono Sum, ICLD Coded, ICLD Coded + ICC0, ICLD+ICC Coded

The second and third examples are pop music excerpts that utilize not only panning, but also stereophonic instruments and reverb. In this case, merely synthesizing ICLD results in a signal sounding narrower and with less room ambience than the original. Adding ICC synthesis widens the stereo image. Without properly controlled ICC synthesis, the resulting stereo image is too wide and blurry compared to the original stereo signal. With properly coded ICLD + ICC synthesis, the spaciousness of the original stereo signal can be restored.

Example 2: Funky

Original, Mono Sum, ICLD Coded, ICLD Coded + ICC0, ICLD+ICC Coded

Example 3: Gilmour

Original, Mono Sum, ICLD Coded, ICLD Coded + ICC0, ICLD+ICC Coded

References

[1] J. Blauert, "Spatial Hearing", MIT Press, 1983
[2] J. D. Johnston, A. J. Ferreira: "Sum-Difference Stereo Transform Coding", IEEE ICASSP 1992, pp. 569-571
[3] R.G.v.d. Waal, R.N.J. Veldhuis, "Subband Coding of Stereophonic Digital Audio Signals", IEEE ICASSP 1991, pp. 3601 - 3604
[4] J. Herre, K. Brandenburg, D. Lederer, "Intensity Stereo Coding", 96th AES Convention, Amsterdam 1994, Preprint #3799
[5] G. Stoll, G. Theile, S. Nielsen, A. Silzle, M. Link, R. Sedlmayer, A. Breford, "Extension of ISO/MPEG-Audio Layer II to Multi-Channel Coding: The Future Standard for Broadcasting, Telecommunication, and Multimedia Applications", presented at the 94th AES Convention, Berlin 1993, Preprint # 3550
[6] Mark Davis, "The AC-3 Multichannel Coder", 95th AES Convention, New York October 1993, Preprint # 3774
[7] C. Faller and F. Baumgarte, “Binaural Cue Coding—Part II: Schemes and Applications,” IEEE Trans. Speech Audio Processing, vol. 11 (2003 Nov.).
[8] E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegård: “Low complexity parametric stereo coding”, Proc. 116th AES convention, Berlin, Germany, 2004, Paper 6073.
[9] J. Herre, M. Dietz: "Standards in a Nutshell: MPEG-4 High-Efficiency AAC Coding", IEEE Signal Processing Magazine, Volume 25, Issue 3, pp 137 - 142, May 2008.
[10] J. Herre, K. Kjörling, J. Breebaart, C. Faller, S. Disch, H. Purnhagen, J. Koppens, J. Hilpert, J. Rödén, W. Oomen, K. Linzmeier, K. S. Chong: "MPEG Surround – The ISO/MPEG Standard for Efficient and Compatible Multichannel Audio Coding", Journal of the AES, Vol. 56, No. 11, November 2008, pp. 932-955
[11] J. Herre, H. Purnhagen, J. Koppens, O. Hellmuth, J. Engdegård, J. Hilpert, L. Villemoes, L. Terentiev, C. Falch, A. Hölzer, M. L. Valero, B. Resch, H. Mundt, and H. Oh: "MPEG Spatial Audio Object Coding – The ISO/MPEG Standard for Efficient Coding of Interactive Audio Scenes", Journal of the AES, Vol. 60, No. 9, September 2012, pp. 655-673