Fast convolution

Lab Course
International Audio Laboratories Erlangen
Prof. Dr. Frank Wefers
Winter Term 2019

Fast convolution, © September 2018

This experiment gives an overview on finite impulse response (FIR) filters and their efficient realization with fast convolution techniques. You will examine, program and test implementations of such filters, in the time and frequency domain.

Digital filters are used to modify, enhance, alter audio signals in countless applications. Filters can be categorized into different classes based on their properties and structure. A filter maps an input signal $x(n)$ to an output signal $y(n)$. Those filters with a linear input-to-output relationship between $a\cdot x(n) \rightarrow a\cdot y(n)$ are called linear filters. Adding the constraint of time-invariance $x(n-k) \rightarrow y(n-k)$ leads to linear time-invariant filters. This particular class suits a wide range of applications, including equalization (EQ), 3D audio rendering, spatial sound playback and reverberation effects.

A linear time-invariant filter consists of three building blocks only: additions, multiplications and delays. Depending on the arrangement of these units, the filter has either feed-forward structure or it contains feedback loops. Feedback loops usually result in an infinitely long decaying impulse response, that never reaches exactly zero. Hence this sub class is called infinite impulse response filters (IIR filters). When the filter is free of feedback-loops and feed-forward only, the finite number of internal building blocks will result in a finite-length impulse response. Such a filter is called finite impulse response filter (FIR filter). In this exercise we consider FIR filters only. FIR filters can have different internal structures (block diagrams). However, as their impulse response is finite, one can always arrange them in a unique way, called the direct form. It is shown below. This block diagram marks the starting point of our considerations.

A direct form FIR filter works as follows: The input signal $x(n)$ is delayed sample by sample. All intermediate signals $x(n), x(n\!-\!1), x(n\!-\!2), \dotsc$ are weighted by corresponding filter coefficients $h_0, h_1, h_2, \dotsc$, and summed up to form the output, given by

\begin{equation} y(n) = \sum\limits_{k=0}^{N-1} x(n-k)\cdot h(k) = x(n) \ast h(n) \end{equation}

This sum formula above is known as discrete convolution. It can be regarded as a mathematical operator $\ast$, that takes two input sequences and produces one output sequence. Looking at the block diagram it becomes clear, that the filter coefficients fully determines the FIR filter ("fingerprint"). The impulse response $h(n)$ is its "fingerprint". Convolution allows us to compute the output $y(n)$ for any input $x(n)$, when $h(n)$ is known.

The block diagram and formula describe essentially the same thing. FIR filtering means computing discrete convolutions and vice versa. Hence, a fast convolution allows us to realize an FIR filter efficiently. What this exactly means, you will find out in the next minutes.

Homework Exercise 1
Two sequences of length $M$ and $N$ are convolved.
What is the length of the result?
Homework Exercise 2
Convolve the sequences $x(n)=[1, 1, 1, 1, 1]$ and $h(n)=[1, 1, 1]$.
What is result $y(n)=x(n)\ast h(n)$?

2.1 Implementation and testing

Let us start programming. First, install some Python packages needed for this exercise.

In [ ]:
!pip install soundfile librosa==0.6

Then, import them using the following cell.

In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf

from IPython.display import Audio
from glob import glob
from time import process_time

Later in this exercise, the following helper functions will make life easier for you. Take a look at the functions and execute the cell to define them.

In [ ]:
def zeropad(x, N):
    """Pads a signal with zeros on the right to reach the given length"""
    M = len(x)
    assert M<=N, "Signal is longer then destination length"
    xp = np.zeros((N,))
    xp[0:M] = x
    return xp

def nextPowerOf2(n):
    """Returns the next larger power of two greater or equal n"""
    return 1 if n == 0 else 2**(n - 1).bit_length()

Now let us implement an FIR filter and verify that it works correctly. Afterwards we test it with different input signals and impulse responses and measure its computation times.

Lab Experiment 1
Implement the FIR filter in the block diagram shown above in the predefined function below.
Debug and test it with the two sequences you convolved by hand, until you verified that it produces correct results

Hints: The NumPy functions roll and dot may be helpful
In [ ]:
def convolution(x, h):
    """Compute the discrete convolution of two sequences"""
    M = len(x)
    N = len(h)

    # Your turn ...
    y =

    return y

For testing purposes you find audio files in the files directory. Let see what we have there...

In [ ]:
Lab Experiment 2
Test your filter with some of the provided audio files.
Listen to the input signal, filter impulse response and resulting output signal.
Print the length of the signals.

Which relationship between the lengths and runtimes do you observe?
In [ ]:
x, Fs ='files/speech.wav')
display( Audio(x, rate=Fs) )

# Your turn ...

2.2 Computational performance

How efficient can we filter the input signal $x(n)$ and compute its output $y(n)$? In other words, how many operations, processor instructions or clock cycles does the filtering consume? For an application that is important to know. The computational effort of an algorithm can be assessed in two different ways:

  1. We can analyze the algorithm (block diagram) and count the number of relevant operations. This usually includes arithmetic operations (i.e. additions and multiplications) only. The result is expressed as a function of the input lengths. This theoretic complexity allows us to compare different algorithms on paper ("is algorithm A superior to algorithm B?").

  2. We can measure execution times of an implementation of the algorithm on an actual machine. This captures all practical effects, such as the memory access, cache utilization, etc. The result are runtimes or equivalent clock cycles for specific input lengths.

Homework Exercise 3
Examine the block diagram above.

1. How many operations does it take to filter an input signal $x(n)$ of length $M$ with the impulse response $h(n)$ of length $N$?
2. Consider the case that both operands are of length $N$: Come up with a simple upper bound on the number of operations ($\mathcal{O}$-notation).
Lab Experiment 3
Measure the actual runtime of $N\!\times\! N$ discrete convolutions for different length $N$. Plot the result in a 2D diagram (runtime $T$ [ms] over length $N$)

Hint: Measure times using the function process_time, which is already included.
Note: For more comparable values, you can also measure the NumPy function convolve
In [ ]:
sizes = range(1000, 11000, 1000)
runtimes_convolution = []
cycles = 20

for N in sizes:
    # Your turn ...

    runtimes_convolution.append( ... )

plt.plot(sizes, runtimes_convolution, '.-')
plt.xlabel('Size N')

FIR filtering in the time domain becomes computationally complex when operands get longer. Implementing an FIR filter as shown in block diagram, results at a number of operations that is quadratic over the operand lengths. For real-time applications this vast amount of operations might be even too to be handled. This computational burden is overcome by fast convolution techniques.

A fast convolution is considered an algorithm that computes an $N\!\times\!N$ discrete convolution in asymptotically less than $\mathcal{O}(N^2)$ steps. Several different concepts techniques are known to achieve this computational efficiency. The by far most widely used method is fast convolution with the Fast Fourier Transform (FFT), which we will regard here. This technique is known since the 1960s and became popular as FFT algorithms became available. Still, more than half a decade later, it remains the most often used building block for all kinds of signal processing and filtering techniques. Let us examine how it works.

3.1 Discrete Fourier Transform and Circular Convolution Property

The Discrete Fourier Transform (DFT) is a well-known transform, used in many fields of science and technology. It is defined as follows:

\begin{equation} x(n) \overset{\mathcal{DFT}}{\rightarrow} X(k) = \sum\limits_{n=0}^{N-1} x(n)e^{-2\pi i\frac{n k}{N}} \hspace{2em}\text{and}\hspace{2em} X(k) \overset{\mathcal{IDFT}}{\rightarrow} x(n) = \frac{1}{N} \sum\limits_{k=0}^{N-1} X(k) e^{2\pi i\frac{n k}{N}} \end{equation}

A length-$K$ DFT maps $K$ time domain samples $x(n)$ to $K$ spectral coefficients $X(k)$ in the frequency domain.

A inherent concept of the DFT is its periodicity, that occurs in time and frequency domain. A length-$K$ DFT always assumes that input and output are periodic every $K$ elements. Discrete convolution, needed for implementing FIR filters, is aperiodic and somewhat incompatible with the DFT. This is overcome by introducing a periodic convolution operator. The concept is illustrated below

Simply assume, that the input sequence $x(n)$ infinitely repeats to both sides (positive and negative $n$) every $K$ samples. This sequence is denoted by $\tilde{x}(n)$. Then, circular convolution is defined the same way as discrete convolution

\begin{equation} y(n) = \sum\limits_{k=0}^{N-1} \tilde{x}(n-k)\cdot h(k) = x(n) \circledast h(n) \end{equation}

The Circular Convolution Property (CCP) states, that circular convolution $\circledast$ in the time domain equals the pairwise product $\times$ of spectral coefficients in the frequency domain \begin{equation} x(n) \circledast h(n) = \mathcal{IDFT}\{ \mathcal{DFT} \{ x(n)\} \times \mathcal{DFT}\{ h(n)\} \} \end{equation}

This marks a huge computational improvement, as a $K\!\times\!K$ circular convolution in the time domain requires $\mathcal{O}(K^2)$ steps to compute. In the frequency domain it becomes just $K$ complex-valued multiplications, which can be computed in $\mathcal{O}(K)$ steps. Digging deeper in the mathematical background one will find, that this reduction results from a matrix diagonalization. When circular convolution is written as a matrix-vector product, the DFT marks an eigenvector base that diagonalizes this convolution matrix. What remains is a simple product of a diagonal matrix with a vector. This is where the impressive reduction originates from.

3.2 Fast Fourier Transforms

Computing a length-$K$ DFT/IDFT by the formulae above also requires $\mathcal{O}(K^2)$. Consequently, there would be no benefit from computing convolutions in the frequency domain. This changes, when we use Fast Fourier Transforms (FFT), that compute a length-$K$ DFT/IDFT by just $\mathcal{O}(K\log K)$ operations.

How is this possible? Details on FFT algorithms are way beyond the scope of this experiment. Therefore only a few important remarks are given here. More information can be found in referenced literature. In essence, an FFT algorithm hierachically decomposes long DFTs into shorter DFTs (divide-and-conquer strategy). This tree-like decomposition removes redudant computations, that occur when using the definition formulae. Short DFTs of lengths 8, 16, 32, etc. are realized with highly-optimized machine code (assembly language) called codelets. As many different decompositions as well as codelets exist, a planner determines the best (most efficient) computation strategy. This strategy and the resulting execution time highly depends on the length $K$ of the transform.

Fast Fourier Transforms can be computed of arbitrary lengths $K$, including prime numbers $K$. However, they execute most efficiently for lengths $K$, where the number $K$ is composite. This means, that the prime factorization of $K$ contains few prime factors with comparably large exponents (ideally pure powers-of-two or small integer multiples of these). Prime-length DFTs require computationally complex reformulations into non-prime-length transforms, making them comparibly inefficient. For this reason, the right choice of transform size determines the performance of the algorithm. In most cases, using the next larger of power two will perform faster than using the exact length required for the task.

3.3 Realizing discrete convolutions by circular convolutions

For realizing FIR filters we need to implement the discrete convolution, not circular convolution. Is there a way to achieve discrete convolution by circular convolution?

Recall, that a length-$K$ circular convolution maps $K, K$ input to $K$ output values. In other words, the circular convolution requires both inputs to have the same length $K$. Given two sequences $x(n), h(n)$ of length $M, N$ we first have to select a length $K\ge\max(M,N)$. Then we append zeros to the inputs, so that both sequences match the length $K$. This is referred to as zero padding.

Secondly, we need to make sure, that the result of the desired $M\times N$ discrete convolution fits into the selected length $K$. Otherwise, the overhanging results would be mirrored back at the beginning of the period and corrupt the output. This is called time-aliasing. How zero padding solves this problem can be seen in the figure above. Making the period long enough to store all convolution values effectively turns the circular convolution into the the aperiodic discrete convolution, that we desire. For the sake of an efficient FFT, we can also select length $K$ to be longer than the minimum.

Homework Exercise 4
Consider fast convolution with inputs of length $M$ and $N$.
Which is the smallest possible transform length $K$ you can use?
The complete FFT fast convolution algorithm is shown below.

It works as follows: - The input sequences are padded to length $K$ and then transformed using length-$K$ FFTs - The $K$ spectral coefficients are multiplied (complex-valued) - The resulting spectrum is transformed back into the time domain. Unwanted excess samples are trimmed off.
Lab Experiment 4
Implement an FFT-based fast convolution in the predefined function below.
As before, debug and test it with the two sequences you convolved by hand.
Observe what happens, when you override the transform length $K$ and use smaller/longer values.

Hints: Use the NumPy functions fft.fft and fft.ifft to compute FFTs and IFFTs. After the inverse transform, extract the real parts only using the function real. This makes the results much more readable.
In [ ]:
def fftconvolution(x, h, K=None):
    """Fast convolution of two sequences using the FFT"""
    M = len(x)
    N = len(h)

    # No predefined transform length? Choose a suitable one!
    if K is None: K = nextPowerOf2(M+N-1)

    # Your turn ...

    return np.real( ... )

3.4 Computational performance

Lab Experiment 5
Now for the fast algorithm, measure the runtime of $N\!\times\! N$-convolutions as done before.
Plot the new results together with the previous results in a 2D diagram (runtime $T$ over length $N$)

How much are the savings by the fast processing?
What happens, when you use the minimum transform length?
In [ ]:
sizes = range(1000, 11000, 1000)
runtimes_fftconvolution = []
cycles = 20

for N in sizes:
    # Your turn ...

    runtimes_fftconvolution.append( ... )

plt.plot(sizes, runtimes_convolution, '.-', label='Discrete convolution')
plt.plot(sizes, runtimes_fftconvolution, '.-', label='Fast convolution')
plt.xlabel('Size N')

Performing a Fast Fourier Transform requires all input samples to be available at the time the transform is executed. The same holds for the inverse transform. Thus, the fast convolution technique explained above requires all input data to be available in advance. This makes it inapplicable for a real-time processing.

Real-time audio processing is done block-by-block. Here, the signals are recorded, processed and played back in blocks of a fixed number of samples. The input signal $x(n)$ and output signal $y(n)$ are continuous and of indefinite length. In essence they are partitioned into blocks. Convolution techniques that work in these split operands are called partitioned convolutions.

The signal input signal $x(n)$ consist of a continous stream of blocks $x^{(0)}(n), x^{(1)}(n), \dotsc$, each of length $B$. Superscripts indicate the index of the blocks. The block length $B$ (also known as audio buffer size) allows to control the audio latency. Usual block lengths for live effects and Virtual Reality systems are in range of 32, 64 or 128 samples. Less time-critical applications, such as media players or computer games, use lengths of 256 to 2048 samples. As the filtering is now realized on this continous stream of sample blocks, it is called running convolution. Note, that both, the input and output, are processed in length-$B$ blocks. This means, that the output signal $y(n)$ is also represented by a stream of blocks $y^{(0)}(n), y^{(1)}(n), \dotsc$, each of length $B$. Convolution result exceeding $B$ samples must be returned in later output blocks.

How we can we modify or adapt the fast convolution technique to apply for a real-time filtering as well? This becomes possible by two concepts, known as Overlap-Add and Overlap-Save. Following, we will examine the Overlap-Add technique.

4.1 Overlap-Add

One possibility is to filter each length-$B$ block with the length-$N$ impulse response. This will produce a convolution result longer than $B$. The overlapping samples are added up in a buffer and returned in subsequent output blocks. Hence the name, Overlap-Add (OLA). The algorithm is shown below

Lab Experiment 6
Implement the OLA convolution algorithm in the predefined function below.
As before, debug and test it with the two sequences you convolved by hand, until you verified that it produces correct results.
In [ ]:
def OLA_convolution(x, h, B, K=None, verbose=False):
    """Overlap-Add convolution of x and h with block length B"""

    M = len(x)
    N = len(h)

    # Calculate the number of input blocks
    num_input_blocks = ...

    # Pad x to an integer multiple of B
    xp = zeropad(x, num_input_blocks*B)

    # Your turn ...

    # Convolve all blocks
    for n in range(num_input_blocks):
        # Extract the n-th input block
        xb = ...
        if verbose: print("xb(%d) = " % n, xb)

        # Fast convolution
        u = ...
        if verbose: print("u(%d) = " % n, u)

        # Overlap-Add the partial convolution result
        y[...] += u[...]

    return y

4.2 Overlap-Save

The other possibility for realizing a running convolution is known as Overlap-Save (OLS) scheme. Here, the idea is to directly produce a ready-made output block of length-$B$ and avoid overlapping samples (additions). Recall, that any length-$B$ output block contains "decay" from previous input blocks. In the OLS scheme we will setup the input to the convolution in a way, that the output already contains The response of the latest block plus the decay of previous blocks. The algorithm is shown below.

The length-$K$ FFT of the input is fed with the $K$ latest input samples of $x(n)$. In other words, when a new block $x^{(i)}(n)$ arrives, we shift the input buffer to the left by $B$ samples and place the new samples rightmost.

This is known as a sliding window of the input. The filter impulse response is processed in the same way as for the OLA scheme.

After what we learned before, it becomes clear that we now perform a $K\times N$ convolution with a transform length of $K$. Hence time-aliasing occurs and excess samples will result in time-aliasing as explained above. However, this is acceptable, as the right-most $B$ samples are alias-free and valid output values. We can use/output them directly, without any additions, as for the Overlap-Add scheme.

Lab Experiment 7
Implement the OLS convolution algorithm in the predefined function below.
As before, debug and test it with the two sequences you convolved by hand, until you verified that it produces correct results

Hints: The NumPy functions roll is helpful to realize the sliding window
In [ ]:
def OLS_convolution(x, h, B, K=None, verbose=False):
    """Overlap-Save convolution of x and h with block length B"""

    M = len(x)
    N = len(h)

    # Calculate the number of input blocks
    num_input_blocks = ...

    # Pad x to an integer multiple of B
    xp = zeropad(x, num_input_blocks*B)

    # Your turn ...

    # Input buffer
    xw = np.zeros((K,))

    # Convolve all blocks
    for n in range(num_input_blocks):
        # Extract the n-th input block
        xb = xp[n*B:n*B+B]
        if verbose: print("xb(%d) = " % n, xb)

        # Sliding window of the input
        # Your turn ...
        if verbose: print("xw(%d) = " % n, xw)

        # Fast convolution
           # Your turn ...
        u = ...
        if verbose: print("u(%d) = " % n, u)

        # Save the valid output samples
        y[...] = u[...]

    return y

4.3 Computational performance

Finally, at the end of this exercise, let us measure the performance of partitioned convolutions and compare them to the other techniques. Therefore, you can use a practical example.

Lab Experiment 8
Filter the audio file "files/music.wav" with the lowpass filter impulse response "files/lowpass.wav".
Measure the runtime of all algorithms you have implemented so far.
For the partitioned convolution techniques assume two block lengths:
128 samples for a low-latency and 1024 samples for a moderate latency.

What do you observe?

Important: Again, perform every measurement three times and compute the average value to obtain representative numbers.
In [ ]:
# Your turn ..

Thank you for participating in this exercise. Feedback on this exercise is welcome in order to improve it. For further information on the topic the following literature is suggested:


Frank Wefers
Partitioned convolution algorithms for real-time auralization (PhD thesis)
ISBN: 978-3-8325-3943-6
Logos Verlag, Berlin, 2015
Online available here

Matteo Frigo and Steven G. Johnson
The Design and Implementation of FFTW3
Proceedings of the IEEE 93 (2), 216–231 (2005).
Online available here


The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universitaet Erlangen-Nuernberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS.