GPU_FFT

GPU_FFT release 3.0 is a Fast Fourier Transform library for the Raspberry Pi which exploits the BCM2835 SoC GPU hardware to deliver ten times more data throughput than is possible on the 700 MHz ARM of the original Raspberry Pi 1. Kernels are provided for all power-of-2 FFT lengths between 256 and 4,194,304 points inclusive. Accuracy has been significantly improved, without compromising performance, using a novel floating-point precision enhancement technique.

Raspberry Pi Foundation CEO Eben Upton attended the Radio Society of Great Britain (RSGB) 2013 Annual Convention, where radio amateurs told him that the FFT performance of the Pi's 700 MHz ARM limited its usefulness in Software Defined Radio (SDR) applications. That was the impetus which led me to develop GPU_FFT; and I wish to thank Eben Upton, Dom Cobley, Tim Mamtora, Gary Keall and Dave Emett for helping.

GPU_FFT runs on the quad processing units (QPUs) in the BCM2835 V3D block. There is a huge amount of interest in general-purpose programming of graphics hardware; and, in February 2014, Broadcom gave the Pi community a fantastic present by publishing the Videocore IV V3D Architecture Specification. Now, anybody can develop code for the GPU and some pioneers already have. VC4ASM is a full-featured macro assembler by Marcel Müller.

Similar throughput is now achievable using NEON instructions on the Pi 2 ARM v7; however, GPU_FFT remains useful on the Pi 1 and as an example of general-purpose GPU (GPGPU) programming.

Getting started

This article is about how GPU_FFT works; but first a few pointers on how to use it. GPU_FFT is distributed with the Raspbian operating system. The following commands build and run the hello_fft demo:

cd /opt/vc/src/hello_pi/hello_fft
make
sudo ./hello_fft.bin 12

These files contain all the information you need to use GPU_FFT:

hello_fft.c      Demo source code
hello_fft_2d.c   Demo source code (2D)
gpu_fft.txt      User Guide

The hello_fft demo isn't very spectacular. It just computes some performance figures. The newly released hello_fft_2d demo is a bit more visual:

make hello_fft_2d.bin
sudo ./hello_fft_2d.bin

It generates an image (not reproduced here). Now back to how GPU_FFT works ...

Bit reversal

Two ways of implementing the classic radix-2 Cooley-Tukey FFT algorithm are: decimation in time (DIT) and decimation in frequency (DIF). Both yield the same result; they differ in the ordering of inputs and outputs. The choice is either to read inputs or write outputs in bit-reversed-ordinal sequence. Neither option is attractive when transforming data in SDRAM because memory controllers transfer blocks of consecutive addresses in bursts. Randomly accessing memory in the way bit-reversal dictates is expensive if there is more data than will fit in cache.

[Flow graphs omitted: eight-point decimation in time (left) and decimation in frequency (right)]

ord   rev
000   000
001   100
010   010
011   110
100   001
101   101
110   011
111   111

Butterflies

The eight-point transforms depicted above have three so-called butterfly stages. In general, an N-point FFT is computed in log2(N) stages by the radix-2 algorithm. A naïve implementation might transfer data to and from SDRAM at every stage; but to do so log2(N) times is inefficient. Ideally, we would perform multiple stages of computation in registers and minimise memory access. FFT performance on modern hardware is usually limited by memory bandwidth, not floating-point computation.

Say we could accommodate four complex numbers in registers. Looking at the above DIT data flow graph, if we loaded x(0), x(4), x(2) and x(6) from memory, we could compute the first two butterfly stages in registers, temporarily save the result to memory, and do the same for the odd-numbered ordinals 1, 5, 3 and 7; but then we hit a problem. The inputs to the third butterfly stage do not fit in registers, and it gets worse with more stages. We will always run out of registers eventually, even if we have quite a lot, as the BCM2835 QPU does.
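
For reference, a single radix-2 DIT butterfly written as scalar C (illustrative only; GPU_FFT evaluates these on 16-wide QPU vectors):

    #include <complex.h>

    /* One radix-2 DIT butterfly: a and b are two partial results and
     * w is the twiddle factor exp(-2*pi*I*k/N). */
    static void butterfly(float complex *a, float complex *b, float complex w)
    {
        float complex t = *b * w;
        *b = *a - t;
        *a = *a + t;
    }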

Bit rotation

Compare these 256-point DIT FFT data flow graphs:

DIT FFT 256 classic (PDF)
DIT FFT 256 practical (PDF)

The classic data flow graph illustrates the implementational challenge mentioned above: the number of inputs to the butterflies doubles at each stage. In the practical flow, extra "twists" (re-orderings) have been inserted after the fourth and final stages; no butterfly is then a function of more than sixteen inputs; all computation can be performed in registers using an FFT_16 codelet; and SDRAM is only read/written twice. The extra twists are bit-rotations, not reversals: ordinals are circularly shifted by four bits. Doing so twice in an eight-stage transform leaves outputs in the correct positions. In the classic flow, butterfly wing span doubles from 16 to 32 between the 4th and 5th stages; rotating ordinals by four bits drops the wing span back down to 2 again in the practical flow.
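
The rotation itself is trivial to express; here is a C sketch for the 256-point case (8-bit ordinals; a hypothetical helper rather than GPU_FFT code):

    /* Circularly rotate the low 8 bits of ordinal x by 4 places.
     * Applying this twice returns every 8-bit ordinal to its original
     * position, which is why the outputs land where they should. */
    static unsigned rot8(unsigned x)
    {
        return ((x >> 4) | (x << 4)) & 0xff;
    }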

The above PDF files were generated by a C program, Butter, which was written to visualize how twiddles change as we step through the data. You will also need libHaru to build it, or you can run the zipped executable if you trust me. I promise it is safe. The zipped code generates a 4096-point "practical" flow graph; but the #defines can be changed for other lengths.

Architecture

We would prefer to access memory "linearly" in increasing address order; however, rotations and reversals require "scattered" accesses. DIT unavoidably begins by reading data from bit-reversed-ordinal positions. Then there are two alternative ways to do the rotations:

    read inputs from bit-rotated positions through the TMU; or
    write outputs to bit-rotated positions through the VPM/VDW.

GPU_FFT takes the second road because the VPM/VDW are good at bit-rotated writes and the same write code can then be used for both passes. But why DIT? Why not DIF? Because a bit-reversed write would be slow.

End-to-end data path through V3D hardware on Pi 1:

[Diagram omitted: SDRAM → TMU → QPUs → VPM → VDW → SDRAM]

The TMU (texture and memory lookup unit) is good at fetching contiguous or scattered data from SDRAM, although it was not specifically optimised for bit-reversed reads. The TMU cache is direct-mapped, aliasing on 4k boundaries. Unfortunately, bit-reversal of consecutive ordinals makes the upper address bits toggle, causing collisions and evictions! The VC4 L2 cache can only be used on the Pi 1, and must be avoided there so that GPU_FFT and OpenGL can coexist. The ARM does not access SDRAM through the VC4 L2 cache on the Pi 2. It is safe for VDW output to bypass the V3D internal caches, since they are smaller than the ping-pong data buffers.

Parallelism

The BCM2835 V3D QPU (quad processing unit) is a 16-way SIMD (single instruction, multiple data) processor. Registers and data paths are 16*32 bits wide, representing vectors of 16 values. Instructions can act upon all 16 values in parallel or, optionally, on selected values; and the 16 can be rotated within a vector. These features are exploited in GPU_FFT to implement a very fast FFT_16 codelet. Working with complex numbers, only two registers are needed to store 16 real and 16 imaginary parts. The TMU has a convenient programming interface which accepts a vector of 16 potentially scattered addresses and returns 16 words from SDRAM.

BCM2835 V3D has 12 QPUs; but GPU_FFT only uses 8. It happens that bit-rotated writes can be done very neatly, through the VPM and VDW, by 8 QPUs working in parallel. One is master and 7 are slaves, synchronised using semaphores. QPUs can access the VPM (vertex pipeline memory) either horizontally or vertically. Each QPU outputs real and imaginary vectors to VPM columns. The master then writes VPM rows to SDRAM through the VDW as a series of bursts, with a stride between rows:

         QPU0                 QPU1          ...         QPU6                  QPU7
re i+0   im i+0     re i+16  im i+16    . . .   re i+96   im i+96     re i+112  im i+112    base
re i+1   im i+1     re i+17  im i+17    . . .   re i+97   im i+97     re i+113  im i+113    base + stride
re i+2   im i+2     re i+18  im i+18    . . .   re i+98   im i+98     re i+114  im i+114    base + stride*2
re i+3   im i+3     re i+19  im i+19    . . .   re i+99   im i+99     re i+115  im i+115    base + stride*3
   .        .          .        .       VPM        .          .           .         .          .
re i+12  im i+12    re i+28  im i+28    . . .   re i+108  im i+108    re i+124  im i+124    base + stride*12
re i+13  im i+13    re i+29  im i+29    . . .   re i+109  im i+109    re i+125  im i+125    base + stride*13
re i+14  im i+14    re i+30  im i+30    . . .   re i+110  im i+110    re i+126  im i+126    base + stride*14
re i+15  im i+15    re i+31  im i+31    . . .   re i+111  im i+111    re i+127  im i+127    base + stride*15

The VDW write setup STRIDE is specified as a 13-bit field on page 59 of the Videocore IV V3D Architecture Specification, but luckily the hardware actually supports 16 bits. The value specified in the setup is 64 bytes less than the true stride shown above, because the internal address register to which it is added has already advanced by 16*4 bytes.
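
In other words (a sketch under the assumptions just stated; the function name is hypothetical):

    /* Value for the VDW setup STRIDE field: the true row stride less
     * the 16*4 = 64 bytes by which the internal address register has
     * already advanced.  Must fit in 16 bits (see text). */
    static unsigned vdw_stride_field(unsigned true_stride_bytes)
    {
        return true_stride_bytes - 64;
    }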

Codelets

The QPUs are kept in lock-step. In the FFT-256 example, each QPU consumes 16 points of data per step; together they process 16*8 = 128 points. Only two such steps per pass, and two passes, are required to complete the entire FFT-256. Longer transforms require more steps and more passes, and GPU_FFT also uses FFT-32 and FFT-64 codelets, as follows:

N        log2(N)  Passes  Structure †
256      8        2       4+4
512      9        2       5+4
1024     10       2       5+5
2048     11       2       6+5
4096     12       3       4+4+4
8192     13       3       5+4+4
16384    14       3       5+5+4
32768    15       3       5+5+5
65536    16       3       6+5+5
128K     17       4       5+4+4+4
256K     18       4       5+4+4+5
512K     19       4       5+4+5+5
1M       20       4       5+5+5+5
2M       21       4       6+5+5+5
4M       22       4       6+6+5+5

† 4 = FFT-16; 5 = FFT-32; 6 = FFT-64

Minimising the number of passes minimises SDRAM access.

log2N=18 and longer transforms require VDW write strides exceeding 65535 bytes and must therefore issue multiple single-row VDW write commands.

5+4+4+5 has lower root-mean-square (rms) error than 5+5+4+4.

128K and 256K could be done as 6+6+5 and 6+6+6. When these shaders were originally written, the FFT-64 codelet could only be used in the first pass. Now it can be used in any pass and they could be optimised.

Complex arithmetic

All floating-point arithmetic in GPU_FFT is on complex numbers. This code performs 16 complex multiplications in parallel:
    #
    # complex p(r0, r1) = complex a(r0, r1) * complex b(ra_re, rb_im)
    #
    nop; fmul r3, r1, rb_im # ii = a.im*b.im
    nop; fmul r2, r0, ra_re # rr = a.re*b.re
    nop; fmul r1, r1, ra_re # ir = a.im*b.re
    nop; fmul r0, r0, rb_im # ri = a.re*b.im
    fadd r1, r0, r1         # p.im = ri + ir
    fsub r0, r2, r3         # p.re = rr - ii
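
For one scalar lane, the equivalent C is the familiar four-multiply, two-add complex product:

    typedef struct { float re, im; } cplx;

    /* p = a * b, mirroring the QPU sequence above: rr - ii, ri + ir. */
    static cplx cmul(cplx a, cplx b)
    {
        cplx p;
        p.re = a.re * b.re - a.im * b.im;
        p.im = a.re * b.im + a.im * b.re;
        return p;
    }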

Twiddles

Twiddle factors are unit vectors in the complex plane, which "rotate" as we work through the data. Rotation in this context is the angular not the bit-shift variety! The size and frequency of the rotations varies between passes. Twiddles in the final pass change on every loop. Twiddles in the first pass do not change at all. The general rule for a 4-pass transform with codelet structure a+b+c+d is as follows:

Pass               1    2      3    4
Codelet            a    b      c    d
Rotation (Δk)      -    c.d    d    1
Interval (points)  -    b.c.d  c.d  d

This 3-pass FFT-4096 flow graph helped me visualize the subtleties:

  DIT FFT 4096 (6MB zip of 33MB pdf) or download Butter.zip and run the executable.

Twiddle factors are applied to data within the codelets at the points indicated on the data flow. Each integer, k, represents a complex multiplication by the twiddle factor exp(−j·2πk/N), where N is the transform length.

Ideally, twiddles should be calculated to a higher precision than the data; but the QPU only supports 32-bit single-precision floats. Trigonometric recurrences are used to minimise loss of precision:

    cos(θ+δ) = cosθ − (α·cosθ + β·sinθ)
    sin(θ+δ) = sinθ − (α·sinθ − β·cosθ)

where α = 2·sin²(δ/2) and β = sinδ.
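
In C, assuming the standard α, β formulation above (GPU_FFT's actual implementation lives in the load_tw and rotate macros):

    /* Advance (c, s) = (cos θ, sin θ) to θ + δ without calling cosf/sinf
     * inside the loop.  alpha = 2*sin(delta/2)^2, beta = sin(delta). */
    static void rotate_step(float *c, float *s, float alpha, float beta)
    {
        float cn = *c - (alpha * *c + beta * *s);
        float sn = *s - (alpha * *s - beta * *c);
        *c = cn;
        *s = sn;
    }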

Unfortunately, the QPU always rounds down (not to the nearest LSB) and significant rounding errors accumulated, seriously compromising accuracy in releases 1.0 and 2.0 of GPU_FFT. This has been addressed in release 3.0 using a precision enhancement technique based on the work of Dekker [2] and Thall [3] which corrects rounding errors, improving accuracy by almost three orders of magnitude in the longer transforms!

Thall and Dekker use the "unevaluated sum" of two single-precision floats to represent one double-precision quantity and provide primitives for addition, subtraction and multiplication. To avoid a performance hit, only the cosθ and sinθ terms are maintained at higher precision; and only the final two subtractions of the recurrence are evaluated using Thall's primitives. Two QPU instructions per iteration became twenty-six.
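
A simplified C illustration of the idea (not GPU_FFT's exact df64 code): represent one quantity as an unevaluated sum of two floats and capture the rounding error of each operation exactly:

    /* An unevaluated sum hi + lo standing in for one higher-precision value. */
    typedef struct { float hi, lo; } df64;

    /* Knuth's two-sum: s + e equals a + b exactly, despite rounding. */
    static void two_sum(float a, float b, float *s, float *e)
    {
        *s = a + b;
        float bb = *s - a;
        *e = (a - (*s - bb)) + (b - bb);
    }

    /* Enhanced-precision subtraction in the style of df64_sub32:
     * subtract both parts, then renormalise the error into the pair. */
    static df64 df64_sub(df64 a, df64 b)
    {
        float s, e;
        two_sum(a.hi, -b.hi, &s, &e);
        e += a.lo - b.lo;
        two_sum(s, e, &s, &e);
        return (df64){ s, e };
    }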

Real and imaginary parts are stored at offsets from ra_tw_re and rb_tw_im respectively:

Codelet   Offset         Description
FFT-16    TW16_Pn_STEP   per-step α, β rotation for nth pass
FFT-32    TW32_Pn_STEP
FFT-64    TW48_Pn_STEP
          TW64_Pn_STEP
FFT-16    TW16+0         1st stage unpacked twiddles
          TW16+1         2nd stage unpacked twiddles
          TW16+2         3rd stage unpacked twiddles
          TW16+3, 4      high-precision cosθ, sinθ
FFT-32    TW32+0, 1
FFT-64    TW48+0, 1
          TW64+0, 1

STEP values and starting values for cosθ, sinθ are loaded from memory addresses passed in through uniforms. Each of the 8 QPUs has unique (rx_tw_unique) starting values in the final pass. Other values are common (rx_tw_shared) to all shaders.

Twiddle management macros:

load_tw            Load α, β and starting cosθ, sinθ
body_fft_16        FFT-16 codelet
fft_twiddles_32    FFT-32 codelet helper
unpack_twiddles    Updates TW16
rotate             Rotate one step
next_twiddles_16   Rotate and unpack twiddles for the next step
next_twiddles_32
df64_sub32         Enhanced-precision subtract

Using TW#i notation to denote the i-th element of a vector, twiddles are packed in the TW16 and TW32 registers.

Element #0 of the TW16 registers is not used. Twiddles are "unpacked" to suit the 4 stages of the FFT-16 codelet and conditional flags control which elements are multiplied at each stage:

.macro body_fft_16
.rep i, 4
    and.setf -, elem_num, (1<<i)   # flag the elements whose ordinal has bit i set
    # ... conditional twiddle multiply and butterfly for stage i ...

TW16+0 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1
TW16+1 TW16#0 TW16#1 TW16#2 TW16#3 TW16#0 TW16#1 TW16#2 TW16#3 TW16#0 TW16#1 TW16#2 TW16#3 TW16#0 TW16#1 TW16#2 TW16#3
TW16+2 TW16#0 TW16#1 TW16#2 TW16#3 TW16#4 TW16#5 TW16#6 TW16#7 TW16#0 TW16#1 TW16#2 TW16#3 TW16#4 TW16#5 TW16#6 TW16#7
TW16+3 TW16#0 TW16#1 TW16#2 TW16#3 TW16#4 TW16#5 TW16#6 TW16#7 TW16#8 TW16#9 TW16#10 TW16#11 TW16#12 TW16#13 TW16#14 TW16#15
TW32 TW32#0 TW32#1 TW32#2 TW32#3 TW32#4 TW32#5 TW32#6 TW32#7 TW32#8 TW32#9 TW32#10 TW32#11 TW32#12 TW32#13 TW32#14 TW32#15

TMU read pipeline

Read-related macros:

read_rev      Queue TMU request for bit-reversed ordinal read
read_lin      Queue TMU request for linear (sequential) ordinal read
load_rev      Call read_* to queue another request and wait for response from previous
load_lin
interleave    Swap request/response elements to improve cache efficiency
swizzle
bit_rev       Bit-reversal primitive

At least one request is always queued to avoid the performance penalty of TMU pipeline stalls. One too many requests are issued and one response must be thrown away after the loop. Pseudo-code:

    #######################################
    # Pass 1

    read_rev
    foreach step {
        load_rev
        ...
    }
    ldtmu0 # Wait for last (unwanted) response
    ldtmu0

    #######################################
    # Pass 2

    read_lin
    foreach step {
        load_lin
        ...
    }
    ldtmu0 # Wait for last (unwanted) response
    ldtmu0
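
The same pattern in C, with hypothetical issue_read/wait_read stubs standing in for the TMU request and ldtmu0:

    #include <stdio.h>

    static void issue_read(int i) { printf("request %d\n", i); }   /* stub */
    static int  wait_read(void)   { static int n; return n++; }    /* stub */

    int main(void)
    {
        const int steps = 4;
        issue_read(0);                    /* read_rev / read_lin     */
        for (int i = 0; i < steps; i++) {
            issue_read(i + 1);            /* queue one request ahead */
            int data = wait_read();       /* response for step i     */
            printf("process %d\n", data);
        }
        (void)wait_read();                /* discard extra response  */
        return 0;
    }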

Write subtleties

VPM and VDW write macros:

Macro                   Caller   Description
inst_vpm                Both     Setup VPM column write commands for this QPU instance (0...7)
write_vpm_16            Both     Write codelet output to VPM columns;
write_vpm_32                     FFT_32 and FFT_64 final butterfly stage done here
write_vpm_64
body_ra_save_16         Master   write_vpm_*, sync with slaves and queue VDW write
body_ra_save_32
body_ra_save_64
body_ra_sync            Master   Sync with slaves
body_rx_save_slave_16   Slaves   write_vpm_* and sync with master
body_rx_save_slave_32
body_rx_save_slave_64
body_rx_sync_slave      Slaves   Sync with master

FFT_16 and FFT_32 codelet VPM writes are double-buffered, allowing the slaves to proceed into the next step without waiting for the VDW write. FFT_16 alternates between the first 16 and second 16 VPM rows. FFT_32 alternates between the first 32 and second 32 VPM rows. FFT_64 is not double-buffered because user shaders can only use the first 64 rows. Note how the order of semaphore instructions in the FFT_64 body_*_save_* macros differs from the others. Notice also the use of VPM readbacks to ensure preceding writes have completed; and the postponement of vw_wait until the last moment.

Mailbox

GPU_FFT uses a kernel character device known as the "mailbox" for communication between the ARM and the BCM2835 VPU. The VPU, which runs start.elf from the SD card boot partition, is a powerful (and on the Pi under-used) vector processor that manages the Videocore hardware. GPU_FFT uses two mailbox API functions:
unsigned qpu_enable (int file_desc, unsigned enable);
unsigned execute_qpu(int file_desc, unsigned num_qpus, unsigned control, unsigned noflush, unsigned timeout);
The first enables or disables the GRAFX (V3D) power domain and opens a V3D driver handle for running user shaders. The second executes one or more shaders on the QPUs.
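
A sketch of the calling sequence, assuming the mailbox.h helpers (mbox_open, mbox_close) distributed with hello_fft; consult gpu_fft.txt for authoritative usage:

    #include "mailbox.h"

    /* Run a pre-built shader once.  'control' is the bus address of the
     * per-QPU (uniforms, code) pointer pairs already placed in GPU memory. */
    unsigned run_shader(unsigned control)
    {
        int mb = mbox_open();             /* opens the mailbox; needs root */
        qpu_enable(mb, 1);                /* power up GRAFX (V3D) domain   */
        unsigned ret = execute_qpu(mb, 8, control, 1 /* noflush */,
                                   10000 /* timeout */);
        qpu_enable(mb, 0);                /* power down when finished      */
        mbox_close(mb);
        return ret;
    }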

Unfortunately, calling through the mailbox involves a kernel context switch, incurring a performance penalty of roughly 100 µs. This is not an issue for qpu_enable(), which need only be called twice; however, it is a significant overhead on execute_qpu() when computing short transforms. This is why GPU_FFT 1.0 allowed batches of transforms to be executed with a single mailbox call. GPU_FFT 3.0 still supports batch mode; but, when the expected shader runtime is below a #defined threshold, it avoids the mailbox overhead by poking V3D hardware registers directly from the ARM side and busy-waiting for shader completion.

V3D hardware registers are documented in the Videocore IV V3D Architecture Specification.

Currently, programs which use the mailbox must run as root; however, Marcel Müller is developing vcio2, a kernel driver to access the Raspberry Pi VideoCore IV GPU without root privileges.

The mailbox special character device file has moved from ./char_dev to /dev/vcio in the latest release.

Alternative algorithms

Instead of calculating twiddle factors on-the-fly using recurrences, a memory look-up table can be used. This may be a full cycle, or just a single quadrant in size. The latter occupies less memory; but requires more pre- and post-look-up processing. Both consume TMU memory bandwidth. Performance of the log2N=20 Cooley-Tukey shader with twiddle look-up was investigated.
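
A sketch of the quadrant approach (table size and names are hypothetical): one quarter-cycle sine table serves for both sine and cosine at the cost of index and sign fix-ups:

    #include <math.h>

    #define QUAD 1024                      /* entries per quarter cycle */
    static float sin_tab[QUAD + 1];        /* sin(0) .. sin(pi/2)       */

    static void init_tab(void)
    {
        for (int i = 0; i <= QUAD; i++)
            sin_tab[i] = sinf(i * 3.14159265358979f / (2 * QUAD));
    }

    /* sin(2*pi*k/N) via one quadrant; n4 = N/4 and must divide QUAD. */
    static float sin_lookup(unsigned k, unsigned n4)
    {
        unsigned q = (k / n4) & 3, idx = (k % n4) * (QUAD / n4);
        float s = (q & 1) ? sin_tab[QUAD - idx] : sin_tab[idx];
        return (q & 2) ? -s : s;
    }

cos(2πk/N) then comes from the same table as sin_lookup(k + N/4, n4), since the quadrant index wraps.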

Stockham autosort FFT algorithms access data in "natural" order, avoiding the bit-reversal or "shuffle" step required by Cooley-Tukey. Radix-2 Stockham algorithms perform N-point transforms in log2N passes, effectively shuffling indices 1-bit per pass as they go. The number of passes can be reduced with higher radix codelets. Based on work published by Microsoft Research [1], a mixed-radix Stockham routine with twiddle look-up was prototyped, first in "C" and then in QPU assembler.

The above experiments produced modest accuracy improvements; but throughput was disappointing and performance counters implicated TMU cache aliasing.

References

1. High Performance Discrete Fourier Transforms on Graphics Processors, Microsoft Research.
2. T. J. Dekker, A Floating-Point Technique for Extending the Available Precision, 1971.
3. Andrew Thall, Extended-Precision Floating-Point Numbers for GPU Computation, Alma College.
