GPU_FFT


GPU_FFT release 2.0 BETA is a Fast Fourier Transform library for the Raspberry Pi which exploits the BCM2835 SoC GPU hardware to deliver ten times more data throughput than is possible on the 700 MHz ARM. Kernels are provided for all power-of-2 FFT lengths between 256 and 1,048,576 points inclusive. GPU_FFT release 1.0 supported up to 131,072 points.

Raspberry Pi Foundation CEO Eben Upton attended the Radio Society of Great Britain (RSGB) 2013 Annual Convention, where radio amateurs told him that the FFT performance of the 700 MHz ARM limited the Pi's usefulness in Software Defined Radio (SDR) applications. That was the impetus which led me to develop GPU_FFT; and I wish to thank Eben Upton, Dom Cobley, Tim Mamtora, Gary Keall and Dave Emett for their help.

GPU_FFT runs on the quad processing units (QPUs) in the BCM2835 V3D block. There is a huge amount of interest in general-purpose programming of graphics hardware; and, in February 2014, Broadcom gave the Pi community a fantastic present by publishing the Videocore IV V3D Architecture Specification. Now, anybody can develop code for the GPU and some pioneers already have.


Getting started

This article is about how GPU_FFT works; but first a few pointers on how to use it. GPU_FFT is distributed with the Raspbian operating system. The following commands build and run the hello_fft demo:

cd /opt/vc/src/hello_pi/hello_fft
make
sudo mknod char_dev c 100 0
sudo ./hello_fft.bin 12

These files contain all the information you need to use GPU_FFT:

hello_fft.c       Demo source code
hello_fft_2d.c    Demo source code
gpu_fft.txt       User Guide
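
For orientation, the calling sequence used by hello_fft.c boils down to a few functions declared in gpu_fft.h. A minimal sketch follows; treat it as illustrative and see gpu_fft.txt for the authoritative details:

    #include "mailbox.h"
    #include "gpu_fft.h"

    /* Minimal sketch of the hello_fft.c calling sequence. */
    int demo(void)
    {
        struct GPU_FFT *fft;
        int mb = mbox_open();

        /* 2^12 = 4096-point forward transforms, 1 job per batch */
        int ret = gpu_fft_prepare(mb, 12, GPU_FFT_FWD, 1, &fft);
        if (ret) return ret;                 /* non-zero on failure */

        fft->in[0].re = 1.0f;                /* fill fft->in[0 .. 4095] */
        fft->in[0].im = 0.0f;

        gpu_fft_execute(fft);                /* blocks until the QPUs finish */
        /* results are now in fft->out[0 .. 4095] */

        gpu_fft_release(fft);                /* free GPU memory */
        return 0;
    }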

The hello_fft demo isn't very spectacular. It just computes some performance figures. The newly released hello_fft_2d demo is a bit more visual:

make hello_fft_2d.bin
sudo ./hello_fft_2d.bin

It generated the above image. Now back to how GPU_FFT works ...

Bit reversal

Two ways of implementing the classic radix-2 Cooley-Tukey FFT algorithm are: decimation in time (DIT) and decimation in frequency (DIF). Both yield the same result; they differ in the ordering of inputs and outputs. The choice is either to read inputs or write outputs in bit-reversed-ordinal sequence. Neither option is attractive when transforming data in SDRAM because memory controllers transfer blocks of consecutive addresses in bursts. Randomly accessing memory in the way bit-reversal dictates is expensive if there is more data than will fit in cache.

Decimation in time     Decimation in frequency
(8-point data flow graphs)

ord   rev
000   000
001   100
010   010
011   110
100   001
101   101
110   011
111   111
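
The reordering can be computed with a short helper; a sketch in C, which reproduces the ord/rev table above when run for 3 bits:

    #include <stdio.h>

    /* Reverse the low 'bits' bits of an ordinal. */
    static unsigned bit_reverse(unsigned x, unsigned bits)
    {
        unsigned r = 0;
        for (unsigned i = 0; i < bits; i++)
            r = (r << 1) | ((x >> i) & 1);
        return r;
    }

    int main(void)
    {
        for (unsigned i = 0; i < 8; i++) {
            unsigned r = bit_reverse(i, 3);
            printf("%u%u%u -> %u%u%u\n",
                   (i >> 2) & 1, (i >> 1) & 1, i & 1,
                   (r >> 2) & 1, (r >> 1) & 1, r & 1);
        }
        return 0;
    }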
 

Butterflies

The eight-point transforms depicted above have three so-called butterfly stages. In general, an N-point FFT is computed in log2(N) stages by the radix-2 algorithm. A naïve implementation might transfer data to and from SDRAM at every stage; but doing so log2(N) times is very inefficient. Ideally, we would perform multiple stages of computation in registers and minimise memory accesses. FFT performance on modern hardware is usually limited by memory bandwidth, not floating-point computation.

Say we could accommodate four complex numbers in registers. Looking at the above DIT data flow graph, if we loaded x(0), x(4), x(2) and x(6) from memory, we could compute the first two butterfly stages in registers, temporarily save the result to memory, and do the same for the odd-numbered ordinals 1, 5, 3 and 7; but then we hit a problem. The inputs to the third butterfly stage do not all fit in registers, and it gets worse with more stages. We will always run out of registers eventually, even if we have quite a lot of them, as the BCM2835 QPU does.
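
A minimal sketch in C of those first two stages, assuming the even-ordinal half x(0), x(4), x(2), x(6) has already been loaded into four "registers" (this illustrates the arithmetic only; it is not GPU_FFT code):

    #include <complex.h>

    /* First two butterfly stages of an 8-point DIT FFT, held entirely in
       four complex variables.  Together they form the 4-point DFT of
       x0, x2, x4, x6. */
    void first_two_stages_even(float complex x0, float complex x4,
                               float complex x2, float complex x6,
                               float complex out[4])
    {
        /* Stage 1: span-1 butterflies, twiddle W^0 = 1 */
        float complex a0 = x0 + x4, a1 = x0 - x4;
        float complex a2 = x2 + x6, a3 = x2 - x6;

        /* Stage 2: span-2 butterflies, twiddles W8^0 = 1 and W8^2 = -i */
        out[0] = a0 + a2;
        out[2] = a0 - a2;
        out[1] = a1 + (-I) * a3;
        out[3] = a1 - (-I) * a3;
    }

The third stage would need all eight intermediate values at once, which is exactly the point where the registers run out.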

Bit rotation

Compare these 256-point DIT FFT data flow graphs:

  DIT FFT 256 classic
DIT FFT 256 practical

The classic data flow graph illustrates the above-mentioned implementational challenge: the number of inputs to the butterflies doubles at each stage. In the practical flow, extra "twists" (re-orderings) have been inserted after the fourth and final stages; no butterfly is then a function of more than sixteen inputs; all computation can be performed in registers using an FFT_16 codelet; and SDRAM is only read/written twice. The extra twists are bit-rotations, not reversals. Ordinals are circularly shifted by four bits. Doing so twice in an eight-stage transform leaves outputs in the correct positions. Butterfly wing span doubles from 16 to 32 between the 4th and 5th stages of the classic flow; rotating ordinals by four bits drops the wing span back down to 2 again in the practical flow.
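
A sketch of the bit rotation itself (the direction of the shift is an assumption; the point is that rotating an 8-bit ordinal by four bits twice returns it to where it started):

    /* Circularly rotate the low 'bits' bits of an ordinal (0 < shift < bits). */
    static unsigned rotate_ordinal(unsigned x, unsigned bits, unsigned shift)
    {
        unsigned mask = (1u << bits) - 1;
        return ((x >> shift) | (x << (bits - shift))) & mask;
    }
    /* Example: rotate_ordinal(0x2C, 8, 4) == 0xC2; rotating again gives 0x2C. */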

The above PDF files were generated by a C program, Butter, which was written to visualize how twiddles change as we step through the data. You will also need libHaru to build it, or you can run the zipped executable if you trust me; I promise it is safe. The zipped code generates a 4096-point "practical" flow graph, but the #defines can be changed for other lengths.

Architecture

We would prefer to access memory "linearly", in increasing address order; however, rotations and reversals require "scattered" accesses. DIT unavoidably begins by reading data from bit-reversed-ordinal positions. That leaves two alternative ways to do the rotations: scatter the reads, fetching each subsequent pass's inputs from bit-rotated positions; or scatter the writes, storing each pass's outputs to bit-rotated positions.

GPU_FFT takes the high road of bit-rotated writes because the VPM/VDW are good at them and the same write code can then be used for both passes. But why DIT? Why not DIF? Because a bit-reversed write would be slow.

End-to-end data path through V3D hardware:

The TMU (texture and memory lookup unit) is good at fetching contiguous or scattered data from SDRAM, although it was not specifically optimised for bit-reversed reads. The TMU cache is direct-mapped, aliasing on 4 KB boundaries. Unfortunately, bit-reversal of consecutive ordinals makes the upper address bits toggle, causing collisions and evictions! Use of the VC4 L2 cache is optional and must be avoided (see the GPU_FFT_MEM_* #defines) for GPU_FFT and OpenGL to coexist. It is safe for VDW output to bypass the V3D internal caches, since they are smaller than the ping-pong data buffers.
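
A small numeric illustration of the aliasing (the data layout here is an assumption: 4096 interleaved complex floats, 8 bytes per point). Every one of these consecutive bit-reversed reads lands at the same offset within a 4 KB window, so a direct-mapped cache that aliases on 4 KB keeps evicting its own lines:

    #include <stdio.h>

    /* Same bit-reversal helper as in the earlier sketch. */
    static unsigned bit_reverse(unsigned x, unsigned bits)
    {
        unsigned r = 0;
        for (unsigned i = 0; i < bits; i++)
            r = (r << 1) | ((x >> i) & 1);
        return r;
    }

    int main(void)
    {
        const unsigned log2N = 12, bytes_per_point = 8;   /* assumed layout */
        for (unsigned i = 0; i < 8; i++) {
            unsigned addr = bit_reverse(i, log2N) * bytes_per_point;
            printf("ordinal %u -> byte offset %6u (offset %% 4096 = %u)\n",
                   i, addr, addr % 4096);
        }
        return 0;
    }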

Parallelism

The BCM2835 V3D QPU (quad processing unit) is a 16-way SIMD (single instruction, multiple data) processor. Registers and data paths are 16*32 bits wide, representing vectors of 16 values. Instructions can act upon all 16 values in parallel, or, optionally, on selected values; and the 16 can be rotated within a vector. These features are exploited in GPU_FFT to implement a very fast FFT_16 codelet. Working with complex numbers, only two registers are needed to store 16 real and 16 imaginary parts. The TMU has a convenient programming interface which accepts a vector of 16 potentially scattered addresses and returns 16 words from SDRAM.

BCM2835 V3D has 12 QPUs, but GPU_FFT only uses 8. It happens that bit-rotated writes can be done very neatly, through the VPM and VDW, by 8 QPUs working in parallel. One is the master and seven are slaves, synchronised using semaphores. QPUs can access the VPM (vertex pipeline memory) either horizontally or vertically. Each QPU outputs real and imaginary vectors to VPM columns. The master then writes VPM rows to SDRAM through the VDW as a series of bursts, with a stride between rows:

            QPU0                 QPU1           . . .         QPU6                  QPU7             SDRAM address
 re[i+0]  im[i+0]   re[i+16]  im[i+16]   . . .   re[i+96]   im[i+96]   re[i+112]  im[i+112]   base
 re[i+1]  im[i+1]   re[i+17]  im[i+17]   . . .   re[i+97]   im[i+97]   re[i+113]  im[i+113]   base + stride
 re[i+2]  im[i+2]   re[i+18]  im[i+18]   . . .   re[i+98]   im[i+98]   re[i+114]  im[i+114]   base + stride*2
 re[i+3]  im[i+3]   re[i+19]  im[i+19]   . . .   re[i+99]   im[i+99]   re[i+115]  im[i+115]   base + stride*3
    .        .          .         .       VPM       .          .           .           .          .
 re[i+12] im[i+12]  re[i+28]  im[i+28]   . . .   re[i+108]  im[i+108]  re[i+124]  im[i+124]   base + stride*12
 re[i+13] im[i+13]  re[i+29]  im[i+29]   . . .   re[i+109]  im[i+109]  re[i+125]  im[i+125]   base + stride*13
 re[i+14] im[i+14]  re[i+30]  im[i+30]   . . .   re[i+110]  im[i+110]  re[i+126]  im[i+126]   base + stride*14
 re[i+15] im[i+15]  re[i+31]  im[i+31]   . . .   re[i+111]  im[i+111]  re[i+127]  im[i+127]   base + stride*15

The VDW write setup STRIDE is specified as a 13-bit field on page 59 of the Videocore IV V3D Architecture Specification, but luckily the hardware actually supports 16 bits. The value specified in the setup is 64 bytes less than the true stride shown above, because the internal address register to which it is added has already advanced by 16*4 bytes.
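
In code form (the names are illustrative, not from the library):

    /* One VPM row holds 8 QPUs x (re, im) floats = 16 words = 64 bytes, and
       the VDW's internal address has already advanced by that much when the
       STRIDE field is added; hence the setup value is 64 less than the true
       row-to-row stride. */
    static unsigned vdw_setup_stride(unsigned true_stride_bytes)
    {
        unsigned row_bytes = 16 * 4;            /* 64 bytes per burst */
        return true_stride_bytes - row_bytes;   /* value placed in the setup */
    }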

Codelets

The QPUs are kept in lock-step. In the FFT-256 example, each QPU consumes 16-points of data per step. Together they process 16*8 = 128 points. Only two such steps per pass and two passes are required to complete the entire FFT-256. Longer transforms require more steps, more passes and GPU_FFT also uses FFT-32 and FFT-64 codelets as follows:

N           256  512  1024  2048  4096   8192   16384  32768  65536  128K     256K†    512K†    1M†
log2(N)     8    9    10    11    12     13     14     15     16     17       18       19       20
Passes      2    2    2     2     3      3      3      3      3      4        4        4        4
Structure*  4+4  5+4  5+5   6+5   4+4+4  5+4+4  5+5+4  5+5+5  6+5+5  5+4+4+4  5+4+4+5  5+4+5+5  5+5+5+5

*  4 = FFT-16;  5 = FFT-32;  6 = FFT-64
†  Release 2.0 BETA

Minimising the number of passes minimises SDRAM access. So why not use even larger codelets? Short answer: there are not enough registers. For the same reason, the FFT-64 codelet can only be used on the first pass because its twiddles can't be rotated.

Transforms with log2(N) = 18 and above require VDW write strides exceeding 65,535 bytes and must therefore issue multiple single-row VDW write commands.

The 5+4+4+5 structure has lower root-mean-square (rms) error than 5+5+4+4.
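
For quick reference, the structures in the table can be captured as a lookup keyed by log2(N). This is purely illustrative; the library itself selects a pre-built shader for each supported length:

    /* Pass structures from the table above, indexed by log2(N) = 8 ... 20.
       Reference aid only; not how GPU_FFT organises its shaders internally. */
    static const char *codelet_structure[21] = {
        [ 8] = "4+4",     [ 9] = "5+4",     [10] = "5+5",     [11] = "6+5",
        [12] = "4+4+4",   [13] = "5+4+4",   [14] = "5+5+4",   [15] = "5+5+5",
        [16] = "6+5+5",   [17] = "5+4+4+4", [18] = "5+4+4+5", [19] = "5+4+5+5",
        [20] = "5+5+5+5",
    };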

Complex arithmetic

All floating-point arithmetic in GPU_FFT is on complex numbers. This code performs 16 complex multiplications in parallel:
    #
    # complex p(r0, r1) = complex a(r0, r1) * complex b(ra_re, rb_im)
    #
    nop; fmul r3, r1, rb_im # ii = a.im*b.im
    nop; fmul r2, r0, ra_re # rr = a.re*b.re
    nop; fmul r1, r1, ra_re # ir = a.im*b.re
    nop; fmul r0, r0, rb_im # ri = a.re*b.im
    fadd r1, r0, r1         # p.im = ri + ir
    fsub r0, r2, r3         # p.re = rr - ii
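
For reference, the same arithmetic written per element in scalar C (the QPU performs all 16 lanes of this at once):

    /* Scalar equivalent of the QPU code above, one lane of the vector. */
    static void complex_mul(float a_re, float a_im, float b_re, float b_im,
                            float *p_re, float *p_im)
    {
        float ii = a_im * b_im;        /* fmul r3, r1, rb_im */
        float rr = a_re * b_re;        /* fmul r2, r0, ra_re */
        float ir = a_im * b_re;        /* fmul r1, r1, ra_re */
        float ri = a_re * b_im;        /* fmul r0, r0, rb_im */
        *p_im = ri + ir;               /* fadd r1, r0, r1    */
        *p_re = rr - ii;               /* fsub r0, r2, r3    */
    }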

Twiddles

Twiddle factors are unit vectors in the complex plane, which "rotate" as we work through the data. Rotation in this context is the angular not the bit-shift variety! The size and frequency of the rotations varies between passes. Twiddles in the final pass change on every loop. Twiddles in the first pass do not change at all. The general rule for a 4-pass transform with codelet structure a+b+c+d is as follows:

Pass               1    2      3     4
Codelet            a    b      c     d
Rotation (Δk)      -    c·d    d     1
Interval (points)  -    b·c·d  c·d   d

This 3-pass FFT-4096 flow graph helped me visualize the subtleties:

  DIT FFT 4096 (6MB zip of 33MB pdf) or download Butter.zip and run the executable.

Twiddle factors are applied to data within the codelets at the points indicated on the data flow. Each integer, k, represents a complex multiplication by the twiddle factor e^(−j2πk/N), with N = 4096 in this flow graph.

Ideally, twiddles should be calculated to a higher precision than the data; but the QPU only supports 32-bit single-precision floats. Trigonometric recurrences are used to minimise the loss of precision as successive twiddles are generated.
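
The exact recurrence lives in the shader source; purely as an illustration, one standard form of such a recurrence looks like this in C:

    #include <math.h>

    /* Illustration only (not GPU_FFT's code): step a unit vector (cos, sin)
       by a fixed angle delta without calling sin/cos at every step, which
       limits the accumulation of rounding error. */
    static void rotate_steps(double theta0, double delta, int steps,
                             double *c_out, double *s_out)
    {
        double alpha = 2.0 * sin(0.5 * delta) * sin(0.5 * delta); /* 2*sin^2(d/2) */
        double beta  = sin(delta);
        double c = cos(theta0), s = sin(theta0);
        for (int k = 0; k < steps; k++) {
            double cn = c - (alpha * c + beta * s);   /* cos(theta + delta) */
            double sn = s - (alpha * s - beta * c);   /* sin(theta + delta) */
            c = cn; s = sn;
        }
        *c_out = c; *s_out = s;
    }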

 

I wonder whether 64 bits could be used; but much code, and too many registers, are already dedicated to managing twiddles. Real and imaginary parts are stored at offsets from ra_tw_re and rb_tw_im respectively:

Offset           Codelet   Description
TW16_Pn_BASE     FFT-16    starting twiddles for nth pass
TW32_Pn_BASE     FFT-32    starting twiddles for nth pass
TW64_P1_BASE0    FFT-64    starting twiddles for pass 1
TW64_P1_BASE1    FFT-64    starting twiddles for pass 1
TW16_Pn_STEP     FFT-16    per-step rotation for nth pass
TW32_Pn_STEP     FFT-32    per-step rotation for nth pass
TW16_ACTIVE+0    FFT-16    1st stage unpacked twiddles
TW16_ACTIVE+1    FFT-16    2nd stage unpacked twiddles
TW16_ACTIVE+2    FFT-16    3rd stage unpacked twiddles
TW16_ACTIVE+3    FFT-16    current twiddles
TW32_ACTIVE      FFT-32    current twiddles

ACTIVE values are dynamically generated. BASE and STEP values are passed in through uniforms. Each of the 8 QPUs has different (TW_UNIQUE) BASE twiddles in the final pass. BASE values for earlier passes, and all STEP values, are common (TW_SHARED). Currently, all uniforms are copied to registers before pass 1; they should probably be loaded one pass at a time.

Twiddle management macros:

load_tw             Load BASE and STEP values from uniforms
body_fft_16         FFT-16 codelet
fft_twiddles_32     FFT-32 codelet helper
unpack_twiddles     Updates TW16_ACTIVE
next_twiddles_16    Rotate one step
next_twiddles_32    Rotate one step

Using TW#i notation to denote the i-th element of a vector, twiddles are packed in the TW16 and TW32 registers:

Element #0 of the TW16 registers is not used. Twiddles are "unpacked" to suit the 4 stages of the FFT-16 codelet and conditional flags control which elements are multiplied at each stage:

.macro body_fft_16
.rep i, 4
    and.setf -, elem_num, (1<<i)  # stage i: flag the lanes whose element number has bit i set; only flagged lanes get the twiddle multiply

TW16_ACTIVE+0 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1 TW16#0 TW16#1
TW16_ACTIVE+1 TW16#0 TW16#1 TW16#2 TW16#3 TW16#0 TW16#1 TW16#2 TW16#3 TW16#0 TW16#1 TW16#2 TW16#3 TW16#0 TW16#1 TW16#2 TW16#3
TW16_ACTIVE+2 TW16#0 TW16#1 TW16#2 TW16#3 TW16#4 TW16#5 TW16#6 TW16#7 TW16#0 TW16#1 TW16#2 TW16#3 TW16#4 TW16#5 TW16#6 TW16#7
TW16_ACTIVE+3 TW16#0 TW16#1 TW16#2 TW16#3 TW16#4 TW16#5 TW16#6 TW16#7 TW16#8 TW16#9 TW16#10 TW16#11 TW16#12 TW16#13 TW16#14 TW16#15
TW32_ACTIVE TW32#0 TW32#1 TW32#2 TW32#3 TW32#4 TW32#5 TW32#6 TW32#7 TW32#8 TW32#9 TW32#10 TW32#11 TW32#12 TW32#13 TW32#14 TW32#15

TMU read pipeline

Read-related macros:

read_rev, read_lin     Queue a TMU request for a bit-reversed or linear (sequential) ordinal read
load_rev, load_lin     Call read_* to queue another request and wait for the response to the previous one
interleave, swizzle    Swap request/response elements to improve cache efficiency
bit_rev                Bit-reversal primitive

At least one request is always kept queued, to avoid the performance penalty of TMU pipeline stalls. This means one request too many is issued, and one response must be thrown away after the loop. Pseudo-code:

    #######################################
    # Pass 1

    read_rev
    foreach step {
        load_rev
        ...
    }
    ldtmu0 # Wait for last (unwanted) response
    ldtmu0

    #######################################
    # Pass 2

    read_lin
    foreach step {
        load_lin
        ...
    }
    ldtmu0 # Wait for last (unwanted) response
    ldtmu0

Write subtleties

VPM and VDW write macros:

Macro                   Caller   Description
inst_vpm                Both     Setup VPM column write commands for this QPU instance (0...7)
write_vpm_16            Both     Write codelet output to VPM columns
write_vpm_32                     (the FFT_32 and FFT_64 final butterfly stage is done here)
write_vpm_64
body_ra_save_16         Master   write_vpm_*, sync with slaves and queue VDW write
body_ra_save_32
body_ra_save_64
body_ra_sync            Master   Sync with slaves
body_rx_save_slave_16   Slaves   write_vpm_* and sync with master
body_rx_save_slave_32
body_rx_save_slave_64
body_rx_sync_slave      Slaves   Sync with master

FFT_16 and FFT_32 codelet VPM writes are double-buffered, allowing the slaves to proceed into the next step without waiting for the VDW write. FFT_16 alternates between the first 16 and second 16 VPM rows. FFT_32 alternates between the first 32 and second 32 VPM rows. FFT_64 is not double-buffered because user shaders can only use the first 64 rows. Note how the order of semaphore instructions in the FFT_64 body_*_save_* macros differs from the others. Notice also the use of VPM readbacks to ensure preceding writes have completed; and the postponement of vw_wait until the last moment.

Mailbox

GPU_FFT uses a kernel character device known as the "mailbox" for communication between the ARM and the BCM2835 VPU. The VPU, which runs start.elf from the SD card boot partition, is a powerful (and on the Pi under-used) vector processor that manages the Videocore hardware. GPU_FFT uses two mailbox API functions:

    unsigned qpu_enable (int file_desc, unsigned enable);
    unsigned execute_qpu(int file_desc, unsigned num_qpus, unsigned control, unsigned noflush, unsigned timeout);

The first enables or disables the GRAFX (V3D) power domain and opens a V3D driver handle for running user shaders. The second executes one or more shaders on the QPUs.
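
A sketch of the bare call sequence, assuming the mbox_open/mbox_close helpers from the mailbox.c distributed with hello_fft; the control argument is the bus address of the per-QPU (uniforms, code) pointer pairs, which is set up elsewhere:

    #include "mailbox.h"   /* mbox_open, qpu_enable, execute_qpu */

    /* Run 8 QPU shader instances described by 'control'. */
    unsigned run_shaders(unsigned control)
    {
        int mb = mbox_open();                                /* mailbox char device */
        qpu_enable(mb, 1);                                   /* power up V3D        */
        unsigned r = execute_qpu(mb, 8, control, 1, 10000);  /* 8 QPUs, no flush    */
        qpu_enable(mb, 0);                                   /* power down          */
        mbox_close(mb);
        return r;
    }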

Unfortunately, calling through the mailbox involves a kernel context switch, incurring a ~ 100µs performance penalty. This is not an issue for qpu_enable(), which need only be called twice; however, it is a significant overhead on execute_qpu() when computing short transforms. This is why GPU_FFT 1.0 allowed batches of transforms to be executed with a single mailbox call. GPU_FFT 2.0 still supports batch mode; but, when the expected shader runtime is below a #defined threshold, it avoids the mailbox overhead by poking V3D hardware registers directly from the ARM side and busy-waiting for shader completion.

V3D hardware registers are documented in the Videocore IV V3D Architecture Specification.

Stockham autosort FFT

Stockham autosort FFT algorithms access data in "natural" order, avoiding the bit-reversal or "shuffle" step required by Cooley-Tukey. Radix-2 Stockham algorithms perform N-point transforms in log2(N) passes, effectively shuffling indices 1-bit per pass as they go. The number of passes can be reduced with higher-radix codelets. Based on work published by Microsoft Research [1], I've prototyped a mixed-radix Stockham algorithm in "C" to investigate its properties. Replacing read_rev with read_lin in the first pass of the GPU_FFT 4k-point shader, to assess the likely performance, yielded a 15% reduction in runtime.
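
To make the natural-order access pattern concrete, here is a textbook radix-2 Stockham transform in C (an illustration only, not the mixed-radix prototype mentioned above):

    #include <complex.h>
    #include <math.h>
    #include <string.h>

    /* Out-of-place radix-2 Stockham FFT; n must be a power of two.
       Each pass reads and writes both buffers in natural (increasing) order,
       ping-ponging between x and the work buffer y and shuffling indices one
       bit per pass instead of doing an up-front bit reversal. */
    static void stockham_fft(int n, double complex *x, double complex *y)
    {
        double complex *a = x, *b = y;
        for (int stride = 1, half = n / 2; half >= 1; stride *= 2, half /= 2) {
            double theta = -M_PI / half;                  /* -2*pi / (2*half) */
            for (int p = 0; p < half; p++) {
                double complex w = cexp(I * (theta * p)); /* twiddle, group p */
                for (int q = 0; q < stride; q++) {
                    double complex u = a[q + stride * p];
                    double complex v = a[q + stride * (p + half)];
                    b[q + stride * (2 * p)]     = u + v;
                    b[q + stride * (2 * p + 1)] = (u - v) * w;
                }
            }
            double complex *t = a; a = b; b = t;          /* ping-pong buffers */
        }
        if (a != x)                                       /* ensure result in x */
            memcpy(x, a, (size_t)n * sizeof *x);
    }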

References

1. High Performance Discrete Fourier Transforms on Graphics Processors
