GPU_FFT release 2.0 is a Fast Fourier Transform library for the Raspberry Pi which exploits the BCM2835 SoC GPU hardware to deliver ten times more data throughput than is possible on the 700 MHz ARM. Kernels are provided for all power-of-2 FFT lengths between 256 and 2,097,152 points inclusive. GPU_FFT release 1.0 supported up to 131,072 points.
Raspberry Pi Foundation CEO Eben Upton attended the Radio Society of Great Britain (RSGB) 2013 Annual Convention, where radio amateurs told him that the FFT performance of the 700 MHz ARM limited the Pi's usefulness in Software Defined Radio (SDR) applications. That was the impetus which led me to develop GPU_FFT; and I wish to thank Eben Upton, Dom Cobley, Tim Mamtora, Gary Keall and Dave Emett for helping.
GPU_FFT runs on the quad processing units (QPUs) in the BCM2835 V3D block. There is a huge amount of interest in general-purpose programming of graphics hardware; and, in February 2014, Broadcom gave the Pi community a fantastic present by publishing the Videocore IV V3D Architecture Specification. Now, anybody can develop code for the GPU and some pioneers already have. VC4ASM is a full-featured macro assembler by Marcel Müller.
cd /opt/vc/src/hello_pi/hello_fft
make
sudo mknod char_dev c 100 0
sudo ./hello_fft.bin 12
These files contain all the information you need to use GPU_FFT:
hello_fft.c     Demo source code
hello_fft_2d.c  2D demo source code
gpu_fft.txt     User Guide
The hello_fft demo isn't very spectacular. It just computes some performance figures. The newly released hello_fft_2d demo is a bit more visual:
make hello_fft_2d.bin
sudo ./hello_fft_2d.bin
It generated the above image. Now back to how GPU_FFT works ...
Decimation in time  Decimation in frequency  

Say we could accommodate four complex numbers in registers. Looking at the above DIT data flow graph, if we loaded x(0), x(4), x(2) and x(6) from memory, we could compute the first two butterfly stages in registers; temporarily save the result to memory; and do the same for the odd-numbered ordinals 1, 5, 3 and 7; but then we hit a problem. The inputs to the third butterfly stage do not fit in registers, and it gets worse with more stages. We will always run out of registers eventually, even if we have quite a lot, as the BCM2835 QPU does.
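The register-blocking idea can be sketched in a few lines of Python (an illustration of the principle with hypothetical helper names, not GPU_FFT code): the first two butterfly stages of an 8-point DIT transform fit in two groups of four "registers", but every output of the third stage needs inputs from both groups.

```python
import cmath

def dft(x):
    # Direct O(N^2) DFT for reference
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def two_stages(a0, a1, a2, a3):
    # First two DIT butterfly stages on four values held "in registers";
    # inputs arrive in bit-reversed order, outputs are a 4-point FFT
    b0, b1 = a0 + a1, a0 - a1
    b2, b3 = a2 + a3, a2 - a3
    return [b0 + b2, b1 - 1j * b3, b0 - b2, b1 + 1j * b3]

def fft8(x):
    # Group 1: even ordinals loaded as 0, 4, 2, 6
    A = two_stages(x[0], x[4], x[2], x[6])
    # Group 2: odd ordinals 1, 5, 3, 7 (after "saving" group 1 to memory)
    B = two_stages(x[1], x[5], x[3], x[7])
    # Third stage: every output depends on BOTH groups, so the working
    # set no longer fits in one register group
    W = [cmath.exp(-2j * cmath.pi * k / 8) for k in range(4)]
    return ([A[k] + W[k] * B[k] for k in range(4)] +
            [A[k] - W[k] * B[k] for k in range(4)])

x = [complex(n * n, -n) for n in range(8)]
assert all(abs(p - q) < 1e-9 for p, q in zip(fft8(x), dft(x)))
```

GPU_FFT applies the same blocking at a larger scale, with FFT_16 codelets in place of the 4-value groups.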
DIT FFT 256 classic DIT FFT 256 practical 
The classic data flow graph illustrates the above-mentioned implementation challenge: the number of inputs to the butterflies doubles at each stage. In the practical flow, extra "twists" (reorderings) have been inserted after the fourth and final stages; no butterfly is then a function of more than sixteen inputs; all computation can be performed in registers using an FFT_16 codelet; and SDRAM is only read/written twice. The extra twists are bit-rotations, not reversals. Ordinals are circularly shifted by four. Doing so twice in an eight-stage transform leaves outputs in the correct positions. Butterfly wing span doubles from 16 to 32 between the 4th and 5th stages of the classic flow. Rotating ordinals by 4 bits drops wing span back down to 2 again in the practical flow.
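The twist is a cheap bit operation. A Python sketch (hypothetical helper names): circularly rotating an 8-bit ordinal by four bit positions, applied twice across the eight-stage 256-point transform, is the identity; unlike bit-reversal, it also maps consecutive ordinals to a constant stride.

```python
def rot4(n, bits=8):
    # Circularly rotate a bits-wide ordinal by four bit positions
    return ((n >> 4) | (n << (bits - 4))) & ((1 << bits) - 1)

def bit_rev(n, bits=8):
    # Full bit-reversal, for comparison
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r

# Two rotations in an eight-stage (256-point) transform are the identity,
# so outputs end up in their natural positions
assert all(rot4(rot4(n)) == n for n in range(256))

# A run of consecutive ordinals maps to a constant stride of 16,
# which suits strided VDW row writes; bit-reversal scatters them
assert [rot4(n) for n in range(16)] == [16 * n for n in range(16)]
```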
The above PDF files were generated by a C program, Butter, which was written to visualize how twiddles change as we step through the data. You will also need libHaru to build it, or, you can run the zipped executable if you trust me. I promise it is safe. The zipped code generates a 4096-point "practical" flow graph; but the #defines can be changed for other lengths.
GPU_FFT takes the high road because the VPM/VDW are good at bit-rotated writes and the same write code can then be used for both passes. But why DIT? Why not DIF? Because a bit-reversed write would be slow.
End-to-end data path through the V3D hardware, from SDRAM through the TMU, QPUs, VPM and VDW, and back to SDRAM:
The TMU (texture & memory lookup unit) is good at fetching contiguous or scattered data from SDRAM, although it was not specifically optimised for bit-reversed reads. The TMU cache is directly-mapped, aliasing on 4k boundaries. Unfortunately, bit-reversal of consecutive ordinals makes upper address bits toggle, causing collisions and evictions! Use of the VC4 L2 cache is optional and must be avoided (see the GPU_FFT_MEM_* #defines) for GPU_FFT and OpenGL to coexist. It is safe for VDW output to bypass the V3D internal caches, since they are smaller than the ping-pong data buffers.
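The collision effect is easy to demonstrate in Python (illustrative; a 4096-point transform and 8-byte interleaved complex floats are assumed): the first several bit-reversed read addresses all land on the same 4k alias, so each fetch evicts the line brought in by the previous one.

```python
def bit_rev(n, bits):
    # Reverse the low `bits` bits of n
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r

N_BITS = 12          # assumed: 4096-point transform
CPLX_BYTES = 8       # assumed: interleaved single-precision re/im pair
ALIAS = 4096         # direct-mapped TMU cache aliases on 4k boundaries

# Byte offsets of the first eight bit-reversed reads
addrs = [bit_rev(n, N_BITS) * CPLX_BYTES for n in range(8)]

# All eight map to the same cache alias: each fetch evicts the last
assert len({a % ALIAS for a in addrs}) == 1
```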
BCM2835 V3D has 12 QPUs; but GPU_FFT only uses 8. It happens that bit-rotated writes can be done very neatly, through the VPM and VDW, by 8 QPUs working in parallel. One is the master and 7 are slaves, synchronised using semaphores. QPUs can access the VPM (vertex pipeline memory) either horizontally or vertically. Each QPU outputs real and imaginary vectors to VPM columns. The master then writes VPM rows to SDRAM through the VDW as a series of bursts, with a stride between rows:
              QPU_0                 QPU_1        . . .       QPU_6                  QPU_7
re_{i+0}  im_{i+0}    re_{i+16} im_{i+16}   . . .   re_{i+96}  im_{i+96}    re_{i+112} im_{i+112}   base
re_{i+1}  im_{i+1}    re_{i+17} im_{i+17}   . . .   re_{i+97}  im_{i+97}    re_{i+113} im_{i+113}   base + stride
re_{i+2}  im_{i+2}    re_{i+18} im_{i+18}   . . .   re_{i+98}  im_{i+98}    re_{i+114} im_{i+114}   base + stride*2
re_{i+3}  im_{i+3}    re_{i+19} im_{i+19}   . . .   re_{i+99}  im_{i+99}    re_{i+115} im_{i+115}   base + stride*3
    .         .           .         .        VPM        .          .             .          .           .
re_{i+12} im_{i+12}   re_{i+28} im_{i+28}   . . .   re_{i+108} im_{i+108}   re_{i+124} im_{i+124}   base + stride*12
re_{i+13} im_{i+13}   re_{i+29} im_{i+29}   . . .   re_{i+109} im_{i+109}   re_{i+125} im_{i+125}   base + stride*13
re_{i+14} im_{i+14}   re_{i+30} im_{i+30}   . . .   re_{i+110} im_{i+110}   re_{i+126} im_{i+126}   base + stride*14
re_{i+15} im_{i+15}   re_{i+31} im_{i+31}   . . .   re_{i+111} im_{i+111}   re_{i+127} im_{i+127}   base + stride*15
The VDW write setup STRIDE is specified as a 13-bit field on page 59 of the Videocore IV V3D Architecture Specification, but luckily the hardware actually supports 16 bits. The value specified in the setup is 64 bytes less than the true stride shown above because the internal address register to which it is added has already advanced 16*4 bytes.
N  256  512  1024  2048  4096  8192  16384  32768  65536  128K  256K †  512K †  1M †  2M † 
log_{2}N  8  9  10  11  12  13  14  15  16  17  18  19  20  21 
Passes  2  2  2  2  3  3  3  3  3  4  4  4  4  4 
Structure *  4+4  5+4  5+5  6+5  4+4+4  5+4+4  5+5+4  5+5+5  6+5+5  5+4+4+4  5+4+4+5  5+4+5+5  5+5+5+5  6+5+5+5 
Minimising the number of passes minimises SDRAM access. So why not use even larger codelets? Short answer: there are not enough registers. For the same reason, the FFT_64 codelet can only be used on the first pass, because its twiddles can't be rotated.
log_{2}N=18 and longer transforms require VDW write strides exceeding 65535 bytes and must therefore issue multiple single-row VDW write commands.
5+4+4+5 has lower rootmeansquare (rms) error than 5+5+4+4.
#
# complex p(r0, r1) = complex a(r0, r1) * complex b(ra_re, rb_im)
#
nop;  fmul r3, r1, rb_im    # ii = a.im*b.im
nop;  fmul r2, r0, ra_re    # rr = a.re*b.re
nop;  fmul r1, r1, ra_re    # ir = a.im*b.re
nop;  fmul r0, r0, rb_im    # ri = a.re*b.im
fadd  r1, r0, r1            # p.im = ri + ir
fsub  r0, r2, r3            # p.re = rr - ii
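The dataflow of this 4-multiply, add-and-subtract product can be checked against an ordinary complex multiply; a quick Python sanity check (the values are arbitrary):

```python
a, b = complex(1.5, -2.0), complex(0.25, 3.0)

# Mirror the four fmul results ...
ii = a.imag * b.imag   # r3
rr = a.real * b.real   # r2
ir = a.imag * b.real   # r1
ri = a.real * b.imag   # r0

# ... and the fsub/fadd combinations
p = complex(rr - ii, ri + ir)

assert p == a * b
```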
Pass               1     2      3     4
Codelet            a     b      c     d
Rotation (Δk)      —    c.d     d     1
Interval (points)  —   b.c.d   c.d    d
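Reading a, b, c, d as the point counts of the four per-pass codelets, the two table rows follow a simple product rule; a hypothetical Python helper (not GPU_FFT code):

```python
from math import prod

def twiddle_schedule(sizes):
    # sizes: point count of the codelet used on each pass, first to last.
    # Pass 1 twiddles are fixed; each later pass p rotates its twiddles
    # by delta-k after every `interval` points.
    P = len(sizes)
    rotation = {p: prod(sizes[p:]) for p in range(2, P + 1)}
    interval = {p: prod(sizes[p - 1:]) for p in range(2, P + 1)}
    return rotation, interval

a, b, c, d = 16, 16, 16, 32          # example sizes, chosen arbitrarily
rotation, interval = twiddle_schedule((a, b, c, d))
assert rotation == {2: c * d, 3: d, 4: 1}
assert interval == {2: b * c * d, 3: c * d, 4: d}
```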
This 3-pass FFT 4096 flow graph helped me visualize the subtleties:
DIT FFT 4096 (6MB zip of 33MB pdf) or download Butter.zip and run the executable. 
Twiddle factors are applied to data within the codelets at the points indicated on the data flow. Each integer, k, represents a complex multiplication by the twiddle factor e^{-j2πk/N}, where N is the transform length.
Ideally, twiddles should be calculated to a higher precision than the data; but the QPU only supports 32-bit single-precision floats. Trigonometric recurrences are used to minimise loss of precision.
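One standard recurrence of this kind steps the angle by δ using α = 2·sin²(δ/2) and β = sin δ, so that cos(θ+δ) = cos θ − (α·cos θ + β·sin θ) and sin(θ+δ) = sin θ − (α·sin θ − β·cos θ); computing α this way avoids the cancellation in 1 − cos δ for small δ. Whether GPU_FFT uses exactly this form is best checked in the source, but the idea can be demonstrated in Python:

```python
import math

delta = 2 * math.pi / 4096
alpha = 2 * math.sin(delta / 2) ** 2     # stable form of 1 - cos(delta)
beta = math.sin(delta)

c, s = 1.0, 0.0                          # cos(0), sin(0)
for _ in range(1024):                    # advance a quarter turn in 1024 steps
    c, s = c - (alpha * c + beta * s), s - (alpha * s - beta * c)

# After 1024 steps the angle is pi/2: cos -> 0, sin -> 1
assert abs(c) < 1e-9 and abs(s - 1.0) < 1e-9
```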
Real and imaginary parts are stored at offsets from ra_tw_re and rb_tw_im respectively:
Offset  Codelet  Description 
TW16_Pn_BASE  FFT16  starting twiddles for nth pass 
TW32_Pn_BASE  FFT32  
TW64_P1_BASE0  FFT64  
TW64_P1_BASE1  
TW16_Pn_STEP  FFT16  per step rotation for nth pass 
TW32_Pn_STEP  FFT32  
TW16_ACTIVE+0  FFT16  1st stage unpacked twiddles 
TW16_ACTIVE+1  2nd stage unpacked twiddles  
TW16_ACTIVE+2  3rd stage unpacked twiddles  
TW16_ACTIVE+3  current twiddles  
TW32_ACTIVE  FFT32 
ACTIVE values are dynamically generated. BASE and STEP values are passed in through uniforms. Each of the 8 QPUs has different (TW_UNIQUE) BASE twiddles in the final pass. BASE values for earlier passes and all STEP values are common (TW_SHARED). Currently, all uniforms are copied to registers before pass 1. Probably, they should be loaded one pass at a time.
Twiddle management macros:
load_tw  Load BASE and STEP from uniforms 
body_fft_16  FFT16 codelet 
fft_twiddles_32  FFT32 codelet helper 
unpack_twiddles  updates TW16_ACTIVE 
next_twiddles_16  rotate one step 
next_twiddles_32 
Using TW#i notation to denote the i-th element of a vector, twiddles are packed in the TW16 and TW32 registers:
Element #0 of the TW16 registers is not used. Twiddles are "unpacked" to suit the 4 stages of the FFT16 codelet and conditional flags control which elements are multiplied at each stage:
.macro body_fft_16
    .rep i, 4
        and.setf -, elem_num, (1<<i)
TW16_ACTIVE+0  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1 
TW16_ACTIVE+1  TW16#0  TW16#1  TW16#2  TW16#3  TW16#0  TW16#1  TW16#2  TW16#3  TW16#0  TW16#1  TW16#2  TW16#3  TW16#0  TW16#1  TW16#2  TW16#3 
TW16_ACTIVE+2  TW16#0  TW16#1  TW16#2  TW16#3  TW16#4  TW16#5  TW16#6  TW16#7  TW16#0  TW16#1  TW16#2  TW16#3  TW16#4  TW16#5  TW16#6  TW16#7 
TW16_ACTIVE+3  TW16#0  TW16#1  TW16#2  TW16#3  TW16#4  TW16#5  TW16#6  TW16#7  TW16#8  TW16#9  TW16#10  TW16#11  TW16#12  TW16#13  TW16#14  TW16#15 
TW32_ACTIVE  TW32#0  TW32#1  TW32#2  TW32#3  TW32#4  TW32#5  TW32#6  TW32#7  TW32#8  TW32#9  TW32#10  TW32#11  TW32#12  TW32#13  TW32#14  TW32#15 
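The unpack pattern in the table and the flag test in body_fft_16 can be modelled together (illustrative Python, not GPU_FFT code): at stage i, element e of the vector is multiplied by its twiddle iff bit i of elem_num is set, and the active register holds TW16#(e mod 2^(i+1)).

```python
def unpack(stage):
    # Element e of TW16_ACTIVE+stage holds TW16#(e mod 2^(stage+1))
    return [e % (1 << (stage + 1)) for e in range(16)]

# Matches the four TW16_ACTIVE rows in the table above
assert unpack(0) == [0, 1] * 8
assert unpack(1) == [0, 1, 2, 3] * 4
assert unpack(2) == list(range(8)) * 2
assert unpack(3) == list(range(16))

# and.setf -, elem_num, (1<<i): at stage i, element e is multiplied
# by its twiddle iff bit i of e is set -- half the elements per stage
for i in range(4):
    assert sum(1 for e in range(16) if e & (1 << i)) == 8
```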
read_rev  Queue TMU request for bit-reversed or linear (sequential) ordinal read 
read_lin  
load_rev  Call read_* to queue another request and wait for response from previous 
load_lin  
interleave  Swap request/response elements to improve cache efficiency 
swizzle  
bit_rev  Bit-reversal primitive 
At least one request is always queued to avoid the performance penalty of TMU pipeline stalls. One too many requests are issued and one response must be thrown away after the loop. Pseudocode:
#######################################
# Pass 1
read_rev
foreach step {
    load_rev
    ...
}
ldtmu0    # Wait for last (unwanted) response
ldtmu0

#######################################
# Pass 2
read_lin
foreach step {
    load_lin
    ...
}
ldtmu0    # Wait for last (unwanted) response
ldtmu0
Macro  Caller  Description 
inst_vpm  Both  Setup VPM column write commands for this QPU instance (0...7) 
write_vpm_16  Write codelet output to VPM columns. FFT_32 and FFT_64 final butterfly stage is done here  
write_vpm_32  
write_vpm_64  
body_ra_save_16  Master  write_vpm_*, sync with slaves and queue VDW write 
body_ra_save_32  
body_ra_save_64  
body_ra_sync  Sync with slaves  
body_rx_save_slave_16  Slaves  write_vpm_* and sync with master 
body_rx_save_slave_32  
body_rx_save_slave_64  
body_rx_sync_slave  Sync with master 
FFT_16 and FFT_32 codelet VPM writes are double-buffered, allowing the slaves to proceed into the next step without waiting for the VDW write. FFT_16 alternates between the first 16 and second 16 VPM rows. FFT_32 alternates between the first 32 and second 32 VPM rows. FFT_64 is not double-buffered because user shaders can only use the first 64 rows. Note how the order of semaphore instructions in the FFT_64 body_*_save_* macros differs from the others. Notice also the use of VPM readbacks to ensure preceding writes have completed; and the postponement of vw_wait until the last moment.
unsigned qpu_enable (int file_desc, unsigned enable);
unsigned execute_qpu(int file_desc, unsigned num_qpus, unsigned control,
                     unsigned noflush, unsigned timeout);

The first enables or disables the GRAFX (V3D) power domain and opens a V3D driver handle for running user shaders. The second executes one or more shaders on the QPUs.
Unfortunately, calling through the mailbox involves a kernel context switch, incurring a ~100µs performance penalty. This is not an issue for qpu_enable(), which need only be called twice; however, it is a significant overhead on execute_qpu() when computing short transforms. This is why GPU_FFT 1.0 allowed batches of transforms to be executed with a single mailbox call. GPU_FFT 2.0 still supports batch mode; but, when the expected shader runtime is below a #defined threshold, it avoids the mailbox overhead by poking V3D hardware registers directly from the ARM side and busy-waiting for shader completion.
V3D hardware registers are documented in the Videocore IV V3D Architecture Specification.
Copyright © Andrew Holme, 2014. 