GPU_FFT release 3.0 is a Fast Fourier Transform library for the Raspberry Pi which exploits the BCM2835 SoC GPU hardware to deliver ten times more data throughput than is possible on the 700 MHz ARM of the original Raspberry Pi 1. Kernels are provided for all power-of-2 FFT lengths between 256 and 4,194,304 points inclusive. Accuracy has been significantly improved, without compromising performance, using a novel floating-point precision enhancement technique.
Raspberry Pi Foundation CEO Eben Upton attended the Radio Society of Great Britain (RSGB) 2013 Annual Convention, where radio amateurs told him that the FFT performance of the Pi's 700 MHz ARM limited its usefulness in Software Defined Radio (SDR) applications. That was the impetus which led me to develop GPU_FFT; and I wish to thank Eben Upton, Dom Cobley, Tim Mamtora, Gary Keall and Dave Emett for helping.
GPU_FFT runs on the quad processing units (QPUs) in the BCM2835 V3D block. There is a huge amount of interest in general-purpose programming of graphics hardware; and, in February 2014, Broadcom gave the Pi community a fantastic present by publishing the Videocore IV V3D Architecture Specification. Now, anybody can develop code for the GPU, and some pioneers already have. VC4ASM is a full-featured macro assembler by Marcel Müller.

Similar throughput is now achievable using NEON instructions on the Pi 2 ARM v7; however, GPU_FFT remains useful on the Pi 1 and as an example of general-purpose GPU (GPGPU) programming.
cd /opt/vc/src/hello_pi/hello_fft make sudo ./hello_fft.bin 12
These files contain all the information you need to use GPU_FFT:
hello_fft.c     Demo source code
hello_fft_2d.c  2D demo source code
gpu_fft.txt     User Guide
The hello_fft demo isn't very spectacular. It just computes some performance figures. The newly released hello_fft_2d demo is a bit more visual:
make hello_fft_2d.bin
sudo ./hello_fft_2d.bin
It generated the above image. Now back to how GPU_FFT works ...
[Flow graphs: decimation in time (DIT); decimation in frequency (DIF)]
Say we could accommodate four complex numbers in registers. Looking at the above DIT data flow graph, if we loaded x(0), x(4), x(2) and x(6) from memory, we could compute the first two butterfly stages in registers; temporarily save the result to memory; and do the same for the odd-numbered ordinals 1, 5, 3 and 7; but then we hit a problem. The inputs to the third butterfly stage do not fit in registers, and it gets worse with more stages. We will always run out of registers eventually, even if we have quite a lot, as the BCM2835 QPU does.
[Flow graph PDFs: DIT FFT 256 classic; DIT FFT 256 practical]
The classic data flow graph illustrates the above-mentioned implementational challenge: the number of inputs to the butterflies doubles at each stage. In the practical flow, extra "twists" (reorderings) have been inserted after the fourth and final stages; no butterfly is then a function of more than sixteen inputs; all computation can be performed in registers using an FFT_16 codelet; and SDRAM is only read/written twice. The extra twists are bit-rotations, not reversals. Ordinals are circularly shifted by four. Doing so twice in an eight-stage transform leaves outputs in the correct positions. Butterfly wing span doubles from 16 to 32 between the 4th and 5th stages of the classic flow. Rotating ordinals by 4 bits drops wing span back down to 2 again in the practical flow.
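For the eight-stage (256-point) case, the twist is just a nibble swap of the 8-bit ordinal, and applying it twice is the identity, which is why two twists leave outputs in the correct positions. A sketch with a hypothetical helper name:

```c
#include <assert.h>

/* Circularly rotate the low 8 bits of an ordinal by 4 bits
   (a nibble swap): the "twist" reordering for a 256-point FFT. */
static unsigned rot8_by4(unsigned i) {
    return ((i >> 4) | (i << 4)) & 0xFF;
}
```

Applying `rot8_by4` twice rotates by a full 8 bits, restoring every ordinal.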
The above PDF files were generated by a C program, Butter, which was written to visualize how twiddles change as we step through the data. You will also need libHaru to build it; or you can run the zipped executable if you trust me. I promise it is safe. The zipped code generates a 4096-point "practical" flow graph, but the #defines can be changed for other lengths.
GPU_FFT takes the high road because the VPM/VDW are good at bit-rotated writes and the same write code can then be used for both passes. But why DIT? Why not DIF? Because a bit-reversed write would be slow.
End-to-end data path through V3D hardware on Pi 1:
The TMU (texture & memory lookup unit) is good at fetching contiguous or scattered data from SDRAM, although it was not specifically optimised for bit-reversed reads. The TMU cache is directly-mapped, aliasing on 4k boundaries. Unfortunately, bit-reversal of consecutive ordinals makes upper address bits toggle, causing collisions and evictions! VC4 L2 cache can only be used on Pi 1, and must be avoided for GPU_FFT and OpenGL to coexist. The ARM does not access SDRAM through VC4 L2 cache on Pi 2. It is safe for VDW output to bypass V3D internal caches, since they are smaller than the ping-pong data buffers.
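The collision is easy to demonstrate numerically. Assuming 8-byte complex samples, a 1M-point (20-bit ordinal) transform, and a direct-mapped 4 KB cache indexed by the low 12 address bits (illustrative figures, not the exact TMU geometry), consecutive ordinals bit-reverse to addresses with identical cache indices:

```c
#include <assert.h>

/* Bit-reverse a 20-bit ordinal (1M-point transform). */
static unsigned rev20(unsigned i) {
    unsigned r = 0;
    for (int b = 0; b < 20; b++) r = (r << 1) | ((i >> b) & 1);
    return r;
}

/* Index within a direct-mapped 4 KB cache: low 12 bits of the
   byte address (samples assumed 8 bytes each). */
static unsigned cache_index(unsigned addr) {
    return addr & 0xFFF;
}
```

Ordinals 0 and 1 reverse to 0 and 0x80000; as byte addresses they differ only in bits above the cache index, so they alias to the same line.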
BCM2835 V3D has 12 QPUs; but GPU_FFT only uses 8. It happens that bit-rotated writes can be done very neatly, through the VPM and VDW, by 8 QPUs working in parallel. One is master and 7 are slaves, synchronised using semaphores. QPUs can access the VPM (vertex pipeline memory) either horizontally or vertically. Each QPU outputs real and imaginary vectors to VPM columns. The master then writes VPM rows to SDRAM through the VDW as a series of bursts, with a stride between rows:
            QPU_{0}                  QPU_{1}        ...        QPU_{6}                  QPU_{7}
re_{i+0}   im_{i+0}    re_{i+16}   im_{i+16}   ...   re_{i+96}   im_{i+96}    re_{i+112}  im_{i+112}    base
re_{i+1}   im_{i+1}    re_{i+17}   im_{i+17}   ...   re_{i+97}   im_{i+97}    re_{i+113}  im_{i+113}    base + stride
re_{i+2}   im_{i+2}    re_{i+18}   im_{i+18}   ...   re_{i+98}   im_{i+98}    re_{i+114}  im_{i+114}    base + stride*2
re_{i+3}   im_{i+3}    re_{i+19}   im_{i+19}   ...   re_{i+99}   im_{i+99}    re_{i+115}  im_{i+115}    base + stride*3
    .          .           .           .       VPM       .           .             .           .             .
re_{i+12}  im_{i+12}   re_{i+28}   im_{i+28}   ...   re_{i+108}  im_{i+108}   re_{i+124}  im_{i+124}    base + stride*12
re_{i+13}  im_{i+13}   re_{i+29}   im_{i+29}   ...   re_{i+109}  im_{i+109}   re_{i+125}  im_{i+125}    base + stride*13
re_{i+14}  im_{i+14}   re_{i+30}   im_{i+30}   ...   re_{i+110}  im_{i+110}   re_{i+126}  im_{i+126}    base + stride*14
re_{i+15}  im_{i+15}   re_{i+31}   im_{i+31}   ...   re_{i+111}  im_{i+111}   re_{i+127}  im_{i+127}    base + stride*15
The VDW write setup STRIDE is specified as a 13-bit field on page 59 of the Videocore IV V3D Architecture Specification, but luckily the hardware actually supports 16 bits. The value specified in the setup is 64 bytes less than the true stride shown above, because the internal address register to which it is added has already advanced 16*4 bytes.
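As arithmetic (the helper name here is hypothetical, not from the GPU_FFT source):

```c
#include <assert.h>

/* VDW write setup STRIDE field: the true row stride minus the
   16*4 bytes the internal address register has already advanced. */
static unsigned vdw_stride_field(unsigned true_stride_bytes) {
    return true_stride_bytes - 16 * 4;
}
```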
N  256  512  1024  2048  4096  8192  16384  32768  65536  128K  256K  512K  1M  2M  4M 
log_{2}N  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22 
Passes  2  2  2  2  3  3  3  3  3  4  4  4  4  4  4 
Structure †  4+4  5+4  5+5  6+5  4+4+4  5+4+4  5+5+4  5+5+5  6+5+5  5+4+4+4  5+4+4+5  5+4+5+5  5+5+5+5  6+5+5+5  6+6+5+5 
Minimising the number of passes minimises SDRAM access.
log_{2}N=18 and longer transforms require VDW write strides exceeding 65535 bytes and must therefore issue multiple single-row VDW write commands.
5+4+4+5 has lower root-mean-square (rms) error than 5+5+4+4.
128K and 256K could be done as 6+6+5 and 6+6+6. When these shaders were originally written, the FFT64 codelet could only be used in the first pass. Now it can be used in any pass, and these could be optimised.
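The structure column can be sanity-checked: the per-pass radix exponents must sum to log_{2}N. A trivial check in C (hypothetical helper name):

```c
#include <assert.h>

/* A pass structure such as "4+4+4" lists the log2 radix of each pass.
   The exponents must sum to log2 N, e.g. 4096 points = 2^(4+4+4). */
static unsigned n_from_structure(const unsigned *exps, unsigned passes) {
    unsigned total = 0;
    for (unsigned i = 0; i < passes; i++) total += exps[i];
    return 1u << total;
}
```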
#
# complex p(r0, r1) = complex a(r0, r1) * complex b(ra_re, rb_im)
#
nop;            fmul r3, r1, rb_im  # ii = a.im*b.im
nop;            fmul r2, r0, ra_re  # rr = a.re*b.re
nop;            fmul r1, r1, ra_re  # ir = a.im*b.re
nop;            fmul r0, r0, rb_im  # ri = a.re*b.im
fadd r1, r0, r1                     # p.im = ri + ir
fsub r0, r2, r3                     # p.re = rr - ii
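The same product decomposition is easy to mirror in C (a sketch; names are illustrative):

```c
#include <assert.h>

typedef struct { float re, im; } cpx;

/* Same decomposition as the QPU sequence above: four multiplies,
   then p.im = ri + ir and p.re = rr - ii. */
static cpx cmul(cpx a, cpx b) {
    float ii = a.im * b.im;
    float rr = a.re * b.re;
    float ir = a.im * b.re;
    float ri = a.re * b.im;
    cpx p = { rr - ii, ri + ir };
    return p;
}
```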
Pass  1  2  3  4 
Codelet  a  b  c  d 
Rotation (Δk)    c.d  d  1 
Interval (points)    b.c.d  c.d  d 
This 3pass FFT4096 flow graph helped me visualize the subtleties:
DIT FFT 4096 (6MB zip of 33MB pdf) or download Butter.zip and run the executable. 
Twiddle factors are applied to data within the codelets at the points indicated on the data flow. Each integer, k, represents a complex multiplication by the twiddle factor W^k = e^{-j2πk/N}.
Ideally, twiddles should be calculated to a higher precision than the data; but the QPU only supports 32-bit single-precision floats. Trigonometric recurrences are used to minimise loss of precision.
Unfortunately, the QPU always rounds down (not to the nearest LSB) and significant rounding errors accumulated, seriously compromising accuracy in releases 1.0 and 2.0 of GPU_FFT. This has been addressed in release 3.0 using a precision enhancement technique based on the work of Dekker [2] and Thall [3] which corrects rounding errors, improving accuracy by almost three orders of magnitude in the longer transforms!
Thall and Dekker use the "unevaluated sum" of two single-precision floats to represent one double-precision quantity and provide primitives for addition, subtraction and multiplication. To avoid a performance hit, only the cosθ and sinθ terms are maintained at higher precision; and only the final two subtractions of the recurrence are evaluated using Thall's primitives. Two QPU instructions per iteration became twenty-six.
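A flavour of the technique in C (a sketch built on Knuth's exact two_sum; the actual QPU df64_sub32 instruction sequence differs, and this assumes the compiler does not reorder floating-point operations, i.e. no -ffast-math):

```c
#include <assert.h>

/* "Unevaluated sum" hi + lo of two single-precision floats,
   together representing one higher-precision value. */
typedef struct { float hi, lo; } df64;

/* Knuth's exact addition: s + e == a + b exactly in real arithmetic. */
static df64 two_sum(float a, float b) {
    float s = a + b;
    float v = s - a;
    float e = (a - (s - v)) + (b - v);
    return (df64){ s, e };
}

/* Subtract a single float from a df64 value, in the spirit of
   GPU_FFT's df64_sub32 primitive. */
static df64 df64_sub32(df64 a, float b) {
    df64 t  = two_sum(a.hi, -b);
    float lo = t.lo + a.lo;
    return two_sum(t.hi, lo);
}
```

The low-order term captures rounding error that a single float subtraction would silently discard; here, subtracting 1 from the pair {1, 1e-9} recovers the tiny residual exactly.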
Real and imaginary parts are stored at offsets from ra_tw_re and rb_tw_im respectively:
Codelet  Offset  Description 
FFT16  TW16_Pn_STEP  perstep α, β rotation for nth pass 
FFT32  TW32_Pn_STEP  
FFT64  TW48_Pn_STEP  
TW64_Pn_STEP  
FFT16  TW16+0  1st stage unpacked twiddles 
TW16+1  2nd stage unpacked twiddles  
TW16+2  3rd stage unpacked twiddles  
TW16+3, 4  highprecision cosθ, sinθ  
FFT32  TW32+0, 1  
FFT64  TW48+0, 1  
TW64+0, 1 
STEP values and starting values for cosθ, sinθ are loaded from memory addresses passed in through uniforms. Each of the 8 QPUs has unique (rx_tw_unique) starting values in the final pass. Other values are common (rx_tw_shared) to all shaders.
Twiddle management macros:
load_tw  Load α, β and starting cosθ, sinθ 
body_fft_16  FFT16 codelet 
fft_twiddles_32  FFT32 codelet helper 
unpack_twiddles  Updates TW16 
rotate  rotate one step 
next_twiddles_16  
next_twiddles_32  
df64_sub32  Enhancedprecision subtract 
Using TW#i notation to denote the i^{th} element of a vector, twiddles are packed in the TW16 and TW32 registers:
Element #0 of the TW16 registers is not used. Twiddles are "unpacked" to suit the 4 stages of the FFT16 codelet and conditional flags control which elements are multiplied at each stage:
.macro body_fft_16
    .rep i, 4
        and.setf -, elem_num, (1<<i)
TW16+0  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1 
TW16+1  TW16#0  TW16#1  TW16#2  TW16#3  TW16#0  TW16#1  TW16#2  TW16#3  TW16#0  TW16#1  TW16#2  TW16#3  TW16#0  TW16#1  TW16#2  TW16#3 
TW16+2  TW16#0  TW16#1  TW16#2  TW16#3  TW16#4  TW16#5  TW16#6  TW16#7  TW16#0  TW16#1  TW16#2  TW16#3  TW16#4  TW16#5  TW16#6  TW16#7 
TW16+3  TW16#0  TW16#1  TW16#2  TW16#3  TW16#4  TW16#5  TW16#6  TW16#7  TW16#8  TW16#9  TW16#10  TW16#11  TW16#12  TW16#13  TW16#14  TW16#15 
TW32  TW32#0  TW32#1  TW32#2  TW32#3  TW32#4  TW32#5  TW32#6  TW32#7  TW32#8  TW32#9  TW32#10  TW32#11  TW32#12  TW32#13  TW32#14  TW32#15 
read_rev  Queue TMU request for bitreverse or linear (sequential) ordinal read 
read_lin  
load_rev  Call read_* to queue another request and wait for response from previous 
load_lin  
interleave  Swap request/response elements to improve cache efficiency 
swizzle  
bit_rev  Bitreversal primitive 
At least one request is always queued to avoid the performance penalty of TMU pipeline stalls. One too many requests are issued and one response must be thrown away after the loop. Pseudocode:
#######################################
# Pass 1
read_rev
foreach step {
    load_rev
    ...
}
ldtmu0  # Wait for last (unwanted) response
ldtmu0
#######################################
# Pass 2
read_lin
foreach step {
    load_lin
    ...
}
ldtmu0  # Wait for last (unwanted) response
ldtmu0
Macro  Caller  Description 
inst_vpm  Both  Setup VPM column write commands for this QPU instance (0...7) 
write_vpm_16  Write codelet output to VPM columns. FFT_32 and FFT_64 final butterfly stage done here  
write_vpm_32  
write_vpm_64  
body_ra_save_16  Master  write_vpm_*, sync with slaves and queue VDW write 
body_ra_save_32  
body_ra_save_64  
body_ra_sync  Sync with slaves  
body_rx_save_slave_16  Slaves  write_vpm_* and sync with master 
body_rx_save_slave_32  
body_rx_save_slave_64  
body_rx_sync_slave  Sync with master 
FFT_16 and FFT_32 codelet VPM writes are doublebuffered, allowing the slaves to proceed into the next step without waiting for the VDW write. FFT_16 alternates between the first 16 and second 16 VPM rows. FFT_32 alternates between the first 32 and second 32 VPM rows. FFT_64 is not doublebuffered because user shaders can only use the first 64 rows. Note how the order of semaphore instructions in the FFT_64 body_*_save_* macros differs from the others. Notice also the use of VPM readbacks to ensure preceding writes have completed; and the postponement of vw_wait until the last moment.
unsigned qpu_enable (int file_desc, unsigned enable);
unsigned execute_qpu(int file_desc, unsigned num_qpus, unsigned control,
                     unsigned noflush, unsigned timeout);

The first enables or disables the GRAFX (V3D) power domain and opens a V3D driver handle for running user shaders. The second executes one or more shaders on the QPUs.
Unfortunately, calling through the mailbox involves a kernel context switch, incurring a ~100µs performance penalty. This is not an issue for qpu_enable(), which need only be called twice; however, it is a significant overhead on execute_qpu() when computing short transforms. This is why GPU_FFT 1.0 allowed batches of transforms to be executed with a single mailbox call. GPU_FFT 3.0 still supports batch mode; but, when the expected shader runtime is below a #defined threshold, it avoids the mailbox overhead by poking V3D hardware registers directly from the ARM side and busy-waiting for shader completion.
V3D hardware registers are documented in the Videocore IV V3D Architecture Specification.
Currently, programs which use the mailbox must run as root; however, Marcel Müller is developing vcio2, a kernel driver to access the Raspberry Pi VideoCore IV GPU without root privileges.
The mailbox special character device file has moved from ./char_dev to /dev/vcio in the latest release.
Stockham autosort FFT algorithms access data in "natural" order, avoiding the bit-reversal or "shuffle" step required by Cooley-Tukey. Radix-2 Stockham algorithms perform N-point transforms in log_{2}N passes, effectively shuffling indices one bit per pass as they go. The number of passes can be reduced with higher-radix codelets. Based on work published by Microsoft Research [1], a mixed-radix Stockham routine with twiddle lookup was prototyped, first in "C" and then in QPU assembler.
The above experiments produced modest accuracy improvements; but throughput was disappointing and performance counters implicated TMU cache aliasing.
Copyright © Andrew Holme, 2014. 