GPU_FFT release 2.0 is a Fast Fourier Transform library for the Raspberry Pi which exploits the BCM2835 SoC GPU hardware to deliver ten times more data throughput than is possible on the 700 MHz ARM. Kernels are provided for all power-of-2 FFT lengths between 256 and 2,097,152 points inclusive. GPU_FFT release 1.0 supported up to 131,072 points.
Raspberry Pi Foundation CEO Eben Upton attended the Radio Society of Great Britain (RSGB) 2013 Annual Convention, where radio amateurs told him that the FFT performance of the 700 MHz ARM limited the Pi's usefulness in Software Defined Radio (SDR) applications. That was the impetus which led me to develop GPU_FFT; and I wish to thank Eben Upton, Dom Cobley, Tim Mamtora, Gary Keall and Dave Emett for helping.
GPU_FFT runs on the quad processing units (QPUs) in the BCM2835 V3D block. There is a huge amount of interest in general-purpose programming of graphics hardware; and, in February 2014, Broadcom gave the Pi community a fantastic present by publishing the Videocore IV V3D Architecture Specification. Now, anybody can develop code for the GPU and some pioneers already have. VC4ASM is a full-featured macro assembler by Marcel Müller.
cd /opt/vc/src/hello_pi/hello_fft
make
sudo mknod char_dev c 100 0
sudo ./hello_fft.bin 12
These files contain all the information you need to use GPU_FFT:
hello_fft.c     Demo source code
hello_fft_2d.c  2D demo source code
gpu_fft.txt     User Guide
The hello_fft demo isn't very spectacular. It just computes some performance figures. The newly released hello_fft_2d demo is a bit more visual:
make hello_fft_2d.bin
sudo ./hello_fft_2d.bin
It generated the above image. Now back to how GPU_FFT works ...
Decimation in time  Decimation in frequency  

Say we could accommodate four complex numbers in registers. Looking at the above DIT data flow graph, if we loaded x(0), x(4), x(2) and x(6) from memory, we could compute the first two butterfly stages in registers; temporarily save the result to memory; and do the same for the odd-numbered ordinals 1, 5, 3 and 7; but then we hit a problem. The inputs to the third butterfly stage do not fit in registers, and it gets worse with more stages. We will always run out of registers eventually, even if we have quite a lot, as the BCM2835 QPU does.
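The register-blocking idea can be sketched in a few lines of Python (an illustration of the principle with hypothetical helper names, not GPU_FFT code): the first two butterfly stages of an 8-point DIT transform fit in two groups of four "registers", but every output of the third stage needs inputs from both groups.

```python
import cmath

def dft(x):
    # Direct O(N^2) DFT for reference
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def two_stages(a0, a1, a2, a3):
    # First two DIT butterfly stages on four values held "in registers";
    # inputs arrive in bit-reversed order, outputs are a 4-point FFT
    b0, b1 = a0 + a1, a0 - a1
    b2, b3 = a2 + a3, a2 - a3
    return [b0 + b2, b1 - 1j * b3, b0 - b2, b1 + 1j * b3]

def fft8(x):
    # Group 1: even ordinals loaded as 0, 4, 2, 6
    A = two_stages(x[0], x[4], x[2], x[6])
    # Group 2: odd ordinals 1, 5, 3, 7 (after "saving" group 1 to memory)
    B = two_stages(x[1], x[5], x[3], x[7])
    # Third stage: every output depends on BOTH groups, so the working
    # set no longer fits in one register group
    W = [cmath.exp(-2j * cmath.pi * k / 8) for k in range(4)]
    return ([A[k] + W[k] * B[k] for k in range(4)] +
            [A[k] - W[k] * B[k] for k in range(4)])

x = [complex(n * n, -n) for n in range(8)]
assert all(abs(p - q) < 1e-9 for p, q in zip(fft8(x), dft(x)))
```

GPU_FFT applies the same blocking at a larger scale, with FFT_16 codelets in place of the 4-value groups.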
DIT FFT 256 classic DIT FFT 256 practical 
The classic data flow graph illustrates the above-mentioned implementation challenge: the number of inputs to the butterflies doubles at each stage. In the practical flow, extra "twists" (reorderings) have been inserted after the fourth and final stages; no butterfly is then a function of more than sixteen inputs; all computation can be performed in registers using an FFT_16 codelet; and SDRAM is only read/written twice. The extra twists are bit-rotations, not reversals. Ordinals are circularly shifted by four. Doing so twice in an eight-stage transform leaves outputs in the correct positions. Butterfly wing span doubles from 16 to 32 between the 4th and 5th stages of the classic flow. Rotating ordinals by 4 bits drops wing span back down to 2 again in the practical flow.
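The twist is a cheap bit operation. A Python sketch (hypothetical helper names): circularly rotating an 8-bit ordinal by four bit positions, applied twice across the eight-stage 256-point transform, is the identity; unlike bit-reversal, it also maps consecutive ordinals to a constant stride.

```python
def rot4(n, bits=8):
    # Circularly rotate a bits-wide ordinal by four bit positions
    return ((n >> 4) | (n << (bits - 4))) & ((1 << bits) - 1)

def bit_rev(n, bits=8):
    # Full bit-reversal, for comparison
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r

# Two rotations in an eight-stage (256-point) transform are the identity,
# so outputs end up in their natural positions
assert all(rot4(rot4(n)) == n for n in range(256))

# A run of consecutive ordinals maps to a constant stride of 16,
# which suits strided VDW row writes; bit-reversal scatters them
assert [rot4(n) for n in range(16)] == [16 * n for n in range(16)]
```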
The above PDF files were generated by a C program, Butter, which was written to visualize how twiddles change as we step through the data. You will also need libHaru to build it, or, you can run the zipped executable if you trust me. I promise it is safe. The zipped code generates a 4096-point "practical" flow graph; but the #defines can be changed for other lengths.
GPU_FFT takes the high road because the VPM/VDW are good at bit-rotated writes and the same write code can then be used for both passes. But why DIT? Why not DIF? Because a bit-reversed write would be slow.
End-to-end data path through the V3D hardware, from SDRAM through the TMU, QPUs, VPM and VDW, and back to SDRAM:
The TMU (texture & memory lookup unit) is good at fetching contiguous or scattered data from SDRAM, although it was not specifically optimised for bit-reversed reads. The TMU cache is directly-mapped, aliasing on 4k boundaries. Unfortunately, bit-reversal of consecutive ordinals makes upper address bits toggle, causing collisions and evictions! Use of the VC4 L2 cache is optional and must be avoided (see the GPU_FFT_MEM_* #defines) for GPU_FFT and OpenGL to coexist. It is safe for VDW output to bypass the V3D internal caches, since they are smaller than the ping-pong data buffers.
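The collision effect is easy to demonstrate in Python (illustrative; a 4096-point transform and 8-byte interleaved complex floats are assumed): the first several bit-reversed read addresses all land on the same 4k alias, so each fetch evicts the line brought in by the previous one.

```python
def bit_rev(n, bits):
    # Reverse the low `bits` bits of n
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)
        n >>= 1
    return r

N_BITS = 12          # assumed: 4096-point transform
CPLX_BYTES = 8       # assumed: interleaved single-precision re/im pair
ALIAS = 4096         # direct-mapped TMU cache aliases on 4k boundaries

# Byte offsets of the first eight bit-reversed reads
addrs = [bit_rev(n, N_BITS) * CPLX_BYTES for n in range(8)]

# All eight map to the same cache alias: each fetch evicts the last
assert len({a % ALIAS for a in addrs}) == 1
```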
BCM2835 V3D has 12 QPUs; but GPU_FFT only uses 8. It happens that bit-rotated writes can be done very neatly, through the VPM and VDW, by 8 QPUs working in parallel. One is the master and 7 are slaves, synchronised using semaphores. QPUs can access the VPM (vertex pipeline memory) either horizontally or vertically. Each QPU outputs real and imaginary vectors to VPM columns. The master then writes VPM rows to SDRAM through the VDW as a series of bursts, with a stride between rows:
              QPU_0                 QPU_1        . . .       QPU_6                  QPU_7
re_{i+0}  im_{i+0}    re_{i+16} im_{i+16}   . . .   re_{i+96}  im_{i+96}    re_{i+112} im_{i+112}   base
re_{i+1}  im_{i+1}    re_{i+17} im_{i+17}   . . .   re_{i+97}  im_{i+97}    re_{i+113} im_{i+113}   base + stride
re_{i+2}  im_{i+2}    re_{i+18} im_{i+18}   . . .   re_{i+98}  im_{i+98}    re_{i+114} im_{i+114}   base + stride*2
re_{i+3}  im_{i+3}    re_{i+19} im_{i+19}   . . .   re_{i+99}  im_{i+99}    re_{i+115} im_{i+115}   base + stride*3
    .         .           .         .        VPM        .          .             .          .           .
re_{i+12} im_{i+12}   re_{i+28} im_{i+28}   . . .   re_{i+108} im_{i+108}   re_{i+124} im_{i+124}   base + stride*12
re_{i+13} im_{i+13}   re_{i+29} im_{i+29}   . . .   re_{i+109} im_{i+109}   re_{i+125} im_{i+125}   base + stride*13
re_{i+14} im_{i+14}   re_{i+30} im_{i+30}   . . .   re_{i+110} im_{i+110}   re_{i+126} im_{i+126}   base + stride*14
re_{i+15} im_{i+15}   re_{i+31} im_{i+31}   . . .   re_{i+111} im_{i+111}   re_{i+127} im_{i+127}   base + stride*15
The VDW write setup STRIDE is specified as a 13-bit field on page 59 of the Videocore IV V3D Architecture Specification, but luckily the hardware actually supports 16 bits. The value specified in the setup is 64 bytes less than the true stride shown above because the internal address register to which it is added has already advanced 16*4 bytes.
N  256  512  1024  2048  4096  8192  16384  32768  65536  128K  256K †  512K †  1M †  2M † 
log_{2}N  8  9  10  11  12  13  14  15  16  17  18  19  20  21 
Passes  2  2  2  2  3  3  3  3  3  4  4  4  4  4 
Structure *  4+4  5+4  5+5  6+5  4+4+4  5+4+4  5+5+4  5+5+5  6+5+5  5+4+4+4  5+4+4+5  5+4+5+5  5+5+5+5  6+5+5+5 
Minimising the number of passes minimises SDRAM access. So why not use even larger codelets? Short answer: there are not enough registers. For the same reason, the FFT_64 codelet can only be used on the first pass, because its twiddles can't be rotated.
log_{2}N=18 and longer transforms require VDW write strides exceeding 65535 bytes and must therefore issue multiple single-row VDW write commands.
5+4+4+5 has lower rootmeansquare (rms) error than 5+5+4+4.
#
# complex p(r0, r1) = complex a(r0, r1) * complex b(ra_re, rb_im)
#
nop;  fmul r3, r1, rb_im    # ii = a.im*b.im
nop;  fmul r2, r0, ra_re    # rr = a.re*b.re
nop;  fmul r1, r1, ra_re    # ir = a.im*b.re
nop;  fmul r0, r0, rb_im    # ri = a.re*b.im
fadd  r1, r0, r1            # p.im = ri + ir
fsub  r0, r2, r3            # p.re = rr - ii
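The dataflow of this 4-multiply, add-and-subtract product can be checked against an ordinary complex multiply; a quick Python sanity check (the values are arbitrary):

```python
a, b = complex(1.5, -2.0), complex(0.25, 3.0)

# Mirror the four fmul results ...
ii = a.imag * b.imag   # r3
rr = a.real * b.real   # r2
ir = a.imag * b.real   # r1
ri = a.real * b.imag   # r0

# ... and the fsub/fadd combinations
p = complex(rr - ii, ri + ir)

assert p == a * b
```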
Pass               1     2      3     4
Codelet            a     b      c     d
Rotation (Δk)      —    c.d     d     1
Interval (points)  —   b.c.d   c.d    d
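Reading a, b, c, d as the point counts of the four per-pass codelets, the two table rows follow a simple product rule; a hypothetical Python helper (not GPU_FFT code):

```python
from math import prod

def twiddle_schedule(sizes):
    # sizes: point count of the codelet used on each pass, first to last.
    # Pass 1 twiddles are fixed; each later pass p rotates its twiddles
    # by delta-k after every `interval` points.
    P = len(sizes)
    rotation = {p: prod(sizes[p:]) for p in range(2, P + 1)}
    interval = {p: prod(sizes[p - 1:]) for p in range(2, P + 1)}
    return rotation, interval

a, b, c, d = 16, 16, 16, 32          # example sizes, chosen arbitrarily
rotation, interval = twiddle_schedule((a, b, c, d))
assert rotation == {2: c * d, 3: d, 4: 1}
assert interval == {2: b * c * d, 3: c * d, 4: d}
```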
This 3-pass FFT 4096 flow graph helped me visualize the subtleties:
DIT FFT 4096 (6MB zip of 33MB pdf) or download Butter.zip and run the executable. 
Twiddle factors are applied to data within the codelets at the points indicated on the data flow. Each integer, k, represents a complex multiplication by the twiddle factor e^{-j2πk/N}, where N is the transform length.
Ideally, twiddles should be calculated to a higher precision than the data; but the QPU only supports 32-bit single-precision floats. Trigonometric recurrences are used to minimise loss of precision.
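One standard recurrence of this kind steps the angle by δ using α = 2·sin²(δ/2) and β = sin δ, so that cos(θ+δ) = cos θ − (α·cos θ + β·sin θ) and sin(θ+δ) = sin θ − (α·sin θ − β·cos θ); computing α this way avoids the cancellation in 1 − cos δ for small δ. Whether GPU_FFT uses exactly this form is best checked in the source, but the idea can be demonstrated in Python:

```python
import math

delta = 2 * math.pi / 4096
alpha = 2 * math.sin(delta / 2) ** 2     # stable form of 1 - cos(delta)
beta = math.sin(delta)

c, s = 1.0, 0.0                          # cos(0), sin(0)
for _ in range(1024):                    # advance a quarter turn in 1024 steps
    c, s = c - (alpha * c + beta * s), s - (alpha * s - beta * c)

# After 1024 steps the angle is pi/2: cos -> 0, sin -> 1
assert abs(c) < 1e-9 and abs(s - 1.0) < 1e-9
```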
Real and imaginary parts are stored at offsets from ra_tw_re and rb_tw_im respectively:
Offset  Codelet  Description 
TW16_Pn_BASE  FFT16  starting twiddles for nth pass 
TW32_Pn_BASE  FFT32  
TW64_P1_BASE0  FFT64  
TW64_P1_BASE1  
TW16_Pn_STEP  FFT16  per step rotation for nth pass 
TW32_Pn_STEP  FFT32  
TW16_ACTIVE+0  FFT16  1st stage unpacked twiddles 
TW16_ACTIVE+1  2nd stage unpacked twiddles  
TW16_ACTIVE+2  3rd stage unpacked twiddles  
TW16_ACTIVE+3  current twiddles  
TW32_ACTIVE  FFT32 
ACTIVE values are dynamically generated. BASE and STEP values are passed in through uniforms. Each of the 8 QPUs has different (TW_UNIQUE) BASE twiddles in the final pass. BASE values for earlier passes and all STEP values are common (TW_SHARED). Currently, all uniforms are copied to registers before pass 1. Probably, they should be loaded one pass at a time.
Twiddle management macros:
load_tw  Load BASE and STEP from uniforms 
body_fft_16  FFT16 codelet 
fft_twiddles_32  FFT32 codelet helper 
unpack_twiddles  updates TW16_ACTIVE 
next_twiddles_16  rotate one step 
next_twiddles_32 
Using TW#i notation to denote the i-th element of a vector, twiddles are packed in the TW16 and TW32 registers:
Element #0 of the TW16 registers is not used. Twiddles are "unpacked" to suit the 4 stages of the FFT16 codelet and conditional flags control which elements are multiplied at each stage:
.macro body_fft_16
    .rep i, 4
        and.setf -, elem_num, (1<<i)
TW16_ACTIVE+0  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1  TW16#0  TW16#1 
TW16_ACTIVE+1  TW16#0  TW16#1  TW16#2  TW16#3  TW16#0  TW16#1  TW16#2  TW16#3  TW16#0  TW16#1  TW16#2  TW16#3  TW16#0  TW16#1  TW16#2  TW16#3 
TW16_ACTIVE+2  TW16#0  TW16#1  TW16#2  TW16#3  TW16#4  TW16#5  TW16#6  TW16#7  TW16#0  TW16#1  TW16#2  TW16#3  TW16#4  TW16#5  TW16#6  TW16#7 
TW16_ACTIVE+3  TW16#0  TW16#1  TW16#2  TW16#3  TW16#4  TW16#5  TW16#6  TW16#7  TW16#8  TW16#9  TW16#10  TW16#11  TW16#12  TW16#13  TW16#14  TW16#15 
TW32_ACTIVE  TW32#0  TW32#1  TW32#2  TW32#3  TW32#4  TW32#5  TW32#6  TW32#7  TW32#8  TW32#9  TW32#10  TW32#11  TW32#12  TW32#13  TW32#14  TW32#15 
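The unpack pattern in the table and the flag test in body_fft_16 can be modelled together (illustrative Python, not GPU_FFT code): at stage i, element e of the vector is multiplied by its twiddle iff bit i of elem_num is set, and the active register holds TW16#(e mod 2^(i+1)).

```python
def unpack(stage):
    # Element e of TW16_ACTIVE+stage holds TW16#(e mod 2^(stage+1))
    return [e % (1 << (stage + 1)) for e in range(16)]

# Matches the four TW16_ACTIVE rows in the table above
assert unpack(0) == [0, 1] * 8
assert unpack(1) == [0, 1, 2, 3] * 4
assert unpack(2) == list(range(8)) * 2
assert unpack(3) == list(range(16))

# and.setf -, elem_num, (1<<i): at stage i, element e is multiplied
# by its twiddle iff bit i of e is set -- half the elements per stage
for i in range(4):
    assert sum(1 for e in range(16) if e & (1 << i)) == 8
```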
read_rev  Queue TMU request for bit-reversed or linear (sequential) ordinal read 
read_lin  
load_rev  Call read_* to queue another request and wait for response from previous 
load_lin  
interleave  Swap request/response elements to improve cache efficiency 
swizzle  
bit_rev  Bit-reversal primitive 
At least one request is always queued to avoid the performance penalty of TMU pipeline stalls. One too many requests are issued and one response must be thrown away after the loop. Pseudocode:
#######################################
# Pass 1
read_rev
foreach step {
    load_rev
    ...
}
ldtmu0    # Wait for last (unwanted) response
ldtmu0

#######################################
# Pass 2
read_lin
foreach step {
    load_lin
    ...
}
ldtmu0    # Wait for last (unwanted) response
ldtmu0
Macro  Caller  Description 
inst_vpm  Both  Setup VPM column write commands for this QPU instance (0...7) 
write_vpm_16  Write codelet output to VPM columns. FFT_32 and FFT_64 final butterfly stage is done here  
write_vpm_32  
write_vpm_64  
body_ra_save_16  Master  write_vpm_*, sync with slaves and queue VDW write 
body_ra_save_32  
body_ra_save_64  
body_ra_sync  Sync with slaves  
body_rx_save_slave_16  Slaves  write_vpm_* and sync with master 
body_rx_save_slave_32  
body_rx_save_slave_64  
body_rx_sync_slave  Sync with master 
FFT_16 and FFT_32 codelet VPM writes are double-buffered, allowing the slaves to proceed into the next step without waiting for the VDW write. FFT_16 alternates between the first 16 and second 16 VPM rows. FFT_32 alternates between the first 32 and second 32 VPM rows. FFT_64 is not double-buffered because user shaders can only use the first 64 rows. Note how the order of semaphore instructions in the FFT_64 body_*_save_* macros differs from the others. Notice also the use of VPM readbacks to ensure preceding writes have completed; and the postponement of vw_wait until the last moment.
unsigned qpu_enable (int file_desc, unsigned enable);
unsigned execute_qpu(int file_desc, unsigned num_qpus, unsigned control,
                     unsigned noflush, unsigned timeout);

The first enables or disables the GRAFX (V3D) power domain and opens a V3D driver handle for running user shaders. The second executes one or more shaders on the QPUs.
Unfortunately, calling through the mailbox involves a kernel context switch, incurring a ~100µs performance penalty. This is not an issue for qpu_enable(), which need only be called twice; however, it is a significant overhead on execute_qpu() when computing short transforms. This is why GPU_FFT 1.0 allowed batches of transforms to be executed with a single mailbox call. GPU_FFT 2.0 still supports batch mode; but, when the expected shader runtime is below a #defined threshold, it avoids the mailbox overhead by poking V3D hardware registers directly from the ARM side and busy-waiting for shader completion.
V3D hardware registers are documented in the Videocore IV V3D Architecture Specification.
Copyright © Andrew Holme, 2014. 