The first thing I learned about teaching myself the FFT algorithm is how poorly many Internet resources describe the implementation! This blog is not intended to describe the FFT itself but an implementation in VHDL that scales using generics and without software tools that generate IP cores. It is assumed that you already understand the FFT algorithm in principle or at least have access to decent educational resources. Here I will write a working simulatable and hopefully synthesisable VHDL Core configured through generics for the popular Cooley-Tukey algorithm using decimation in time (DIT). The desire is to explore how different radices are implemented, and even generalise the VHDL implementation to a 'Radix-n' where 'n' is a generic to the entity.
- Aims
- Scope
- Plan
- Practical Considerations
- Testing
- Construction
- Fourier Matrices
- Radix-2 FFT Butterflies
- Radix-2 VHDL Code
- 'real' Package
- Function 'bit_reverse'
- Function 'array_reverse'
- Function 'init_twiddles_half'
- Function 'twid_mod'
- Function 'twiddle_power'
- Function 'operation'
- Function 'opt_pwr_arr'
- Function 'part_pwr_arr'
- Entity Specification
- Radix-2 Architecture
- End of Part 1
- Postscript
- References
Aims
The main aim above all else is to learn how to implement the Fast Fourier Transform algorithm. Anything else is a bonus.
The design:
- will implement a fully parallel FFT such that it does not need to worry about storage of intermediate products in RAM, and organising the layout of those intermediate products for efficient retrieval;
- will ignore practical device size constraints, e.g. the number of available multipliers, and concentrate on implementing the algorithm;
- should be instrumented to report arithmetic usage (multipliers and adders) for a given number of inputs and radix at simulation time, these results will then be compared to theory to verify correct construction of \(n\log(n)\) structures;
- should be sensibly pipelined, e.g. after each stage of arithmetic, with future pointers for more optimal pipelining being noted;
- should cater for any radix the user cares to specify so that size comparisons may be explored between different radix options.
Unlike with previous recursive implementations, the designer will not be able to specify the delay through the instantiation in order to be able to match the delay in any parallel logic. For now it will be enough just to be able to realise the structure at all. The number of inputs will be constrained to a power of 2 since we are using the Cooley-Tukey algorithm.
Given the state of online literature that often contradicts, or assumes e.g. decimation in frequency or time without stating, the biggest challenge is going to be deciphering how the second and subsequent stages of FFT application should be derived, i.e. the larger butterflies, especially for the larger radices.
Scope
I want to create a mathematically and structurally correct solution, which is already ambitious in terms of aiming for a multiple-radix implementation, other opportunities will be curtailed. As an example, it is essential that a 32-point radix-4 design can finish off the final stage of processing with a radix-2 design (since 32 is not a power of 4). So the implementation must be mixed-radix only to the extent necessary. The following image shows the limits of what will be attempted and defines the scope of the implementation.

I'm not sure if the term "mixed radix" is intended solely to describe the more arbitrary mixing in the bottom left figure, or if it is also used to describe the simpler application to entire layers. I shall not attempt the more arbitrary mixing since that needs some sort of topology prescription or a self optimising function to select which variety to use, and structurally it is far too involved! The only mixing that will be used is to select the radix for the final stage instantiated to use up the inputs that do not neatly fit into a power of the intended radix size.
Plan
There is an element of 'boot strapping' oneself up in knowledge here. If the radix-2 implementation can be understood, it can be used to extrapolate to the radix-4 and even provide comparisons every two stages of radix-2 against a single stage of radix-4. There is also a simplification step that can use the unsynthesisable real before moving to sfixed with fixed point rounding errors and bit slicing or arithmetic outputs to contend with. The steps used to construct the final solution are illustrated below.

With each increment in radix comes a complication in constructing the butterflies. The least well explained butterflies are the second and subsequent stages. The later stages of radix-4 were challenging and increasingly complicated with the move to radix-8. Generalisation for radix-n is possible only by using nested loops that ultimately obfuscated their derivation. Hence the source code provided retains the fixed radix implementations as non-optimal (non-elegant) working versions that retain some explanation as to how the radix-n implementations were derived.
Practical Considerations
Testing
Repeatable pass-fail tests were essential to prevent small changes escalating into a total failure. Initial known good results with worked explanations could be found on YouTube, but necessarily they only covered 8-point radix-2 implementations. Two videos I used for test data are listed here:
- An example on DIT-FFT of an 8-point sequence
- 8 Point DIT (Decimation In Time) FFT FlowGraph (Problems) - Discrete Time Signal Processing
For more extensive testing on larger FFT point sizes I generated test data using GNU Octave. The results were quickly too cumbersome to transcribe into VHDL data types, and the so Octave source file includes a means to write out the input and output matrices as a VHDL package containing the test data to be used. In the final test benches, multiple devices under test (DUTs) are instantiated each with a different generic configuration, and using the test data appropriate for the point size.

x4=[
1, -1, 2, 1;
1+i, 0, -1, 2;
0, 5-i, 2, -2;
0, 0, 1, -1
];
% Initial 8-point Radix-2 Decimation in time FFT test data
% x8(:,1) comes from the worked example at https://www.youtube.com/watch?v=AF71Yqo7CoY
% x8(:,2) comes from the worked example at https://www.youtube.com/watch?v=xnVaHkRaJOw
x8=[
1, -1, 2, 1;
1, 0, -1, 2;
1, 2, 5, 3;
0, 0, 1, 4;
0, -4, 5, -4;
0, 0, -1, -3;
0, 2, 2, -2;
0, 0, 1, -1
];
x16=[
1, -1, 2, 1+i;
1, 0, -1, 2;
1, 2, 5, 3;
0, 0, 1, 4-i;
0, -4, 5, 5;
0, 0, -1, 6;
0, 2, 2, 7-i;
0, 0, 1, 8;
1, -1, 2, -8;
1, 0, -1, -7;
1, 2, 5, -6+i;
0, 0, 1, -5;
0, -4, 5, -4;
0, 0, -1, -3+i;
0, 2, 2, -2;
0, 0, 1, -1
];
x32=[
1, -1;
1, 0;
1, 2;
0, 0;
0, -8;
0, 0;
0, 2;
0, 0;
1, -1;
1, 0;
1, 2;
0, 0;
0, -4;
0, 0;
0, 2;
0, 0;
1, -1;
1, 0;
1, 2;
0, 5;
0, -4;
0, -1;
0, 2;
0, 0;
1, -1;
1, 1;
1, 2;
0, 0;
0, -4;
0, 2;
0, 2;
0, 0
];
% Now try superimposing two tones of different frequency and amplitude
t = 0:1:511;
f1 = sin(0.2*t);
f2 = sin(0.1+0.3*t);
% Transpose the summation matrix from rows to columns
x512=(f1 + (0.5 * f2))';
% Uncomment to see the FFT of the superimposed pair of sine waves.
%plot(t, abs(fft(x512)))
function head(fid, package_name)
fprintf(fid, '-------------------------------------------------------------------------------------\n');
fprintf(fid, '--\n');
fprintf(fid, '-- Distributed under MIT Licence\n');
fprintf(fid, '-- See https://github.com/philipabbey/fpga/blob/main/LICENCE.\n');
fprintf(fid, '--\n');
fprintf(fid, '-------------------------------------------------------------------------------------\n');
fprintf(fid, '--\n');
fprintf(fid, '-- %s.vhdl\n', package_name);
fprintf(fid, '-- This file of test data was generated by Octave.\n');
fprintf(fid, '--\n');
fprintf(fid, '-- References:\n');
fprintf(fid, '-- 1) https://www.youtube.com/watch?v=AF71Yqo7CoY\n');
fprintf(fid, '-- 2) https://www.youtube.com/watch?v=xnVaHkRaJOw\n');
fprintf(fid, '--\n');
fprintf(fid, '-- P A Abbey, 2 Sep 2021\n');
fprintf(fid, '--\n');
fprintf(fid, '-------------------------------------------------------------------------------------\n');
fprintf(fid, '\n');
fprintf(fid, 'use work.test_fft_pkg.complex_vector_arr_t;\n');
fprintf(fid, '\n');
fprintf(fid, 'package %s is\n', package_name);
fprintf(fid, '\n');
end
function vhdl_test_data(fid, data, const_str)
points=size(data)(1); % Rows, n-opint FFT inputs, i.e. time series
sets=size(data)(2); % Columns, array of inputs
col=1;
fprintf(fid, ' constant %s : complex_vector_arr_t(open)(0 to %d) := (\n', const_str, points-1);
for col=1:sets
fprintf(fid, ' %d => (\n', col);
for row=1:points
% Final comma must be omitted for VHDL array
if (row < points)
comma=",";
else
comma="";
endif
z=data(row, col);
fprintf(fid, ' (re => %11.6f, im => %11.6f)%s\n', real(z), imag(z), comma);
end
if (col < sets)
fprintf(fid, ' ),\n');
else
fprintf(fid, ' )\n');
fprintf(fid, ' );\n');
end
end
fprintf(fid, '\n');
end
function tail(fid, package_name)
fprintf(fid, ' function test_data_inputs(p : positive) return complex_vector_arr_t;\n');
fprintf(fid, '\n');
fprintf(fid, ' function test_data_outputs(p : positive) return complex_vector_arr_t;\n');
fprintf(fid, '\n');
fprintf(fid, 'end package;\n');
fprintf(fid, '\n\n');
fprintf(fid, 'package body %s is\n', package_name);
fprintf(fid, '\n');
fprintf(fid, ' function test_data_inputs(p : positive) return complex_vector_arr_t is\n');
fprintf(fid, ' begin\n');
fprintf(fid, ' case p is\n');
fprintf(fid, ' when 4 => return work.test_data_fft_pkg.input_data_point4_c;\n');
fprintf(fid, ' when 8 => return work.test_data_fft_pkg.input_data_point8_c;\n');
fprintf(fid, ' when 16 => return work.test_data_fft_pkg.input_data_point16_c;\n');
fprintf(fid, ' when 32 => return work.test_data_fft_pkg.input_data_point32_c;\n');
fprintf(fid, ' when 512 => return work.test_data_fft_pkg.input_data_point512_c;\n');
fprintf(fid, ' when others =>\n');
fprintf(fid, " report \"Missing test data for \" & integer\'image(p) severity error;\n");
fprintf(fid, ' -- A nonsense value to keep the compiler quiet about not returning a anything\n');
fprintf(fid, ' return (0 => (0 => (re => 0.0, im => 0.0)));\n');
fprintf(fid, ' end case;\n');
fprintf(fid, ' end function;\n');
fprintf(fid, '\n\n');
fprintf(fid, ' function test_data_outputs(p : positive) return complex_vector_arr_t is\n');
fprintf(fid, ' begin\n');
fprintf(fid, ' case p is\n');
fprintf(fid, ' when 4 => return work.test_data_fft_pkg.output_data_point4_c;\n');
fprintf(fid, ' when 8 => return work.test_data_fft_pkg.output_data_point8_c;\n');
fprintf(fid, ' when 16 => return work.test_data_fft_pkg.output_data_point16_c;\n');
fprintf(fid, ' when 32 => return work.test_data_fft_pkg.output_data_point32_c;\n');
fprintf(fid, ' when 512 => return work.test_data_fft_pkg.output_data_point512_c;\n');
fprintf(fid, ' when others =>\n');
fprintf(fid, " report \"Missing test data for \" & integer\'image(p) severity error;\n");
fprintf(fid, ' -- A nonsense value to keep the compiler quiet about not returning a anything\n');
fprintf(fid, ' return (0 => (0 => (re => 0.0, im => 0.0)));\n');
fprintf(fid, ' end case;\n');
fprintf(fid, ' end function;\n');
fprintf(fid, '\n');
fprintf(fid, 'end package body;\n');
end
% Compose the test data file from multiple sources
package_name="test_data_fft_pkg";
fid = fopen(strcat(package_name, ".vhdl"), 'wt');
head(fid, package_name);
vhdl_test_data(fid, x4, "input_data_point4_c");
vhdl_test_data(fid, fft(x4), "output_data_point4_c");
vhdl_test_data(fid, x8, "input_data_point8_c");
vhdl_test_data(fid, fft(x8), "output_data_point8_c");
vhdl_test_data(fid, x16, "input_data_point16_c");
vhdl_test_data(fid, fft(x16), "output_data_point16_c");
vhdl_test_data(fid, x32, "input_data_point32_c");
vhdl_test_data(fid, fft(x32), "output_data_point32_c");
vhdl_test_data(fid, x512, "input_data_point512_c");
vhdl_test_data(fid, fft(x512), "output_data_point512_c");
tail(fid, package_name);
fclose(fid);
Construction
For each of the designs attempted the following construction was used. Illustrated here for the Radix-2 FFT, and quickly gets more involved for Radix-4+, so you will just have to imagine how it looks when recursing on 4 or 8 nested instantiations for the higher radices.

Note here that as we are using decimation in time, the bit reversal of the inputs is carried out on the left hand side. An additional level of hierarchy (shown in blue) is used specifically to bit reverse the inputs (explained in detail later) so that is taken care of before constructing the recursive hierarchy. It also means that I refer to indices in normal order in lower levels of hierarchy, and hence the use of indices may differ from some descriptions of the implementation found on the Internet. The subsequent levels of hierarchy are shown in orange and each level includes the lowers levels because of the recursion, rather than the instantiations being "siblings". The FFT "butterflies" are performed on the right hand side of the recursion, to combine the results, hence the crossed lines depicting the flow of values or the FFT "flow graph".
The matrices shown are for an n-point FFT performed by a radix-n transform. Their lines associate them with the orange squares and are intended to represent the transform or Fourier Matrix of the whole level of hierarchy independent of how it is actually implemented. The way that the columns and rows of the matrices increment their powers is described next.
Fourier Matrices
To work out how the columns of powers increment I used the Danielson-Lanczos Lemma, but noting the indices to \(x()\) have already been bit reversed so I use the normal order of indices even though the powers remain bit reversed. Here is a worked example for Radix-8 as its easier to see the pattern than with smaller radices.
\[ \begin{align} F(n) = &\, x(0)+\omega^n_2 x(4)+\omega^n_4 x(2) + \omega^n_4 \omega^n_2x(6) + \\ &\, \omega^n_8 x(1) + \omega^n_8\omega^n_2 x(5) + \omega^n_8\omega^n_4 x(3) + \omega^n_8\omega^n_4\omega^n_2 x(7) \\[10pt] =&\, x(0)+\omega^{4n}_8 x(4)+\omega^{2n}_8 x(2) + \omega^{6n}_8 x(6) + \\ &\, \omega^n_8 x(1) + \omega^{5n}_8 x(5) + \omega^{3n}_8 x(3) + \omega^{7n}_8 x(7) \end{align} \]This gives us our Fourier Matrix as follows (see also DFT Matrix on Wikipedia). Note the nomenclature between quoted references differs, but hopefully its obvious that given a twiddle factor \(\omega^k_N\), the columns of powers in the Fourier Matrix increment by \(k\), and the row increments match the order of terms in the Danielson-Lanczos Lemma.
\[ \begin{align} F_8 &= \frac{1}{\sqrt{8}} \begin{pmatrix} \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 \\ \omega_8^0 & \omega_8^4 & \omega_8^2 & \omega_8^6 & \omega_8^1 & \omega_8^5 & \omega_8^3 & \omega_8^7 \\ \omega_8^0 & \omega_8^8 & \omega_8^4 & \omega_8^{12} & \omega_8^2 & \omega_8^{10} & \omega_8^6 & \omega_8^{14} \\ \omega_8^0 & \omega_8^{12} & \omega_8^6 & \omega_8^{18} & \omega_8^3 & \omega_8^{15} & \omega_8^9 & \omega_8^{21} \\ \omega_8^0 & \omega_8^{16} & \omega_8^8 & \omega_8^{24} & \omega_8^4 & \omega_8^{20} & \omega_8^{12} & \omega_8^{28} \\ \omega_8^0 & \omega_8^{20} & \omega_8^{10} & \omega_8^{30} & \omega_8^5 & \omega_8^{25} & \omega_8^{15} & \omega_8^{35} \\ \omega_8^0 & \omega_8^{24} & \omega_8^{12} & \omega_8^{36} & \omega_8^6 & \omega_8^{30} & \omega_8^{18} & \omega_8^{42} \\ \omega_8^0 & \omega_8^{28} & \omega_8^{14} & \omega_8^{42} & \omega_8^7 & \omega_8^{35} & \omega_8^{21} & \omega_8^{49} \\ \end{pmatrix} \\[20pt] &= \frac{1}{\sqrt{8}} \begin{pmatrix} \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 \\ \omega_8^0 & \omega_8^4 & \omega_8^2 & \omega_8^6 & \omega_8^1 & \omega_8^5 & \omega_8^3 & \omega_8^7 \\ \omega_8^0 & \omega_8^0 & \omega_8^4 & \omega_8^4 & \omega_8^2 & \omega_8^2 & \omega_8^6 & \omega_8^6 \\ \omega_8^0 & \omega_8^4 & \omega_8^6 & \omega_8^2 & \omega_8^3 & \omega_8^7 & \omega_8^1 & \omega_8^5 \\ \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^0 & \omega_8^4 & \omega_8^4 & \omega_8^4 & \omega_8^4 \\ \omega_8^0 & \omega_8^4 & \omega_8^2 & \omega_8^6 & \omega_8^5 & \omega_8^1 & \omega_8^7 & \omega_8^3 \\ \omega_8^0 & \omega_8^0 & \omega_8^4 & \omega_8^4 & \omega_8^6 & \omega_8^6 & \omega_8^2 & \omega_8^2 \\ \omega_8^0 & \omega_8^4 & \omega_8^6 & \omega_8^2 & \omega_8^7 & \omega_8^3 & \omega_8^5 & \omega_8^1 \\ \end{pmatrix} \end{align} \]Understanding how to derive this Fourier matrix is the first step in understanding how to calculate the Radix-n transformation, but as you know, as it stands it is inefficient. To make the implementation efficient, the "butterfly" diagrams need to be derived, with the intention of reducing the number of multipliers consumed. Meanwhile the matrix makes a perhaps obvious statement about logic consumption, since each output is the sum of n inputs (a power of 2), and that summation needs to be implemented as a tree of n-1 adders in order to bound logic propagation delays. Also we are aiming to avoid the \(O(n^2)\) matrix multiplication cost.
Radix-2 FFT Butterflies
It is usual to find these diagrams as the sole explanation of the FFT implementation. I discovered this can be unhelpful when trying to derive the later stages of Radix-4+ FFTs. The simplifications are applied with a loss of derivation making it harder to extrapolate to higher radices.

Here the unoptimised multiplications are using the twiddle factors for repeated applications of Radix-2 FFT. The butterflies show the flow of values, but the number of multiplications can be reduced. To avoid confusion, the "bit reversal" powers are not evident for Radix-2 twiddle factors, wait until Radix-4 and more obviously Radix-8 to see how they play out. The colour scheme is intended to make clear along which lines of the flow graph the multiplications are performed. The notation is intended to match the VHDL code (later), where i() relates to the inputs to a level of hierarchy, t() relates to the outputs from recursion and hence inputs to the butterflies in stages 2+, and o() relates to the outputs. I was careless in not disambiguating "i".

Here the butterflies have been optimised by moving where the multiplications occur. Given that the first half of the twiddle factors are \(\omega^{^N/_2}_N\) separated from the second half, and that difference simplifies to value of -1, then half the multiplications can be exchanged for a trivial sign swap. The weights in red and blue correspond to the red crosses and blue circles in the roots of unity plot below. The green circles are the products derived from multiplication of one red cross and one blue circle, hence all the required weights are available and used in perfect symmetry.

Radix-2 VHDL Code
Before looking further at the derivation of the butterflies, let's look at how what we've covered so far stacks up as an implementation. Noting of course that the VHDL real type is not synthesisable, but remains a useful stepping stone for the purposes of simulating the structure and mathematics without some of the complications of using sfixed. Note also that this code is asynchronous, at this point I just want the answers that match my test data.
'real' Package
Some types and functions need to be assembled to realise the code. The package header confirms the types and which functions have been made public. Its the package body that requires some explanation.
library ieee;
use ieee.numeric_std.all;
use ieee.math_complex.all;
package fft_real_pkg is
-- Local type as ModelSim 10.5b libraries are not up to date
type complex_vector is array (integer range <>) of complex;
type complex_vector_arr_t is array (integer range<>) of complex_vector;
type natural_vector is array (integer range <>) of natural;
type natural_vector_arr_t is array (integer range<>) of integer_vector;
function bit_reverse(i : natural; bits : positive) return natural;
function array_reverse(i : complex_vector) return complex_vector;
function init_twiddles_half(num : positive) return complex_vector;
function twid_mod(
t : complex_vector;
n : natural
) return complex;
function twiddle_power(num : positive) return natural_vector;
function operation(
twids : complex_vector;
power : natural;
operand : complex;
id : string
) return complex;
function opt_pwr_arr(
powers : natural_vector;
group_width : positive
) return natural_vector_arr_t;
function part_pwr_arr(
powers : natural_vector;
group_width : positive
) return natural_vector_arr_t;
end package;
Each of the functions in the following package body are explained separately below.
library local;
package body fft_real_pkg is
function bit_reverse (i : natural; bits : positive) return natural is
variable tmp : unsigned(bits-1 downto 0) := to_unsigned(i, bits);
variable ret : unsigned(bits-1 downto 0);
begin
for j in 0 to bits-1 loop
ret(j) := tmp(tmp'high-j);
end loop;
return to_integer(ret);
end function;
function array_reverse (i : complex_vector) return complex_vector is
variable ret : complex_vector(i'range);
begin
-- Copy the input array elements across to their new location.
for j in i'range loop
ret(bit_reverse(j, local.math_pkg.ceil_log(i'length))) := i(j);
end loop;
return ret;
end function;
function init_twiddles_half(num : positive) return complex_vector is
variable ret : complex_vector(0 to num-1);
begin
-- Calculate the num'th root of unity and raise to the power of k.
for k in ret'range loop
ret(k) := (
re => ieee.math_real.cos(ieee.math_real.math_pi * real(k) / real(num)),
im => -ieee.math_real.sin(ieee.math_real.math_pi * real(k) / real(num))
);
end loop;
return ret;
end function;
function twid_mod(
t : complex_vector;
n : natural
) return complex is
variable idx : natural := n mod (2*t'length);
begin
if idx < t'length then
return t(idx);
else
-- Exploit the symmetry of FFT twiddles
return -t(idx-t'length);
end if;
end function;
function twiddle_power(num : positive) return natural_vector is
variable ret : natural_vector(0 to num-1);
begin
for i in ret'range loop
ret(i) := bit_reverse(i, local.math_pkg.ceil_log(num));
end loop;
return ret;
end function;
function operation(
twids : complex_vector;
power : natural;
operand : complex;
id : string
) return complex is
constant debug : boolean := true;
variable p : natural := power mod (2*twids'length);
begin
-- NB. minimum() function call required to deal with extreme values.
-- ** Fatal: (vsim-3421) Value 1.#INF for RE is out of range -1e+308 to 1e+308.
-- NB. maximum() function call required to deal with extreme values.
-- ** Fatal: (vsim-3421) Value -1.#INF for RE is out of range -1e+308 to 1e+308.
-- Cannot factor these out as a function which takes 'real' since the extreme values of the parameter are out of range, hence an error.
-- maximum(minimum(XX, real'high), real'low)
if p = 0 then
-- n=0, avoid multiply by 1
return operand;
elsif (4 * p) = (2 * twids'length) then
-- n=N/4, avoid multiply by -i
return (re => maximum(minimum(operand.im, real'high), real'low), im => maximum(minimum(-operand.re, real'high), real'low));
elsif (2 * p) = (2 * twids'length) then
-- n=N/2, avoid multiply by -1
return (re => maximum(minimum(-operand.re, real'high), real'low), im => maximum(minimum(-operand.im, real'high), real'low));
elsif (4 * p) = (3 * (2 * twids'length)) then
-- n=3*N/4, avoid multiply by i
return (re => maximum(minimum(-operand.im, real'high), real'low), im => maximum(minimum(operand.re, real'high), real'low));
else
-- We have to perform the multiplication
if debug then
report "Multiplier ID " & id & ": For W^n_N, N=" & integer'image(2*twids'length) & " n=" & integer'image(p) severity note;
end if;
return twid_mod(twids, p) * operand;
end if;
end function;
function opt_pwr_arr(
powers : natural_vector;
group_width : positive
) return natural_vector_arr_t is
constant radix_c : positive := powers'length;
variable opt_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to radix_c-1);
begin
col : for j in 1 to radix_c-1 loop
row : for k in opt_pwr(opt_pwr'low)'range loop -- (0 to radix_c-1)
opt_pwr(j)(k) := (powers(j)*k*group_width) mod (radix_c*group_width); -- NB. o'length = radix_c * group_width
end loop;
end loop;
return opt_pwr;
end function;
function part_pwr_arr(
powers : natural_vector;
group_width : positive
) return natural_vector_arr_t is
constant radix_c : positive := powers'length;
variable part_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to group_width-1);
begin
col : for j in 1 to radix_c-1 loop
row : for k in part_pwr(part_pwr'low)'range loop -- (0 to group_width_c-1)
part_pwr(j)(k) := (powers(j)*k) mod (radix_c*group_width); -- NB. o'length = radix_c * group_width
end loop;
end loop;
return part_pwr;
end function;
end package body;
Function 'bit_reverse'
Easiest illustrated through an example, here's the bit reversal of the indices of 8 values.
Value | Binary | Bit Reverse Binary | Bit Reversed Value |
---|---|---|---|
0 | 000 | 000 | 0 |
1 | 001 | 100 | 4 |
2 | 010 | 010 | 2 |
3 | 011 | 110 | 6 |
4 | 100 | 001 | 1 |
5 | 101 | 101 | 5 |
6 | 110 | 011 | 3 |
7 | 111 | 111 | 7 |
Function 'array_reverse'
Return an array of complex numbers with their positions permuted according to the bit reversal function above.
Function 'init_twiddles_half'
Initialise an array with the FFT twiddle factors according to the following equation.
\[ \omega^n_N=e^{\frac{-2 \pi ni}{N}} \]Note, the factors for half the unit circle only are initialised since the second half values are a trivial negative of the first half.
Function 'twid_mod'
Reduces large powers of twiddle factors using the following identity.
\[ \omega^{n+kN}_N = \omega^n_N, \enspace \forall k \in \mathbb{Z} \]Essentially it reduces the power mod N which is equivalent to not going round the unit circle repeatedly. This function also deals with the second half of the unit circle by returning the negation of the first half value.
Function 'twiddle_power'
Return the powers to raise the twiddle factors by according to the Danielson-Lanczos Lemma. Essentially for a Radix-n implementation, the sequence of numbers 0..(n-1) are individually bit reversed and returned as an array. E.g. the sequence "0, 1, 2, 3, 4, 5, 6, 7" becomes "0, 4, 2, 6, 1, 5, 3, 7" as in the columns of the table above. The need for this will become more obvious with the higher radix implementations.
Function 'operation'
Already mentioned is the popular Radix-2 simplification of \(\omega^{^N/_4}_N\). There are additional simplifications that can be applied for other weights as shown below. All of these convert multiplications to sign swaps or real and imaginary part swaps avoiding much more expensive multiplications. Here the original requirement was to provide the '+' or '-' operator based on the power of the twiddle factor, intercept the multiplication request and replace with the simpler transformation. With the Radix-4 implementation comes a similar requirement for '±i', hence function operation implements the following simplifications, else performs the actual multiplication. The fact of using a multiplication is reported during simulation for subsequent log analysis.
\[ \begin{align} \omega^0_N &= 1 \\ \omega^1_4 = \omega^2_8 = \omega^{^N/_4}_N &= -i \\ \omega^1_2 = \omega^2_4 = \omega^4_8 = \omega^{^N/_2}_N &= -1 \\ \omega^3_4 = \omega^6_8 = \omega^{^{3N}/_4}_N &= i \end{align} \]Function 'opt_pwr_arr'
Intended to be the array of "optimising powers" of the twiddle factors, i.e. those that were factored out as trivial multiplies for Radix-2 and Radix-4 and applied separately. These are the twiddle factors shown in red in the butterfly diagrams and as red crosses on the unit circle diagram.
Function 'part_pwr_arr'
Intended to be the array of "partial powers" of the twiddle factors, i.e. those that are unavoidably non-trivial multiplies. These are the twiddle factors shown in blue in the butterfly diagrams and as blue circles on the unit circle diagram.
Entity Specification
These VHDL entities are used for multiple architectures implementing different radix FFTs.
library ieee;
use ieee.math_complex.all;
use work.fft_real_pkg.all;
entity dftr_real is
generic (
log_num_inputs_g : positive -- E.g. 1024 point FFT => log_num_inputs_g = 10
);
port (
i : in complex_vector(0 to (2**log_num_inputs_g)-1);
o : out complex_vector(0 to (2**log_num_inputs_g)-1)
);
end entity;
use work.fft_real_pkg.all;
-- Perform the bit reversal on the input indices once at the top level, then call
-- the recursive component.
entity dft_real is
generic (
log_num_inputs_g : positive -- E.g. 1024 point FFT => log_num_inputs_g = 10
);
port (
i : in complex_vector(0 to (2**log_num_inputs_g)-1);
o : out complex_vector(0 to (2**log_num_inputs_g)-1)
);
end entity;
The generic log_num_inputs_g is unsurprisingly the log(number of inputs) specifically to base 2. This has been done deliberately in order to prevent instantiation of the FFT core with non powers of 2 inputs and outputs. dftr_real is the recursively instantiated entity only used after the top level entity dft_real has performed a permutation of the inputs according to "bit reversal". The architecture permuting the inputs is defined separately below because it references the specific radix2 recursive architecture to instantiate which must be compiled first.
Radix-2 Architecture
The architectures are defined in this order because the VHDL compiler needs to know about the existence of work.dftr_real(radix2) before it can be referenced. The second architecture provides the reordering of the inputs by bit reversal, and then instantiates the recursive component with the matching radix. Each architecture will be named by the radix it implements since they deliver on the same entity specification.
architecture radix2 of dftr_real is
constant twid_c : complex_vector(0 to (2**(log_num_inputs_g-1))-1) := init_twiddles_half(2**(log_num_inputs_g-1));
begin
recurse_g : if log_num_inputs_g = 1 generate
constant radix_c : positive := 2;
signal m : complex;
begin
assert false
report "Adders: " & integer'image(o'length*(radix_c-1))
severity note;
-- Perform the Radix-2 FFT on two operands
m <= i(1); -- operation(twid_c, 0, i(1));
o(0) <= i(0) + m; -- twid_c(0)
o(1) <= i(0) - m; -- -twid_c(0) == twid(1) when using init_twiddles_full
else generate
constant radix_c : positive := 2;
constant group_width_c : positive := o'length/radix_c;
constant powers_c : natural_vector := twiddle_power(radix_c);
constant opt_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to radix_c-1) := opt_pwr_arr(powers_c, group_width_c);
constant part_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1) := part_pwr_arr(powers_c, group_width_c);
signal t : complex_vector(0 to (2**log_num_inputs_g)-1);
signal m : complex_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1);
begin
-- Recurse and combine
dftr_i0 : entity work.dftr_real(radix2)
generic map (
log_num_inputs_g => log_num_inputs_g-1
)
port map (
i => i(0 to i'length/2-1),
o => t(0 to t'length/2-1)
);
dftr_i1 : entity work.dftr_real(radix2)
generic map (
log_num_inputs_g => log_num_inputs_g-1
)
port map (
i => i(i'length/2 to i'high),
o => t(t'length/2 to t'high)
);
-- Reusable multiplications
col_g : for j in m'range generate
row_g : for k in m(j)'range generate -- (0 to group_width_c-1)
constant id : string := m'instance_name & integer'image(j) & "," & integer'image(k);
begin
m(j)(k) <= operation(twid_c, part_pwr(j)(k), t(k+(j*group_width_c)), id);
end generate;
end generate;
assert false
report "Adders: " & integer'image(o'length*(radix_c-1))
severity note;
butterfly_g : for j in o'range generate
constant id : string := o'instance_name & integer'image(j);
begin
-- Optimised multiplication & sum as operation() avoids multiplications of 1, -i.
half_g : if j < group_width_c generate
-- Top Half of Butterfly
o(j) <= t(j) +
operation(twid_c, opt_pwr(1)(0), m(1)(j), id & "," & integer'image(1));
else generate
-- Bottom Half of Butterfly
o(j) <= t(j-group_width_c) +
operation(twid_c, opt_pwr(1)(1), m(1)(j-group_width_c), id & "," & integer'image(1));
end generate;
end generate;
end generate;
end architecture;
-- Just perform the bit reversal on the input indices, then call the recursive component.
architecture radix2 of dft_real is
begin
dftr_i : entity work.dftr_real(radix2)
generic map (
log_num_inputs_g => log_num_inputs_g
)
port map (
i => array_reverse(i),
o => o
);
end architecture;
There are several steps to the implementation. Firstly the part_pwr blue twiddle multiplications are calculated. The second step performs the butterfly operation with the opt_pwr red twiddle factors and sums. As we are using non-synthesisble real types, and we have no clock, pipelining has been overlooked completely, saving that for the sfixed implementation. The generate statement tests to see if recursion has completed and then instantiates a stage 1 quite trivial butterfly. Otherwise the subsequent part of this generate statement provides the recursion by splitting the work into two lower levels and then combining the results with a stage 2+ butterfly. At this point we need to generalise the stage 2+ butterflies for anything to work.
The notion of group_width needs to be introduced. This is the number of items in each half of the Radix-2 butterfly. For stage-2 of the Radix-2 FFT this means 2 items in each half, for stage 3 this means 4 items. The groupings can been seen in the optimised butterfly diagrams where the bottom half, or more generically "group", has blue diagonal and red horizontal lines. As the radix size increases, the number of groups will match the radix size, and the group width will be the number of inputs (or outputs) divided by the radix size. Check also how the opt_pwr and part_pwr arrays vary in length. Both are two dimensional. The first dimension of each is the number of blue twiddles minus the first \(\omega^0_N\) which will always be 1. The second dimension of opt_pwr relates to the Danielson-Lanczos Lemma defining the powers to use for each column of the Fourier matrix. The second dimension of part_pwr must match the group width to provide the different twiddles needed for combining butterflies in stages 2+, these relate to the blue circles on the unity circle. NB. For stage 1 the number of inputs and outputs equals the chosen radix, and hence the group width is 1. You can see the indexing of the combining stage is already getting quite involved. This will get worse as we increase the radix size and generalise further out of necessity, e.g. the number of term to sum for each output.
One code related optimisation that has not been done here is to remove the initial conditional generate statement. This will be done in the final Radix-n architecture. The plan is to remove any special cases which enumerate and demonstrate the simple butterflies and have a single combiner that copes with all butterflies. For now I wish to show my working, if only to jog my own memory.
End of Part 1
I will pause at this point as the blog software is straining with the size of post and taking a while to display the complex content. I will continue with Radix-4 and Radix-8 FFTs and show the generalised Radix-n implementation. Then move on to even more practical issues for what should be a synthesisable implementation even if no FPGA device will be large enough. I will also show how the number of arithmetic operations grows with increasing input size and radix size.
Postscript
I wish I had found this lecture sooner.
References
- Github Source Code
- Cooley-Tukey fast Fourier transform algorithm
- An Algorithm for the Machine Calculation of Complex Fourier Series, James W. Cooley and John W. Tukey - Original Paper for historical curiosity
- A DFT and FFT Tutorial
- Lecture notes: The Fast Fourier Transform; Radix–2 and Radix–4 Algorithms, Kenneth E. Barner, University of Delaware
- Digital Signal Processing Lectures by Prof. Rich Radke, Rensselaer Polytechnic Institute