Continuing from the previous part which covered a Radix-2 fully parallel if unsynthesisable FFT implementation, I will continue the progression of work to extrapolate to Radix-4, Radix-8 and using those to extrapolate to a more general a Radix-n implementation that will allow me to gather some results about the arithmetic usage when varying the number of inputs and the radix size.
- Radix-4 FFT
- Radix-4 FFT Butterflies
- Comparing Radix-4 and Radix-2 Twiddle Factors
- Radix-4 VHDL Code
- Radix-8 FFT
- Radix-8 VHDL Code
- Radix-n VHDL Code
- Arithmetic Usage
- Log Processing
- Theory
- Radix-8 Expected Size
- Results
- Radix-2 Construction
- Radix-4 Construction
- Radix-8 Construction
- Multiplication Operator Growth
- Radix-n Construction
- End of Part 2
- References
Radix-4 FFT
Radix-4 FFT Butterflies
Drawing the butterfly diagrams for Radix-4 becomes nearly impossible to provide any clarity without an A3 sheet diagram. This is probably why the literature does not explain how to perform stages 2+ of the Radix-4 FFT well. I found one useful resource only. I shall start by extruding the Radix-4 Fourier Matrix.
\[ F_4=\begin{pmatrix} \omega_4^0 & \omega_4^0 & \omega_4^0 & \omega_4^0 \\ \omega_4^0 & \omega_4^2 & \omega_4^1 & \omega_4^3 \\ \omega_4^0 & \omega_4^4 & \omega_4^2 & \omega_4^6 \\ \omega_4^0 & \omega_4^6 & \omega_4^3 & \omega_4^9 \end{pmatrix} \]Essentially this means adding more rows that follow the pattern started by F4, but I'll keep the powers within the mod 16 range. These are the unoptimised weights. NB. I hope you noticed that "optimising" the stage 1 application of any radix is a "no-op". I showed it previously for the Radix-2 as it follows the pattern, but ultimately the redrawn diagram make no difference.
\[ \begin{pmatrix} \omega_{16}^0 & \omega_{16}^0 & \omega_{16}^0 & \omega_{16}^0 \\ \omega_{16}^0 & \omega_{16}^2 & \omega_{16}^1 & \omega_{16}^3 \\ \omega_{16}^0 & \omega_{16}^4 & \omega_{16}^2 & \omega_{16}^6 \\ \omega_{16}^0 & \omega_{16}^6 & \omega_{16}^3 & \omega_{16}^9 \\ \omega_{16}^0 & \omega_{16}^8 & \omega_{16}^4 & \omega_{16}^{12} \\ \omega_{16}^0 & \omega_{16}^{10} & \omega_{16}^5 & \omega_{16}^{15} \\ \omega_{16}^0 & \omega_{16}^{12} & \omega_{16}^6 & \omega_{16}^2 \\ \omega_{16}^0 & \omega_{16}^{14} & \omega_{16}^7 & \omega_{16}^5 \\ \omega_{16}^0 & \omega_{16}^0 & \omega_{16}^8 & \omega_{16}^8 \\ \omega_{16}^0 & \omega_{16}^2 & \omega_{16}^9 & \omega_{16}^{11} \\ \omega_{16}^0 & \omega_{16}^4 & \omega_{16}^{10} & \omega_{16}^{14} \\ \omega_{16}^0 & \omega_{16}^6 & \omega_{16}^{11} & \omega_{16}^1 \\ \omega_{16}^0 & \omega_{16}^8 & \omega_{16}^{12} & \omega_{16}^4 \\ \omega_{16}^0 & \omega_{16}^{10} & \omega_{16}^{13} & \omega_{16}^7 \\ \omega_{16}^0 & \omega_{16}^{12} & \omega_{16}^{14} & \omega_{16}^{10} \\ \omega_{16}^0 & \omega_{16}^{14} & \omega_{16}^{15} & \omega_{16}^{13} \end{pmatrix} \]Consider this a collection of numbers organised in a grid, or two dimensional array rather than a matrix, it's not a Fourier Matrix. Note that we have 16 rows for 16 inputs, but four columns for the four partial products summed per row. If we were seeking to make this a 16x16 matrix there would be lots of zeros padding out twiddle factors into the correct columns for each output.
Next I'll dive for a spreadsheet, as a diagram is too cluttered. Later I will include a partially annotated diagram to show how the spreadsheet relates to the butterflies for your inspection, and to illustrate the lesson learned about how the twiddle factor powers for the 1st stage might prove misleading in application for the later stages.
Column 0 (of 0..3) with power multiple 0. | ||
Output Index | Input Group | Input Index |
---|---|---|
0 | 0 | 0 |
1 | 0 | 1 |
2 | 0 | 2 |
3 | 0 | 3 |
4 | 1 | 0 |
5 | 1 | 1 |
6 | 1 | 2 |
7 | 1 | 3 |
8 | 2 | 0 |
9 | 2 | 1 |
10 | 2 | 2 |
11 | 2 | 3 |
12 | 3 | 0 |
13 | 3 | 1 |
14 | 3 | 2 |
15 | 3 | 3 |
Column 1 (of 0..3) with power multiple 2. | |||||
Output Index | Unoptimised Twiddle Factor | Optimised Power (opt_pwr) | opt_pwr Operation | Partial Power (part_pwr) | Input Index |
---|---|---|---|---|---|
0 | 0 | 0 | +1 | 0 | 4 |
1 | 2 | 0 | +1 | 2 | 5 |
2 | 4 | 0 | +1 | 4 | 6 |
3 | 6 | 0 | +1 | 6 | 7 |
4 | 8 | 8 | -1 | 0 | 4 |
5 | 10 | 8 | -1 | 2 | 5 |
6 | 12 | 8 | -1 | 4 | 6 |
7 | 14 | 8 | -1 | 6 | 7 |
8 | 0 | 0 | +1 | 0 | 4 |
9 | 2 | 0 | +1 | 2 | 5 |
10 | 4 | 0 | +1 | 4 | 6 |
11 | 6 | 0 | +1 | 6 | 7 |
12 | 8 | 8 | -1 | 0 | 4 |
13 | 10 | 8 | -1 | 2 | 5 |
14 | 12 | 8 | -1 | 4 | 6 |
15 | 14 | 8 | -1 | 6 | 7 |
Column 2 (of 0..3) with power multiple 1. | |||||
Output Index | Unoptimised Twiddle Factor | Optimised Power (opt_pwr) | opt_pwr Operation | Partial Power (part_pwr) | Input Index |
---|---|---|---|---|---|
0 | 0 | 0 | +1 | 0 | 8 |
1 | 1 | 0 | +1 | 1 | 9 |
2 | 2 | 0 | +1 | 2 | 10 |
3 | 3 | 0 | +1 | 3 | 11 |
4 | 4 | 4 | -i | 0 | 8 |
5 | 5 | 4 | -i | 1 | 9 |
6 | 6 | 4 | -i | 2 | 10 |
7 | 7 | 4 | -i | 3 | 11 |
8 | 8 | 8 | -1 | 0 | 8 |
9 | 9 | 8 | -1 | 1 | 9 |
10 | 10 | 8 | -1 | 2 | 10 |
11 | 11 | 8 | -1 | 3 | 11 |
12 | 12 | 12 | +i | 0 | 8 |
13 | 13 | 12 | +i | 1 | 9 |
14 | 14 | 12 | +i | 2 | 10 |
15 | 15 | 12 | +i | 3 | 11 |
Column 3 (of 0..3) with power multiple 3. | |||||
Output Index | Unoptimised Twiddle Factor | Optimised Power (opt_pwr) | opt_pwr Operation | Partial Power (part_pwr) | Input Index |
---|---|---|---|---|---|
0 | 0 | 0 | +1 | 0 | 12 |
1 | 3 | 0 | +1 | 3 | 13 |
2 | 6 | 0 | +1 | 6 | 14 |
3 | 9 | 0 | +1 | 9 | 15 |
4 | 12 | 12 | +i | 0 | 12 |
5 | 15 | 12 | +i | 3 | 13 |
6 | 2 | 12 | +i | 6 | 14 |
7 | 5 | 12 | +i | 9 | 15 |
8 | 8 | 8 | -1 | 0 | 12 |
9 | 11 | 8 | -1 | 3 | 13 |
10 | 14 | 8 | -1 | 6 | 14 |
11 | 1 | 8 | -1 | 9 | 15 |
12 | 4 | 4 | -i | 0 | 12 |
13 | 7 | 4 | -i | 3 | 13 |
14 | 10 | 4 | -i | 6 | 14 |
15 | 13 | 4 | -i | 9 | 15 |
The aim with this table is to split the "Unoptimised Twiddle Factor" column into two which when combined give the same twiddle factor but when separated give one column that has re-usable multiplications. This is a factorisation problem. The spreadsheet is therefore quite large and split into a separate table for each column of the extruded 16x4 grid. The columns of the grid are taken in turn as the unoptimised twiddle factors. Hence the first table does nothing as all factors are "1", but the rows corresponding to each output are labelled with a "group" in the range 0..3. In this case the group width is 4 and the number of groups is dictated by the radix. If there had been 32 inputs in this example because of a preceeding stage of Radix-2, the group width would have increased to 8, and the tables get larger, so are best left in the spreadsheet software. The second and subsequent tables take their unoptimised twiddle factor powers from their respective grid column. Labelling the columns 0..3, column 0 has a power of 0 applied, column 1 has a power of 2 applied, column 2 has a power of 1 applied and column 3 a power of 3 applied to each twiddle factor in turn down the column. Note the column powers here "0, 2, 1, 3", which is the bit reversed order of the column number "0, 1, 2, 3". So now we have the right but unoptimised power to use in the summation of each output. Take an output index and read off across each table the unoptimised twiddle factor power, multiply by the input indexed in the final column of each table and sum these to get the result for that output. This should give the same twiddle factors as using the extruded grid except that the input indices are missing.
Next we'll factorise the unoptimised twiddle factors into two columns such that the new pair of powers sum to the unoptimised column. The two columns to examine are the "Optimised Power (opt_pwr)" and "Partial Power (part_pwr)". The other objective in separating these powers is to ensure that there are repeated {part_pwr, input index} pairs such that the same multiplication product can be used many times. All twiddle factor powers listed are mod 16 which enable the powers on each row to be summed such that part_pwrrow + opt_pwrrow = "Unoptimised Twiddle Factor"row. The optimised powers all reduce to trivial multiplications which can be substituted for sign and real-imaginary pair swaps. The results are best read off the table by examination, and give us an optimised power per group and a set of partial powers (of size groups width, it just happens in this example that the group width equals the radix so this is not obvious) that are reused on the same inputs and whose products are then subsequently combined with the optimised power before summation, with one sum operand from each table. I will now attempt to illustrate the butterflies in the next figure, from which you will deduce that there's a really good reason why you won't find many pictures illustrating the butterflies beyond stage 1.

The notation is intended to match the VHDL code (later), where i() relates to the inputs to a level of hierarchy, t() relates to the outputs from recursion and hence inputs to the butterflies in stages 2+, and o() relates to the outputs. I was careless in not disambiguating "i". The butterflies are drawn in order to depict just how busy the flow graphs now get. I have annotated only four of the weigths onto the unoptimised stage 2 diagram and these twiddle factor powers should correspond with the tables. As the optimised diagram has multiplier product re-use it is less crowded but still at the limits of readability, and only if you have already familiarised yourself the the diagrams for the Radix-2 butterflies so that you can deduce the short hand used.

I'll just pause a moment to look at the spread of the final twiddle factors used. Essentially, each optimising power (red crosses), rotates the partial powers (shown as circles) around the unit circle. The twiddle factor powers for each output are not evenly distributed around the unit circle, but become so in the macro picture over all outputs.
Repeating the spreadsheet exercise for 32 inputs (Radix-2 followed by Radix-4 gives a 32-point FFT) is instructive for discerning how the array of part_pwr powers gets longer (double in size) when the opt_pwr array does not.
Comparing Radix-4 and Radix-2 Twiddle Factors
I was initially duped into the wrong answer for recursively applying the optimised twiddle factors to stages 2+ of the Radix-4 FFT. In hindsight I see that it matters when the bit reversal is applied, for Decimation in Time (DIT) its up front, for Decimation in Frequency (DIF) its part of the end product and (optionally unpermuted by bit reversal). I was too eager to follow suit with the recursive application of the twiddle factors as for Radix-2 which is simply a 'minus' instead of a 'plus' between bottom and top halves. Surely this beautiful algorithm will just use some obvious extension of the same multiplications? Looking at the VHDL arrays returned from the VHDL functions in ModelSim, I pull out the following array values.
Now the stage 1 application has a group width of 1, hence there are no partial powers. I find it reassuring that both the opt_pwr and part_pwr arrays for the stage 2 application (above) seem to mirror the opt_pwr array for the first stage (below) once the powers are reduce mod 4.
There is also symmetry in the application of the Radix-4 opt_pwr powers that will probably differ from here from many diagrams on-line. The series of powers make more sense when looking at the application to the inputs rather than applying them to the outputs. So you will see the powers incrementing in the expect way on the input side of the stage 1 Radix-4 FFT, and you can see the agreement between stages 1 & 2 on the output side. When you consider the application of bit reversal to the inputs, you will then see how the output powers will mirror diagrams elsewhere on the Internet which may have omitted to declare their DIT/DIF origin. Phew!
For a more thorough explanation of how and why twiddle factors being the nth roots of unity works, try this YouTube video.
Radix-4 VHDL Code
Recall that I have retained my workings as commented out code so that I could retrace my steps in the derivation of the final answer. The code includes the fall back to Radix-2 when there are too few inputs to complete the recursion with Radix-4. This is hard coded at present since we have not yet derived the Radix-n implementation.
architecture radix4 of dftr_real is
constant twid_c : complex_vector(0 to (2**(log_num_inputs_g-1))-1) := init_twiddles_half(2**(log_num_inputs_g-1));
begin
recurse_g : if log_num_inputs_g = 1 generate
constant radix_c : positive := 2;
signal m : complex;
begin
-- Perform the Radix-2 FFT on two operands
m <= i(1); -- operation(twid_c, 0, i(1));
o(0) <= i(0) + m; -- twid_c(0)
o(1) <= i(0) - m; -- -twid_c(0) == twid(1) when using init_twiddles_full
elsif log_num_inputs_g = 2 generate
constant radix_c : positive := 4;
constant group_width_c : positive := o'length/radix_c; -- Always 1
constant powers_c : natural_vector := twiddle_power(radix_c);
constant opt_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to radix_c-1) := opt_pwr_arr(powers_c, group_width_c);
signal m : complex_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1);
begin
-- As group_width_c = 1, There are no roots of unity between opt_pwr roots.
m(1)(0) <= i(1); -- +1 x i(1)
m(2)(0) <= i(2); -- +1 x i(2)
m(3)(0) <= i(3); -- +1 x i(3)
assert false
report "Adders: " & integer'image(o'length*(radix_c-1))
severity note;
butterfly_g : for j in o'range generate
constant id : string := o'instance_name & integer'image(j);
begin
-- Optimised multiplication & sum as operation() avoids multiplications of 1, -i, -1 & i.
quarter_g : if j < group_width_c generate
o(j) <= i(j) +
operation(twid_c, opt_pwr(1)(0), m(1)(j), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(0), m(2)(j), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(0), m(3)(j), id & "," & integer'image(3));
elsif j < 2 * group_width_c generate
o(j) <= i(j- group_width_c ) +
operation(twid_c, opt_pwr(1)(1), m(1)(j-group_width_c), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(1), m(2)(j-group_width_c), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(1), m(3)(j-group_width_c), id & "," & integer'image(3));
elsif j < 3 * group_width_c generate
o(j) <= i(j-(2*group_width_c)) +
operation(twid_c, opt_pwr(1)(2), m(1)(j-(2*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(2), m(2)(j-(2*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(2), m(3)(j-(2*group_width_c)), id & "," & integer'image(3));
else generate
o(j) <= i(j-(3*group_width_c)) +
operation(twid_c, opt_pwr(1)(3), m(1)(j-(3*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(3), m(2)(j-(3*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(3), m(3)(j-(3*group_width_c)), id & "," & integer'image(3));
end generate;
end generate;
-- -- Perform the Radix-4 FFT on four operands. Indexing of twiddle factors has been generalised and hence is unoptimised by butterflies
-- o(0) <= i(0) + (twid_mod(twid_c, 0) * i(1)) + (twid_mod(twid_c, 0) * i(2)) + (twid_mod(twid_c, 0) * i(3));
-- o(1) <= i(0) + (twid_mod(twid_c, 2) * i(1)) + (twid_mod(twid_c, 1) * i(2)) + (twid_mod(twid_c, 3) * i(3));
-- o(2) <= i(0) + (twid_mod(twid_c, 4) * i(1)) + (twid_mod(twid_c, 2) * i(2)) + (twid_mod(twid_c, 6) * i(3));
-- o(3) <= i(0) + (twid_mod(twid_c, 6) * i(1)) + (twid_mod(twid_c, 3) * i(2)) + (twid_mod(twid_c, 9) * i(3));
else generate
constant radix_c : positive := 4;
constant group_width_c : positive := o'length/radix_c;
constant powers_c : natural_vector := twiddle_power(radix_c);
constant opt_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to radix_c-1) := opt_pwr_arr(powers_c, group_width_c);
constant part_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1) := part_pwr_arr(powers_c, group_width_c);
signal t : complex_vector(0 to (2**log_num_inputs_g)-1);
signal m : complex_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1);
begin
-- Recurse
dftr_i0 : entity work.dftr_real(radix4)
generic map (
log_num_inputs_g => log_num_inputs_g-2
)
port map (
i => i(0 to i'length/4-1),
o => t(0 to t'length/4-1)
);
dftr_i1 : entity work.dftr_real(radix4)
generic map (
log_num_inputs_g => log_num_inputs_g-2
)
port map (
i => i(i'length/4 to i'length/2-1),
o => t(t'length/4 to t'length/2-1)
);
dftr_i2 : entity work.dftr_real(radix4)
generic map (
log_num_inputs_g => log_num_inputs_g-2
)
port map (
i => i(i'length/2 to 3*i'length/4-1),
o => t(t'length/2 to 3*t'length/4-1)
);
dftr_i3 : entity work.dftr_real(radix4)
generic map (
log_num_inputs_g => log_num_inputs_g-2
)
port map (
i => i(3*i'length/4 to i'high),
o => t(3*t'length/4 to t'high)
);
-- Combine
-- Reusable multiplications
col_g : for j in m'range generate
row_g : for k in m(j)'range generate -- (0 to group_width_c-1)
constant id : string := m'instance_name & integer'image(j) & "," & integer'image(k);
begin
m(j)(k) <= operation(twid_c, part_pwr(j)(k), t(k+(j*group_width_c)), id);
end generate;
end generate;
-- Pre-multiplication
-- 32-inputs Radix-4
-- m(1)(0) <= operation(twid_c, 0, t(0+8));
-- m(1)(1) <= operation(twid_c, 2, t(1+8));
-- m(1)(2) <= operation(twid_c, 4, t(2+8));
-- m(1)(3) <= operation(twid_c, 6, t(3+8));
-- m(1)(4) <= operation(twid_c, 0, t(4+8));
-- m(1)(5) <= operation(twid_c, 2, t(5+8));
-- m(1)(6) <= operation(twid_c, 4, t(6+8));
-- m(1)(7) <= operation(twid_c, 6, t(7+8));
--
-- m(2)(0) <= operation(twid_c, 0, t(0+16));
-- m(2)(1) <= operation(twid_c, 1, t(1+16));
-- m(2)(2) <= operation(twid_c, 2, t(2+16));
-- m(2)(3) <= operation(twid_c, 3, t(3+16));
-- m(2)(4) <= operation(twid_c, 4, t(4+16));
-- m(2)(5) <= operation(twid_c, 5, t(5+16));
-- m(2)(6) <= operation(twid_c, 6, t(6+16));
-- m(2)(7) <= operation(twid_c, 7, t(7+16));
--
-- m(3)(0) <= operation(twid_c, 0, t(0+24));
-- m(3)(1) <= operation(twid_c, 3, t(1+24));
-- m(3)(2) <= operation(twid_c, 6, t(2+24));
-- m(3)(3) <= operation(twid_c, 1, t(3+24)); -- 9
-- m(3)(4) <= operation(twid_c, 4, t(4+24)); -- 12
-- m(3)(5) <= operation(twid_c, 7, t(5+24)); -- 15
-- m(3)(6) <= operation(twid_c, 2, t(6+24)); -- 18
-- m(3)(7) <= operation(twid_c, 5, t(7+24)); -- 21
-- 16-inputs Radix-4
-- m(1)(0) <= operation(twid_c, 0, t(0+4));
-- m(1)(1) <= operation(twid_c, 2, t(1+4));
-- m(1)(2) <= operation(twid_c, 0, t(2+4)); -- 4
-- m(1)(3) <= operation(twid_c, 2, t(3+4)); -- 6
--
-- m(2)(0) <= operation(twid_c, 0, t(0+8));
-- m(2)(1) <= operation(twid_c, 1, t(1+8));
-- m(2)(2) <= operation(twid_c, 2, t(2+8));
-- m(2)(3) <= operation(twid_c, 3, t(3+8));
--
-- m(3)(0) <= operation(twid_c, 0, t(0+12));
-- m(3)(1) <= operation(twid_c, 3, t(1+12));
-- m(3)(2) <= operation(twid_c, 2, t(2+12)); -- 6
-- m(3)(3) <= operation(twid_c, 1, t(3+12)); -- 9
assert false
report "Adders: " & integer'image(o'length*(radix_c-1))
severity note;
butterfly_g : for j in o'range generate
constant id : string := o'instance_name & integer'image(j);
begin
-- Perform the Radix-4 FFT on four operands. Indexing of twiddle factors has been generalised and hence is unoptimised by butterflies
-- quarter_g : if j < group_width_c generate
-- o(j) <= t(j ) + (twid_mod(twid_c, 2*j) * t(j+ group_width_c )) + (twid_mod(twid_c, j) * t(j+(2*group_width_c))) + (twid_mod(twid_c, 3*j) * t(j+(3*group_width_c)));
-- elsif j < 2 * group_width_c generate
-- o(j) <= t(j- group_width_c ) + (twid_mod(twid_c, 2*j) * t(j )) + (twid_mod(twid_c, j) * t(j+ group_width_c )) + (twid_mod(twid_c, 3*j) * t(j+(2*group_width_c)));
-- elsif j < 3 * group_width_c generate
-- o(j) <= t(j-(2*group_width_c)) + (twid_mod(twid_c, 2*j) * t(j- group_width_c )) + (twid_mod(twid_c, j) * t(j )) + (twid_mod(twid_c, 3*j) * t(j+ group_width_c ));
-- else generate
-- o(j) <= t(j-(3*group_width_c)) + (twid_mod(twid_c, 2*j) * t(j-(2*group_width_c))) + (twid_mod(twid_c, j) * t(j- group_width_c )) + (twid_mod(twid_c, 3*j) * t(j ));
-- end generate;
-- Optimised multiplication & sum as operation() avoids multiplications of 1, -i, -1 & i.
quarter_g : if j < group_width_c generate
o(j) <= t(j ) +
operation(twid_c, opt_pwr(1)(0), m(1)(j), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(0), m(2)(j), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(0), m(3)(j), id & "," & integer'image(3));
elsif j < 2 * group_width_c generate
o(j) <= t(j- group_width_c ) +
operation(twid_c, opt_pwr(1)(1), m(1)(j-group_width_c), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(1), m(2)(j-group_width_c), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(1), m(3)(j-group_width_c), id & "," & integer'image(3));
elsif j < 3 * group_width_c generate
o(j) <= t(j-(2*group_width_c)) +
operation(twid_c, opt_pwr(1)(2), m(1)(j-(2*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(2), m(2)(j-(2*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(2), m(3)(j-(2*group_width_c)), id & "," & integer'image(3));
else generate
o(j) <= t(j-(3*group_width_c)) +
operation(twid_c, opt_pwr(1)(3), m(1)(j-(3*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(3), m(2)(j-(3*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(3), m(3)(j-(3*group_width_c)), id & "," & integer'image(3));
end generate;
end generate;
end generate;
end architecture;
-- Just perform the bit reversal on the input indices, then call the recursive component.
architecture radix4 of dft_real is
begin
dftr_i : entity work.dftr_real(radix4)
generic map (
log_num_inputs_g => log_num_inputs_g
)
port map (
i => array_reverse(i),
o => o
);
end architecture;
Radix-8 FFT
The butterfly diagrams will not be drawn. If they are required, it is a spreadsheet exercise, but actually the step to Radix-4 stage 2+ butterflies means that the step to Radix-8 was done by following the trend and proving that worked. Getting the Radix-8 FFT working was the final confirmatory step for creating the Radix-n architecture.
Radix-8 VHDL Code
architecture radix8 of dftr_real is
constant twid_c : complex_vector(0 to (2**(log_num_inputs_g-1))-1) := init_twiddles_half(2**(log_num_inputs_g-1));
begin
recurse_g : if log_num_inputs_g = 1 generate
constant radix_c : positive := 2;
signal m : complex;
begin
assert false
report "Adders: " & integer'image(o'length*(radix_c-1))
severity note;
-- Perform the Radix-2 FFT on two operands
m <= i(1); -- operation(twid_c, 0, i(1));
o(0) <= i(0) + m; -- twid_c(0)
o(1) <= i(0) - m; -- -twid_c(0) == twid(1) when using init_twiddles_full
elsif log_num_inputs_g = 2 generate
constant radix_c : positive := 4;
constant group_width_c : natural := o'length/radix_c; -- Always 1
constant powers_c : natural_vector := twiddle_power(radix_c);
constant opt_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to radix_c-1) := opt_pwr_arr(powers_c, group_width_c);
signal m : complex_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1);
begin
-- As group_width_c = 1, There are no roots of unity between opt_pwr roots.
m(1)(0) <= i(1); -- +1 x i(1)
m(2)(0) <= i(2); -- +1 x i(2)
m(3)(0) <= i(3); -- +1 x i(3)
-- Perform the Radix-4 FFT on four operands.
-- -- +1 +1 +1 +1
-- o(0) <= i(0) + opt_pwr(1)(0) + opt_pwr(2)(0) + opt_pwr(3)(0);
-- -- +1 -1 -i +i
-- o(1) <= i(0) + opt_pwr(1)(1) + opt_pwr(2)(1) + opt_pwr(3)(1);
-- -- +1 +1 -1 -1
-- o(2) <= i(0) + opt_pwr(1)(0) + opt_pwr(2)(0) + opt_pwr(3)(0);
-- -- +1 -1 +i -i
-- o(3) <= i(0) + opt_pwr(1)(1) + opt_pwr(2)(1) + opt_pwr(3)(1);
assert false
report "Adders: " & integer'image(o'length*(radix_c-1))
severity note;
butterfly_g : for j in o'range generate
constant id : string := o'instance_name & integer'image(j);
begin
-- Optimised multiplication & sum as operation() avoids multiplications of 1, -i, -1 & i.
-- As group_width_c = 1, (j-(group*group_width_c)) = 0.
quarter_g : if j < group_width_c generate
o(j) <= i(j) +
operation(twid_c, opt_pwr(1)(0), m(1)(j), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(0), m(2)(j), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(0), m(3)(j), id & "," & integer'image(3));
elsif j < 2 * group_width_c generate
o(j) <= i(j-group_width_c) +
operation(twid_c, opt_pwr(1)(1), m(1)(j-group_width_c), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(1), m(2)(j-group_width_c), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(1), m(3)(j-group_width_c), id & "," & integer'image(3));
elsif j < 3 * group_width_c generate
o(j) <= i(j-(2*group_width_c)) +
operation(twid_c, opt_pwr(1)(2), m(1)(j-(2*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(2), m(2)(j-(2*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(2), m(3)(j-(2*group_width_c)), id & "," & integer'image(3));
else generate
o(j) <= i(j-(3*group_width_c)) +
operation(twid_c, opt_pwr(1)(3), m(1)(j-(3*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(3), m(2)(j-(3*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(3), m(3)(j-(3*group_width_c)), id & "," & integer'image(3));
end generate;
end generate;
elsif log_num_inputs_g = 3 generate
constant radix_c : positive := 8;
constant group_width_c : natural := o'length/radix_c; -- Always 1
constant powers_c : natural_vector := twiddle_power(radix_c);
constant opt_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to radix_c-1) := opt_pwr_arr(powers_c, group_width_c);
signal m : complex_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1);
begin
-- Perform the Radix-8 FFT on eight operands. Indexing of twiddle factors has been generalised,
-- leaving simplifications to a function 'twid_mod'.
-- o(0) <= i(0) + (twid_mod(twid_c, 0) * i(1)) + (twid_mod(twid_c, 0) * i(2)) + (twid_mod(twid_c, 0) * i(3)) + (twid_mod(twid_c, 0) * i(4)) + (twid_mod(twid_c, 0) * i(5)) + (twid_mod(twid_c, 0) * i(6)) + (twid_mod(twid_c, 0) * i(7));
-- o(1) <= i(0) + (twid_mod(twid_c, 4) * i(1)) + (twid_mod(twid_c, 2) * i(2)) + (twid_mod(twid_c, 6) * i(3)) + (twid_mod(twid_c, 1) * i(4)) + (twid_mod(twid_c, 5) * i(5)) + (twid_mod(twid_c, 3) * i(6)) + (twid_mod(twid_c, 7) * i(7));
-- o(2) <= i(0) + (twid_mod(twid_c, 8) * i(1)) + (twid_mod(twid_c, 4) * i(2)) + (twid_mod(twid_c, 12) * i(3)) + (twid_mod(twid_c, 2) * i(4)) + (twid_mod(twid_c, 10) * i(5)) + (twid_mod(twid_c, 6) * i(6)) + (twid_mod(twid_c, 14) * i(7));
-- o(3) <= i(0) + (twid_mod(twid_c, 12) * i(1)) + (twid_mod(twid_c, 6) * i(2)) + (twid_mod(twid_c, 18) * i(3)) + (twid_mod(twid_c, 3) * i(4)) + (twid_mod(twid_c, 15) * i(5)) + (twid_mod(twid_c, 9) * i(6)) + (twid_mod(twid_c, 21) * i(7));
-- o(4) <= i(0) + (twid_mod(twid_c, 16) * i(1)) + (twid_mod(twid_c, 8) * i(2)) + (twid_mod(twid_c, 24) * i(3)) + (twid_mod(twid_c, 4) * i(4)) + (twid_mod(twid_c, 20) * i(5)) + (twid_mod(twid_c, 12) * i(6)) + (twid_mod(twid_c, 28) * i(7));
-- o(5) <= i(0) + (twid_mod(twid_c, 20) * i(1)) + (twid_mod(twid_c, 10) * i(2)) + (twid_mod(twid_c, 30) * i(3)) + (twid_mod(twid_c, 5) * i(4)) + (twid_mod(twid_c, 25) * i(5)) + (twid_mod(twid_c, 15) * i(6)) + (twid_mod(twid_c, 35) * i(7));
-- o(6) <= i(0) + (twid_mod(twid_c, 24) * i(1)) + (twid_mod(twid_c, 12) * i(2)) + (twid_mod(twid_c, 36) * i(3)) + (twid_mod(twid_c, 6) * i(4)) + (twid_mod(twid_c, 30) * i(5)) + (twid_mod(twid_c, 18) * i(6)) + (twid_mod(twid_c, 42) * i(7));
-- o(7) <= i(0) + (twid_mod(twid_c, 28) * i(1)) + (twid_mod(twid_c, 14) * i(2)) + (twid_mod(twid_c, 42) * i(3)) + (twid_mod(twid_c, 7) * i(4)) + (twid_mod(twid_c, 35) * i(5)) + (twid_mod(twid_c, 21) * i(6)) + (twid_mod(twid_c, 49) * i(7));
-- As group_width_c = 1, There are no roots of unity between opt_pwr roots.
m(1)(0) <= i(1); -- operation(twid_c, powers_c(1)*0, i(1))
m(2)(0) <= i(2); -- operation(twid_c, powers_c(2)*0, i(2))
m(3)(0) <= i(3); -- operation(twid_c, powers_c(3)*0, i(3))
m(4)(0) <= i(4); -- operation(twid_c, powers_c(4)*0, i(4))
m(5)(0) <= i(5); -- operation(twid_c, powers_c(5)*0, i(5))
m(6)(0) <= i(6); -- operation(twid_c, powers_c(6)*0, i(6))
m(7)(0) <= i(7); -- operation(twid_c, powers_c(7)*0, i(7))
assert false
report "Adders: " & integer'image(o'length*(radix_c-1))
severity note;
-- Perform the Radix-8 FFT on eight operands. As group_width_c = 1, (j-(group*group_width_c)) = 0.
butterfly_g : for j in o'range generate
constant id : string := o'instance_name & integer'image(j);
begin
eigth_g : if j < group_width_c generate
-- assert false report "m(?)(" & integer'image(j) & ")" severity note;
o(j) <= i(j) +
operation(twid_c, opt_pwr(1)(0), m(1)(j), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(0), m(2)(j), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(0), m(3)(j), id & "," & integer'image(3)) +
operation(twid_c, opt_pwr(4)(0), m(4)(j), id & "," & integer'image(4)) +
operation(twid_c, opt_pwr(5)(0), m(5)(j), id & "," & integer'image(5)) +
operation(twid_c, opt_pwr(6)(0), m(6)(j), id & "," & integer'image(6)) +
operation(twid_c, opt_pwr(7)(0), m(7)(j), id & "," & integer'image(7));
elsif j < 2 * group_width_c generate
-- assert false report "m(?)(" & integer'image(j-group_width_c) & ")" severity note;
o(j) <= i(j-group_width_c) +
operation(twid_c, opt_pwr(1)(1), m(1)(j-group_width_c), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(1), m(2)(j-group_width_c), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(1), m(3)(j-group_width_c), id & "," & integer'image(3)) +
operation(twid_c, opt_pwr(4)(1), m(4)(j-group_width_c), id & "," & integer'image(4)) +
operation(twid_c, opt_pwr(5)(1), m(5)(j-group_width_c), id & "," & integer'image(5)) +
operation(twid_c, opt_pwr(6)(1), m(6)(j-group_width_c), id & "," & integer'image(6)) +
operation(twid_c, opt_pwr(7)(1), m(7)(j-group_width_c), id & "," & integer'image(7));
elsif j < 3 * group_width_c generate
-- assert false report "m(?)(" & integer'image(j-(2*group_width_c)) & ")" severity note;
o(j) <= i(j-(2*group_width_c)) +
operation(twid_c, opt_pwr(1)(2), m(1)(j-(2*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(2), m(2)(j-(2*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(2), m(3)(j-(2*group_width_c)), id & "," & integer'image(3)) +
operation(twid_c, opt_pwr(4)(2), m(4)(j-(2*group_width_c)), id & "," & integer'image(4)) +
operation(twid_c, opt_pwr(5)(2), m(5)(j-(2*group_width_c)), id & "," & integer'image(5)) +
operation(twid_c, opt_pwr(6)(2), m(6)(j-(2*group_width_c)), id & "," & integer'image(6)) +
operation(twid_c, opt_pwr(7)(2), m(7)(j-(2*group_width_c)), id & "," & integer'image(7));
elsif j < 4 * group_width_c generate
-- assert false report "m(?)(" & integer'image(j-(3*group_width_c)) & ")" severity note;
o(j) <= i(j-(3*group_width_c)) +
operation(twid_c, opt_pwr(1)(3), m(1)(j-(3*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(3), m(2)(j-(3*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(3), m(3)(j-(3*group_width_c)), id & "," & integer'image(3)) +
operation(twid_c, opt_pwr(4)(3), m(4)(j-(3*group_width_c)), id & "," & integer'image(4)) +
operation(twid_c, opt_pwr(5)(3), m(5)(j-(3*group_width_c)), id & "," & integer'image(5)) +
operation(twid_c, opt_pwr(6)(3), m(6)(j-(3*group_width_c)), id & "," & integer'image(6)) +
operation(twid_c, opt_pwr(7)(3), m(7)(j-(3*group_width_c)), id & "," & integer'image(7));
elsif j < 5 * group_width_c generate
-- assert false report "m(?)(" & integer'image(j-(4*group_width_c)) & ")" severity note;
o(j) <= i(j-(4*group_width_c)) +
operation(twid_c, opt_pwr(1)(4), m(1)(j-(4*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(4), m(2)(j-(4*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(4), m(3)(j-(4*group_width_c)), id & "," & integer'image(3)) +
operation(twid_c, opt_pwr(4)(4), m(4)(j-(4*group_width_c)), id & "," & integer'image(4)) +
operation(twid_c, opt_pwr(5)(4), m(5)(j-(4*group_width_c)), id & "," & integer'image(5)) +
operation(twid_c, opt_pwr(6)(4), m(6)(j-(4*group_width_c)), id & "," & integer'image(6)) +
operation(twid_c, opt_pwr(7)(4), m(7)(j-(4*group_width_c)), id & "," & integer'image(7));
elsif j < 6 * group_width_c generate
-- assert false report "m(?)(" & integer'image(j-(5*group_width_c)) & ")" severity note;
o(j) <= i(j-(5*group_width_c)) +
operation(twid_c, opt_pwr(1)(5), m(1)(j-(5*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(5), m(2)(j-(5*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(5), m(3)(j-(5*group_width_c)), id & "," & integer'image(3)) +
operation(twid_c, opt_pwr(4)(5), m(4)(j-(5*group_width_c)), id & "," & integer'image(4)) +
operation(twid_c, opt_pwr(5)(5), m(5)(j-(5*group_width_c)), id & "," & integer'image(5)) +
operation(twid_c, opt_pwr(6)(5), m(6)(j-(5*group_width_c)), id & "," & integer'image(6)) +
operation(twid_c, opt_pwr(7)(5), m(7)(j-(5*group_width_c)), id & "," & integer'image(7));
elsif j < 7 * group_width_c generate
-- assert false report "m(?)(" & integer'image(j-(6*group_width_c)) & ")" severity note;
o(j) <= i(j-(6*group_width_c)) +
operation(twid_c, opt_pwr(1)(6), m(1)(j-(6*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(6), m(2)(j-(6*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(6), m(3)(j-(6*group_width_c)), id & "," & integer'image(3)) +
operation(twid_c, opt_pwr(4)(6), m(4)(j-(6*group_width_c)), id & "," & integer'image(4)) +
operation(twid_c, opt_pwr(5)(6), m(5)(j-(6*group_width_c)), id & "," & integer'image(5)) +
operation(twid_c, opt_pwr(6)(6), m(6)(j-(6*group_width_c)), id & "," & integer'image(6)) +
operation(twid_c, opt_pwr(7)(6), m(7)(j-(6*group_width_c)), id & "," & integer'image(7));
else generate
-- assert false report "m(?)(" & integer'image(j-(7*group_width_c)) & ")" severity note;
o(j) <= i(j-(7*group_width_c)) +
operation(twid_c, opt_pwr(1)(7), m(1)(j-(7*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(7), m(2)(j-(7*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(7), m(3)(j-(7*group_width_c)), id & "," & integer'image(3)) +
operation(twid_c, opt_pwr(4)(7), m(4)(j-(7*group_width_c)), id & "," & integer'image(4)) +
operation(twid_c, opt_pwr(5)(7), m(5)(j-(7*group_width_c)), id & "," & integer'image(5)) +
operation(twid_c, opt_pwr(6)(7), m(6)(j-(7*group_width_c)), id & "," & integer'image(6)) +
operation(twid_c, opt_pwr(7)(7), m(7)(j-(7*group_width_c)), id & "," & integer'image(7));
end generate;
end generate;
else generate
constant radix_c : positive := 8;
constant group_width_c : natural := o'length/radix_c;
constant powers_c : natural_vector := twiddle_power(radix_c);
constant opt_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to radix_c-1) := opt_pwr_arr(powers_c, group_width_c);
constant part_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1) := part_pwr_arr(powers_c, group_width_c);
signal t : complex_vector(0 to (2**log_num_inputs_g)-1);
signal m : complex_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1);
begin
-- Recurse and combine
dft_g : for j in 0 to radix_c-1 generate
dftr_i : entity work.dftr_real(radix8)
generic map (
log_num_inputs_g => log_num_inputs_g-3
)
port map (
i => i(j*i'length/8 to (j+1)*i'length/8-1),
o => t(j*t'length/8 to (j+1)*t'length/8-1)
);
end generate;
-- Combine
-- Reusable multiplications
col_g : for j in m'range generate -- (1 to radix_c-1)
row_g : for k in m(j)'range generate -- (0 to group_width_c-1)
constant id : string := m'instance_name & integer'image(j) & "," & integer'image(k);
begin
m(j)(k) <= operation(twid_c, part_pwr(j)(k), t(k+(j*group_width_c)), id);
end generate;
end generate;
-- m(1)(0) <= operation(twid_c, 0, i(1)); -- +1 x i(1)
-- m(1)(1) <= operation(twid_c, 4, i(1)); -- -1 x i(1),
-- m(1)(2) <= m(1)(0); -- 8
-- m(1)(3) <= m(1)(1); -- 12
--
-- m(2)(0) <= operation(twid_c, 0, i(2)); -- +1 x i(2)
-- m(2)(1) <= operation(twid_c, 2, i(2)); -- -i x i(2)
-- m(2)(2) <= operation(twid_c, 4, i(2)); -- -1 x i(2)
-- m(2)(3) <= operation(twid_c, 6, i(2)); -- +i x i(2)
--
-- m(3)(0) <= operation(twid_c, 0, i(3)); -- +1 x i(3)
-- m(3)(1) <= operation(twid_c, 6, i(3)); -- +i x i(3)
-- m(3)(2) <= operation(twid_c, 12, i(3)); -- -1 x i(3)
-- m(3)(3) <= operation(twid_c, 18, i(3)); -- -i x i(3)
--
-- m(4)(0) <= operation(twid_c, 0, i(4)); -- +1 x i(4)
-- m(4)(1) <= operation(twid_c, 1, i(4));
-- m(4)(2) <= operation(twid_c, 2, i(4)); -- -i x i(4)
-- m(4)(3) <= operation(twid_c, 3, i(4));
--
-- m(5)(0) <= operation(twid_c, 0, i(5)); -- +1 x i(5)
-- m(5)(1) <= operation(twid_c, 5, i(5));
-- m(5)(2) <= operation(twid_c, 10, i(5)); -- -i x i(5)
-- m(5)(3) <= operation(twid_c, 15, i(5));
--
-- m(6)(0) <= operation(twid_c, 0, i(6)); -- +1 x i(6)
-- m(6)(1) <= operation(twid_c, 3, i(6));
-- m(6)(2) <= operation(twid_c, 6, i(6)); -- +i x i(6)
-- m(6)(3) <= operation(twid_c, 9, i(6));
--
-- m(7)(0) <= operation(twid_c, 0, i(7)); -- +1 x i(7)
-- m(7)(1) <= operation(twid_c, 7, i(7));
-- m(7)(2) <= operation(twid_c, 14, i(7));
-- m(7)(3) <= operation(twid_c, 21, i(7));
assert false
report "Adders: " & integer'image(o'length*(radix_c-1))
severity note;
butterfly_g : for j in o'range generate
constant group_c : natural := (j / group_width_c);
constant id : string := o'instance_name & integer'image(j);
begin
-- Optimised multiplication & sum as operation() avoids multiplications of 1, -i, -1 & i.
o(j) <= t(j-(group_c*group_width_c)) +
operation(twid_c, opt_pwr(1)(group_c), m(1)(j-(group_c*group_width_c)), id & "," & integer'image(1)) +
operation(twid_c, opt_pwr(2)(group_c), m(2)(j-(group_c*group_width_c)), id & "," & integer'image(2)) +
operation(twid_c, opt_pwr(3)(group_c), m(3)(j-(group_c*group_width_c)), id & "," & integer'image(3)) +
operation(twid_c, opt_pwr(4)(group_c), m(4)(j-(group_c*group_width_c)), id & "," & integer'image(4)) +
operation(twid_c, opt_pwr(5)(group_c), m(5)(j-(group_c*group_width_c)), id & "," & integer'image(5)) +
operation(twid_c, opt_pwr(6)(group_c), m(6)(j-(group_c*group_width_c)), id & "," & integer'image(6)) +
operation(twid_c, opt_pwr(7)(group_c), m(7)(j-(group_c*group_width_c)), id & "," & integer'image(7));
end generate;
end generate;
end architecture;
-- Just perform the bit reversal on the input indices, then call the recursive component.
architecture radix8 of dft_real is
begin
dftr_i : entity work.dftr_real(radix8)
generic map (
log_num_inputs_g => log_num_inputs_g
)
port map (
i => array_reverse(i),
o => o
);
end architecture;
Note that when the optimising twiddle factors do not simplify to something trivial, the VHDL code presently ends up describing two multiplications in series without any registers to pipeline the logic and break the propagation delay. Later it becomes obvious (if it wasn't already) that a Radix-8 implementation is not favourable, so this issue is noted and ignored.
Radix-n VHDL Code
use work.fft_real_pkg.all;
-- Perform the bit reversal on the input indices once at the top level, then call
-- the recursive component.
entity dft_multi_radix_real is
generic (
log_num_inputs_g : positive; -- E.g. 1024 point FFT => log_num_inputs_g = 10
max_radix_g : positive
);
port (
i : in complex_vector(0 to (2**log_num_inputs_g)-1);
o : out complex_vector(0 to (2**log_num_inputs_g)-1)
);
end entity;
use work.fft_real_pkg.all;
entity dftr_multi_radix_real is
generic (
log_num_inputs_g : positive; -- E.g. 1024 point FFT => log_num_inputs_g = 10. Using the logarithm here ensures an erroneous inputs width cannot be selected.
max_radix_g : positive -- Not using the logarithm here which is inconsistent! Conventionally everyone refers to Radix-2, Radix-4, Radix-8 etc.
);
port (
i : in complex_vector(0 to (2**log_num_inputs_g)-1);
o : out complex_vector(0 to (2**log_num_inputs_g)-1)
);
end entity;
library ieee;
use ieee.math_complex.all;
library local;
architecture radix_n of dftr_multi_radix_real is
constant radix_c : positive := minimum(max_radix_g, 2**log_num_inputs_g);
constant group_width_c : natural := o'length/radix_c;
constant twid_c : complex_vector(0 to (2**(log_num_inputs_g-1))-1) := init_twiddles_half(2**(log_num_inputs_g-1));
constant powers_c : natural_vector := twiddle_power(radix_c);
constant opt_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to radix_c-1) := opt_pwr_arr(powers_c, group_width_c);
constant part_pwr : natural_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1) := part_pwr_arr(powers_c, group_width_c);
signal t : complex_vector(0 to (2**log_num_inputs_g)-1);
signal m : complex_vector_arr_t(1 to radix_c-1)(0 to group_width_c-1);
begin
recurse_g : if (2**log_num_inputs_g) <= radix_c generate
-- No recursion, just map the inputs to the intermediate values
t <= i;
else generate
-- Recurse with radix_c instantiations
dft_g : for j in 0 to radix_c-1 generate
dftr_i : entity work.dftr_multi_radix_real
generic map (
log_num_inputs_g => log_num_inputs_g-local.math_pkg.ceil_log(radix_c),
max_radix_g => max_radix_g
)
port map (
i => i(j*i'length/radix_c to (j+1)*i'length/radix_c-1),
o => t(j*t'length/radix_c to (j+1)*t'length/radix_c-1)
);
end generate;
end generate;
-- Reusable multiplications
col_g : for j in m'range generate -- (1 to radix_c-1)
row_g : for k in m(j)'range generate -- (0 to group_width_c-1)
constant id : string := m'instance_name & integer'image(j) & "," & integer'image(k);
begin
m(j)(k) <= operation(twid_c, part_pwr(j)(k), t(k+(j*group_width_c)), id);
end generate;
end generate;
-- Combine with the radix_c butterfly
process(t, m)
variable s_v : complex := (re => 0.0, im => 0.0);
variable group_c : natural := 1;
begin
assert false
report "Adders: " & integer'image(o'length*(radix_c-1))
severity note;
for j in o'range loop -- j'th output
group_c := (j / group_width_c);
-- Generate the radix_c values to sum, and sum them cumulatively
s_v := t(j-(group_c*group_width_c));
for k in 1 to radix_c-1 loop
s_v := s_v + operation(twid_c, opt_pwr(k)(group_c), m(k)(j-(group_c*group_width_c)), m'instance_name & integer'image(j) & "," & integer'image(k));
end loop;
o(j) <= s_v;
end loop;
end process;
end architecture;
library local;
-- Just perform the bit reversal on the input indices, then call the recursive component.
architecture radix_n of dft_multi_radix_real is
begin
assert 2**local.math_pkg.ceil_log(max_radix_g) = max_radix_g
report "max_radix_g must be a power of 2."
severity error;
dftr_i : entity work.dftr_multi_radix_real
generic map (
log_num_inputs_g => log_num_inputs_g,
max_radix_g => max_radix_g
)
port map (
i => array_reverse(i),
o => o
);
end architecture;
The move to Radix-n results in a contraction of the code.
- The summations for each output are now performed by 'for' loop. This means the calculations for each output results in a nested 'for' loop. The indices used to index any array are now so generalised that it becomes too obfuscated to see how they might have been derived.
- Recursion is not 'terminated' in its own special case clause, merely the recursive instantiations are omitted at leaf nodes. We no longer require the stage 1 butterflies to bootstrap our understanding and their implementation is lost in the general combiner for any number of inputs for any radix.
- The max_radix_g generic is provided to specify the intended radix to use at each stage of recursion. It is a maximum because a smaller radix may be used in order to finish off the recursion with the remaining inputs that do not fill an instantiation of the desired radix size. As we now have multi-radix support we can recurse on the same Radix-n component simply by specifying the radix as a generic value. The inconsistency here is that whilst the number of inputs is specified as the log of the size, the radix is not just because of the user familiarity with the FFT terminology. Your mileage may vary here, each to their own OCD.
Arithmetic Usage
The desire is to check that my implementation so far obeys the "laws of FFT" by producing the right number of arithmetic operators, i.e. multipliers and adders.
Log Processing
The VHDL code has been instrumented with assert and report statements to log their instantiation or use. This actually causes a little grief because an instantiation produces a single log, but not always with a sufficiently unique path, whilst calling a function logs the arithmetic usage every time the function is called during simulation and needs de-duplicating. For this reason the multipliers were given a unique ID on which to de-duplicate the results. A Bash script running under Cygwin was used to parse the results.
Theory
For a p-point FFT the number of multipliers and adders in the fully parallel design should be given by the following equations.
Radix | Complex Multipliers | Complex Adders |
---|---|---|
2 | \(\frac{p}{2} \log_2(p)\) | \(p \log_2(p)\) |
4 | \(\frac{3p}{4} \log_4(p)\) | \(3p \log_4(p)=\frac{3}{2}p \log_2(p)\) |
8 | \(\frac{4\times7p}{8} \log_8(p)\) | \(7p \log_8(p)=\frac{7}{3}p \log_2(p)\) |
n | \( \begin{cases} \frac{(n-1)p}{n} \log_n(p) & \text{if } n \leq 4\\ \frac{(n-1)(n-4)p}{n} \log_n(p) & \text{if } n > 4 \end{cases} \) | \((n-1)p \log_n(p)=\frac{n-1}{\log_2(n)}p \log_2(p)\) |
When adding n items, (n-1) two input additions are required. The number of products to sum at each output is equal to the radix. The log term is the depth of recursion, or the number of layers of application of the Radix-n FFT. I therefore believe the Digital Signal Processing: A User's Guide to be incorrect in this regard as each Radix-4 butterfly requires 12 additions not 8 as stated.
I note that the theoretical ratio of Radix-4 multipliers to Radix-2, given by \({{\frac{p}{2} \log_2(p)} \over {\frac{3p}{4} \log_4(p)}}=\frac{3}{4}\). Also the theoretical ratio of Radix-4 adders to Radix-2, given by \({{3p \log_4(p)} \over {p \log_2(p)}}=\frac{3}{2}\). These can be verified with the results below.
Radix-8 Expected Size
I am a little uncertain I have correctly identified the theoretical formula for the number of multipliers in the Radix-8 and hence Radix-n FFT. The number of multipliers for the partial powers continues to be \(\frac{n-1}{n}\) (for Radix-n) per instantiations, and now of the 8 optimising powers 4 are not, 4 are now non-trivial multipliers. Therefore, each instantiation now has \(\frac{4\times7}{8}=3.5\) multipliers and the formula \(\frac{7p}{2} \log_8(p)\) is a good fit, consistent with the results from Radix-2 and Radix-4. I have been unable to confirm this deduction from any online literature, but the measured results do look consistent.
Results
Radices 2, 4 & 8 results tables below show the number of complex arithmetic operators used, and a single chart followind the three tables shows their relative growth. Finally use is made of the Radix-n code to illustrate a fixed 64-point FFT performed by varying the radix to confirm what the literature tells us to expect.
Wherever the terms "multipliers" or "adders" are referenced in the results I mean more precisely "complex multipliers" and "complex adders", which clearly doubles the figures to cater for real and imaginary parts of the numbers.
Radix-2 Construction
Inputs | Matrix Multiply | Actual Multipliers | Theoretical Multipliers | Ratio (Actual / Theoretical) | Adders |
---|---|---|---|---|---|
2 | 4 | 0 | 1 | 0.0% | 2 |
4 | 16 | 0 | 4 | 0.0% | 8 |
8 | 64 | 2 | 12 | 16.7% | 24 |
16 | 256 | 10 | 32 | 31.3% | 64 |
32 | 1024 | 34 | 80 | 42.5% | 160 |
64 | 4,096 | 98 | 192 | 51.0% | 384 |
128 | 16,384 | 258 | 448 | 57.6% | 896 |
256 | 65,536 | 642 | 1024 | 62.7% | 2,048 |
512 | 262,144 | 1,538 | 2,304 | 66.8% | 4,608 |
1,024 | 1,048,576 | 3,586 | 5,120 | 70.0% | 10,240 |
2,048 | 4,194,304 | 8,194 | 11,264 | 72.7% | 22,528 |
4,096 | 16,777,216 | 18,434 | 24,576 | 75.0% | 49,152 |
8,192 | 67,108,864 | 40,962 | 53,248 | 76.9% | 106,496 |
16,384 | 268,435,456 | 90,114 | 114,688 | 78.6% | 229,376 |
32,768 | 1,073,741,824 | 196,610 | 245,760 | 80.0% | 491,520 |
65,536 | 4,294,967,296 | 425,986 | 524,288 | 81.3% | 1,048,576 |
The difference between the theoretical and actual number of multipliers can be accounted for simply by the number of partial powers that happen to turn into trivial multipliers opportunistically rather than strategically (by planning with optimising powers).The design grows at the required \(p \log(p)\) rate for a p-point FFT. There are no differences between the expected and actual number of adders.
Radix-4 Construction
Inputs | Matrix Multiply | Actual Multipliers | Theoretical Multipliers | Ratio (Actual / Theoretical) | Savings over Radix-2 | Adders |
---|---|---|---|---|---|---|
4 | 16 | 0 | 3 | 0.0% | 12 | |
16 | 256 | 8 | 24 | 33.3% | 80.0% | 96 |
64 | 4,096 | 76 | 144 | 52.8% | 77.6% | 576 |
256 | 65,536 | 492 | 768 | 64.1% | 76.6% | 3,072 |
1,024 | 1,048,576 | 2,732 | 3,840 | 71.1% | 76.2% | 15,360 |
4,096 | 16,777,216 | 13,996 | 18,432 | 75.9% | 75.9% | 73,728 |
16,384 | 268,435,456 | 68,268 | 86,016 | 79.4% | 75.8% | 344,064 |
65,536 | 4,294,967,296 | 322,220 | 393,216 | 81.9% | 75.6% | 1,572,864 |
As with the Radix-2 results, the difference between the theoretical and actual number of multipliers can be accounted for simply by the number of partial powers that happen to turn into trivial multipliers opportunistically rather than strategically (by planning with optimising powers).The design grows at the required \(p \log(p)\) rate for a p-point FFT. There are no differences between the expected and actual number of adders.
Theory tells us to expect a saving with Radix-4 over Radix-2 of 25%, and that is what we seem to get here asymptotically as the point size of the FFT tends to infinity. The growth in adders is steeper with Radix-4 than with Radix-2, 50% extra, so you trade multipliers for adders. These results agree with the theory.
Radix-8 Construction
Inputs | Matrix Multiply | Actual Multipliers | Theoretical Multipliers | Ratio (Actual / Theoretical) | Savings over Radix-2 | Adders |
---|---|---|---|---|---|---|
8 | 64 | 16 | 28 | 57.1% | 800% | 56 |
64 | 4,096 | 304 | 448 | 67.9% | 310% | 896 |
512 | 262,144 | 3,896 | 5,376 | 72.5% | 253% | 10,752 |
4,096 | 16,777,216 | 42,936 | 57,344 | 74.9% | 233% | 114,688 |
32,768 | 1,073,741,824 | 437,688 | 573,440 | 76.3% | 223% | 1,146,880 |
The Radix-8 design grows at the required \(p \log(p)\) rate for a p-point FFT. For Radix-8 we have the increase in multipliers over Radix-2 that literature tells us to expect from the use of non-trivial multipliers in what was previously termed by me as the "optimising powers". Both the number of multipliers and adders now grow as 7⁄3 when compared to Radix-2, with one slight concern for the 32768-point FFT. This particular seems to be slightly more efficient in multipliers at 22⁄9ths rather than 23⁄9ths the size of the Radix-2 equivalent. I offer no further explanation beyond having a smaller number of further opportunistic optimisations.
Multiplication Operator Growth
Bringing the data from the previous three tables together, this chart illustrates the growth in complex multipliers for radices 2, 4 & 8. The graph is plotted as log-log axes because its the best way to show the trends of all the plots and control the large numbers involved.

Radix-n Construction
Taking the example of a 64-point FFT which can be neatly implemented in radices 2, 4, 6 & 64 as an unmixed (pure?) FFT, I can graphically show how the resource consumption grows with increasing radix.
Radix | Actual Multipliers | Theoretical Multipliers | Adders |
---|---|---|---|
2 | 98 | 192 | 384 |
4 | 76 | 144 | 576 |
8 | 304 | 448 | 896 |
64 | 3,328 | 3,780 | 4,032 |

There are no more savings in arithmetic to be had by increasing the radix beyond 4. The realistic choices are between Radix-2 and Radix-4 based on the availability or cost of addition and multiplication operators.
End of Part 2
Its is always a relief when your results match the theory and common expectations.
- The number of multipliers does indeed follow a trend of \(n \log(n)\).
- The Radix-4 implementation provides an approximate 75% saving over the Radix-2 implementation in terms of complex multipliers at the expense of a 50% increase in additions.
- The Radix-8 implementation is vastly more expensive being 2.33 times the size of Radix-2 for both multiplication and addition over the size of Radix-2.
One practical aspect of the FFT implementation so far ignored completely is the scaling factor between stages. The DFT or Fourier Matrices are defined with a scaling factors of \(\frac{1}{\sqrt{n}}\) for an Radix-n FFT. So for each stage of application the results should be scaled by this amount. E.g. in the case of Radix-2 with a scaling factor of \(\frac{1}{\sqrt{2}}\), division by 2 could be done every other stage of application, which is just a single bit shift. I note that in practice the scaling of values relative to the representation is more varied. Noise might be quite flat across the spectrum where as a single frequency sine wave gives a large peak. As this is a simple practical matter which can be solved through simulation to verify there is no overflow in the number representation, it is left on the list of possible future work knowing that a solution exists and can be added when the application is known.
All that remains for the final part of this series of blogs is to address the synthesis issue and move from VHDL's real type to sfixed.
References
- Github Source Code
- Cooley-Tukey fast Fourier transform algorithm
- An Algorithm for the Machine Calculation of Complex Fourier Series, James W. Cooley and John W. Tukey - Original Paper for historical curiosity
- A DFT and FFT Tutorial
- Lecture notes: The Fast Fourier Transform; Radix–2 and Radix–4 Algorithms, Kenneth E. Barner, University of Delaware