I've often wondered how pleasing it would be to provide a generic FIR Filter with just a few configuration parameters and have the whole thing spring to life. Yes, much like the Xilinx IP Catalogue might do, but only purely with VHDL so that parameters can be changed really easily. The Transpose and Systolic architectures provide a simple iterative approach. Here I wanted to solve the problem of how to construct a pipelined adder tree so that the clock speed was not diminished by a large number of additions being done in a single clock cycle. This is another toy problem like the Large Comparators Pipelined Efficiently by Recursion, with the additional challenge that as you add values together you need more bits to hold the result. Therefore the number of bits of the sum at the output will be larger than the number of bits required by any input operand. The design will need to track that growth in partial results all the way up the tree to the output.
- FIR Filter Application
- Principles
- Non-pipelined Balanced Adder Tree
- Code Structure
- Pipelined Balanced Adder Tree
- Code Structure
- Synthesised Structure
- Preventing Logic Bloat
- Choosing the Divider
- Construction Results
- Tool Support
- Conclusions
- References
FIR Filter Application
Nothing special here except how the VHDL code is set up so that a filter can be specified.

type input_arr_t is array (natural range <>) of signed;
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.adder_tree_pkg.all;
entity fir_filter is
generic (
coeffs_g : input_arr_t;
input_width_g : positive
);
port (
clk : in std_logic;
reset : in std_logic;
data_in : in signed(input_width_g-1 downto 0);
-- data_in'length can't be accessed at this point, so use an extra generic 'input_width_g'.
-- ** Error: ...\Adder_Tree\fir_filter.vhdl(15): (vcom-1396) Object "data_in" cannot appear within the same interface list in which it is declared.
data_out : out signed(output_bits(input_width_g + coeffs_g(0)'length, coeffs_g'length)-1 downto 0)
);
end entity;
architecture rtl of fir_filter is
signal data_reg : input_arr_t(coeffs_g'range)(data_in'range);
signal mult_arr : input_arr_t(coeffs_g'range)(input_width_g + coeffs_g(0)'length - 1 downto 0);
begin
process(clk)
begin
if rising_edge(clk) then
if reset = '1' then
data_reg <= (others => (others => '0'));
mult_arr <= (others => (others => '0'));
else
data_reg <= data_in & data_reg(0 to data_reg'high-1);
for i in coeffs_g'range loop
mult_arr(i) <= data_reg(i) * coeffs_g(i);
end loop;
end if;
end if;
end process;
adder_tree_pipe_i : entity work.adder_tree_pipe
generic map (
depth_g => ceil_log(coeffs_g'length, 2), -- Best attempt at a binary tree with single adder between register stages
num_coeffs_g => coeffs_g'length,
input_width_g => mult_arr(0)'length
)
port map (
clk => clk,
reset => reset,
i => mult_arr,
o => data_out
);
end architecture;
The coeffs generic is an array of signed vectors. The array length defines how many operands are summed by the adder tree. The input_width_g generic is the width of the signed values in the incoming data stream. This was specified rather than taking it from the length of an unconstrained data_in vector (i.e. type std_logic_vector with no bounds specified) because I also wanted to constrain the size of the output vector in data_out. A function is used to calculate the upper bound of the output for conciseness. Here it is not possible to substitute data_in'length for the generic name due to compiler complaints.
Object "data_in" cannot appear within the same interface list in which it is declared.
The implementation is a simple shift register for data written in that is the length of the array of constant coefficients, with as many multipliers. The results from the multipliers then feed the base of the adder tree, with results percolating up the pipeline to the adder tree's output taking one cycle longer than the pipelined depth of the adder tree. So all I need now is the self-assembling, space efficient adder tree pipelined to my choice of delay so that I can titrate the functional clock delay with clock speed.
Principles
- The designer must be able to specify the delay through the instantiation in order to be able to match the delay in any parallel logic.
- The design should cope with the full range of realistic functional parameters, i.e. widths should not be constrained to a helpful subset of values like powers of 2.
- The design should be optimal in size for a given depth of pipeline requested.
- The design must be able to reach maximal clock speeds, such that the designer may trade speed for size.
- The design should behave well under non-ideal situations. For example, if more clock cycles are requested than are actually required for maximum clock speed, the pipeline should be padded efficiently and avoid pathological logic bloat.
Non-pipelined Balanced Adder Tree
Building a balanced asynchronous adder tree requires recursion of its own. Hence this is an instructive sub-problem worth some attention. It also turns out that it is necessary component for the way I wish to solve the pipelined adder tree too.
Code Structure
Its obvious that as adders have two inputs, the most speedy implementation will add two operands between each pair of registers. This however might be over kill for small width operands and it will force a minimum depth of tree. Therefore I want to leave the choice of how much work to do between each register to the designer instantiating the adder tree. Hence I need a component that will add one or more operands (yes I mean as few as one) in a single (hierarchical) component.
library ieee;
use ieee.numeric_std.all;
use work.adder_tree_pkg.all;
-- An entity with:
-- i : in input_arr_t(open)(input_width_g-1 downto 0);
-- Broke the vcom compiler. Using fully constrained arrays which will aid length checking anyway.
entity adder_tree is
generic (
num_operands_g : positive; -- ModelSim is struggling with an unconstrained array of constrained vectors
input_width_g : positive := 18
);
port (
i : in input_arr_t(0 to num_operands_g-1)(input_width_g-1 downto 0);
o : out signed(output_bits(input_width_g, num_operands_g)-1 downto 0)
);
end entity;
architecture rtl of adder_tree is
begin
recurse_g : if num_operands_g = 1 generate
-- End recursion
output1: o <= i(0); -- No resize as log2(1) = 0 additional bits on the output.
elsif num_operands_g = 2 generate
output2 : o <= resize(i(0), o'length) + resize(i(1), o'length);
elsif num_operands_g = 3 generate
signal o1 : signed(output_bits(input_width_g, 2)-1 downto 0);
begin
adder_i : entity work.adder_tree
generic map (
num_operands_g => 2,
input_width_g => input_width_g
)
port map (
i => i(0 to 1),
o => o1
);
output3 : o <= resize(o1, o'length) + resize(i(2), o'length);
else generate
constant num_coeffs1_c : natural := first_adder_operands(num_operands_g);
signal o1 : signed(output_bits(input_width_g, num_coeffs1_c)-1 downto 0);
signal o2 : signed(output_bits(input_width_g, num_operands_g - num_coeffs1_c)-1 downto 0);
begin
adder1_i : entity work.adder_tree
generic map (
num_operands_g => num_coeffs1_c,
input_width_g => input_width_g
)
port map (
i => i(0 to num_coeffs1_c-1),
o => o1
);
adder2_i : entity work.adder_tree
generic map (
num_operands_g => (num_operands_g - num_coeffs1_c),
input_width_g => input_width_g
)
port map (
i => i(num_coeffs1_c to num_operands_g-1),
o => o2
);
output4plus : o <= resize(o1, o'length) + resize(o2, o'length);
end generate;
end architecture;
This code uses a local package for some specialist functions. For example output_bits() calculates the size of the output vector required for a number of operands, num_coeffs_g, each of size input_width_g. Essentially the number of output bits grows by \(\log(number\ of\ operands)\), but this has to be rounded up.
function ceil_log(
constant v : positive;
constant b : positive
) return natural is
begin
return natural(ceil(log(real(v), real(b))));
end function;
function output_bits(
constant input_width : positive;
constant num_coeffs : positive
) return positive is
begin
return input_width + ceil_log(num_coeffs, 2);
end function;
In case you've not seen this before, the growth in the sum vectors is worth a little explanation. When summing two 8-bit inputs, the output will be 9 bits, one bit longer. Essentially adding two 8 bit values is equivalent to multiplication by 2, or a bit shift. Or if we add two values in the range -27:(27-1) we create a value in the range -28:(28-2) which requires a 9-bit signed vector. Now if we specify the expected output width on the outer component that we instantiate, everything inside must be created to match no matter how deep the recursion. Resizing the signed operands in VHDL is trivial with the ieee.numeric_std.resize() function, which pads values to a required width for each adder output in order to match the sizes correctly, and includes sign bit extension too.
first_adder_operands() calculates the number of operands to consume with the first adder in the binary recursion. The second recursion will use what is left over. These two recursions form the added pair of sub-sums at this level. Special cases are made for a single operand which is a wire and terminate recursion, two operands which add the pair and terminate recursion, and three operands which only recurses on one adder input before summing. The output assignments are labelled so that they can be identified clearly in ModelSim's hierarchy viewer. In this way, the non-pipelined adder tree takes care of the constraint of binary adder operators (logic primitives or derived logic) and provides a simple abstraction of addition of multiple operands.
function first_adder_operands(
constant num_coeffs : positive
) return positive is
begin
return positive(ceil(real(num_coeffs) / 2.0));
end function;
As there is no pipelining involved, there is no essential need to match delays on each branch of the tree. The main desire is to keep all branches of a similar length such that we achieve the minimum propagation delay possible through the adder tree. The first_adder_operands() function essentially divides the number of operands by 2, but must gracefully handle odd numbers, so it converts the numerator and denominator of the division to real values in order to round up (ceil()) the result to ensure that the first half split is the larger for odd values. At this point in the explanation, it would not matter if the value was rounded down and hence the function could use integer maths, as that would make the second half larger for odd values. But the function will be re-used later for the pipelined adder where the rounding direction is important for divisors larger than 2.


Pipelined Balanced Adder Tree
This requires the construction of an n-ary tree. The re-use of the adder_tree component means that we can allow the designer to specify the pipeline depth without restrictions. If a relatively small depth is chosen for the specified number for operands, then more than a single addition will be performed at each level of recursion. If a relatively large value of pipeline depth is chosen, then we need to pad the functional timing.


The above illustrates a construction of a pipelined adder tree showing the structure for a depth of 3 pipeline stages and 13 inputs. The first stage division gives recursion on 7 & 6 operands. The 6 side is then divided into 3 & 3, whilst the 7 side is divided into 3, 3 & 1. Note the bottom leaf node is missing an adder tree as its just a simple delay. In this case, it is not possible to come up with a configuration without the simple delay in the leaf node without increasing the division of work at any stage above to more than 3.
Code Structure
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.adder_tree_pkg.all;
-- An entity with:
-- i : in input_arr_t(open)(input_width_g-1 downto 0);
-- Broke the vcom compiler. Using fully constrained arrays which will aid length checking anyway.
entity adder_tree_pipe is
generic (
depth_g : positive;
num_operands_g : positive; -- ModelSim is struggling with an unconstrained array of constrained vectors
input_width_g : positive := 18
);
port (
clk : in std_logic;
reset : in std_logic;
i : in input_arr_t(0 to num_operands_g-1)(input_width_g-1 downto 0);
o : out signed(output_bits(input_width_g, num_operands_g)-1 downto 0)
);
end entity;
library ieee;
use ieee.math_real.all;
architecture rtl of adder_tree_pipe is
-- For debug only, extracted by VHDL-2008 signal spies.
constant output_width_c : positive := o'length;
constant divide_c : positive := recurse_divide(num_operands_g, depth_g);
constant part_length_c : positive := positive(ceil(real(num_operands_g) / real(divide_c)));
constant sum_input_bits_c : positive := output_bits(input_width_g, part_length_c);
signal sum : input_arr_t(0 to divide_c-1)(sum_input_bits_c-1 downto 0);
signal lsum : signed(output_bits(sum_input_bits_c, divide_c)-1 downto 0);
begin
recurse_g : if depth_g > 1 generate
-- Recurse
divide_g : for l in 0 to divide_c-1 generate
constant ilow_c : natural := l * part_length_c;
-- Remaining coefficients might be less than 'part_length_c'
constant ihigh_c : natural := work.adder_tree_pkg.minimum(num_operands_g, ilow_c + part_length_c)-1;
constant output_bits_c : positive := output_bits(input_width_g, (ihigh_c - ilow_c + 1));
begin
adder_tree_pipe_i : entity work.adder_tree_pipe
generic map (
depth_g => depth_g-1,
num_operands_g => (ihigh_c - ilow_c + 1),
input_width_g => input_width_g
)
port map (
clk => clk,
reset => reset,
i => i(ilow_c to ihigh_c),
o => sum(l)(output_bits_c-1 downto 0)
);
msbs_g : if output_bits_c < sum_input_bits_c generate
-- Must sign extend, might be multiple bits
sum(l)(sum_input_bits_c-1 downto output_bits_c) <= (others => sum(l)(output_bits_c-1));
end generate;
end generate;
else generate
-- Terminate recursion, just pass values directly to the non-pipelined adder.
bypass : sum <= i;
end generate;
adder_g : if divide_c > 1 generate
adder_i : entity work.adder_tree
generic map (
num_operands_g => divide_c,
input_width_g => sum_input_bits_c
)
port map (
i => sum,
o => lsum
);
else generate
bypass : lsum <= sum(0);
end generate;
-- Pipeline the adder tree here, before returning from recursion
reg_output : process(clk)
begin
if rising_edge(clk) then
if reset = '1' then
-- Reset should be replaced by GSR initialisation for FPGA.
o <= (others => '0');
else
-- LHS is (output_bits(input_width_g, num_operands_g)-1 downto 0) due to excessive bit growth, so chop extra bit off MSB end.
o <= lsum(o'range);
end if;
end if;
end process;
end architecture;
divide_c is the number of branches required at this stage of recursion, the method is detailed below once the logic bloat problem is solved. The number of operands are then divided up amongst the divide_c recursion stages. Here it is important to have rounded non-integer quotients up rather than down. For example, if you had 40 outputs and divided by 3 with rounding down, you would service 13 + 13 +13 operands and be left with one operand excluded. By rounding 13.33 up you would service 14 + 14 +12. An alternative solution of 14 + 13 +13 ought to be obtainable by re-calculating the division after removing the first 14 operands, but that requires additional complexity for no gain (except perhaps for a more aesthetically pleasing construction). part_length_c is the first (or maximum) of these operand widths used to assign operands from the input to the recursion inputs.
We also have to deal with the growing output vector width in the face of an unbalanced tree. Add to this problem, the constraint that a VHDL array of vectors must have equally sized vectors. So in the case that we split our recursion work in the ratio 3:3:1, the first two recursed trees will grow their outputs by \(\lceil\log_2(3)\rceil = 2\) bits whilst the last recursed tree (as straight copy and delay) will not increase the size of the output. The msbs_g clause is used to test if there is a mis-match in size, and here it is important to propagate the sign bit into the gap which may be more than one bit as just illustrated.
Synthesised Structure
The clock speed for the realised structures will be so variable on operand width, device or technology and design congestion that the main unit consided for measuring design speed will be a simple logic depth, the maximum depth of adders between a pair of registers. Fewer adders means faster clock speeds. No fmax (maximum clock frequency) here. Optimum tree depth must scale with the \(\log()\) of the number of operands being summed. The illustration below has kept the contents of the non-pipelined adder trees closed to reduce clutter. Examples of adder tree construction are illustrated above.

Preventing Logic Bloat
When padding the functional timing to meet an excessive pipeline depth we need to be efficient. The problem is illustrated below. Registers must be applied to the single output, not to the many inputs for utilisation efficiency. The decision on whether to pad must be made prior to recursing, so it will be too late to reach the inputs and realise that the work has run out.

Choosing the Divider
To prevent this logic bloat problem we must calculate if we can achieve the same amount of work with a smaller division at this level of recursion. If so, do less work now (less division) so that the tree is bottom heavy with work. The recurse_divide() function therefore works out the inescapable maximum division across the whole tree, and then tests if a smaller division now is possible without increasing the division required later. Remember the division is a proxy for adder depth, \(adder\_depth = \lceil\log_2(divide\_c)\rceil\), but more convenient to use. Hence the for..loop in the function below. If the depth is now 1, there's no option so just return the remaining division to be done.
function recurse_divide(
constant num_coeffs : positive;
constant depth : positive
) return positive is
variable divide : positive := ceil_root(num_coeffs, depth);
variable ncl : positive;
begin
if depth > 1 then
-- Try to divide by less the 'divide'
for i in 1 to divide-1 loop
-- Divide by 'i' now and do the remainder 'ncl' in 'depth-1' levels.
ncl := positive(ceil(real(num_coeffs) / real(i)));
if ceil_root(ncl, depth-1) = divide then
-- Trial division found a better answer
return i;
end if;
end loop;
-- No better answer
return divide;
else
-- No depth left to play with, must divide.
return divide;
end if;
end function;
A subtle difference between the pipelined adder tree and the pipelined comparator is in how the amount of division for recursion is chosen. With the comparator, once a (6 input) LUT was required at all, it was most efficient in LUT size to pack it out as the utilisation cost had already been paid. That meant picking off a "power of \(lut\_size\)" (such as 1, 6, 36 etc) bits to pack out the LUTs' usage of inputs. It was worth making sure the division was not just bottom heavy to prevent unnecessary registers, but also side heavy to pack the LUTs to use as many inputs as had been made available in the LUT trees. With adders, either you use both inputs or you have no adder at all. There is no utilisation efficiency to be gained by packing the operands consumed at a level of recursion to a power if 2 (e.g. 1, 2, 4, 8 etc) like it was with LUTs being packed to a "power of \(lut\_size\)". Therefore the we can use simple division with no penalty instead of division by a power.
This concludes the derivation of the mathematical problem, and the construction of the most size efficient design for a given requested depth of pipelining. Naturally the code will be factored into a package like the one I indicate below. I'll now move on to some testing and results.
library ieee;
use ieee.numeric_std.all;
package adder_tree_pkg is
-- Defaults to "to" range
type input_arr_t is array (natural range <>) of signed;
-- Return the minimum of 'a' and 'b'.
--
-- This function has not been implemented for 'positive' in some tools! Where it
-- has been implemented, the function's presence causes ambiguity. Helpful...
--
-- Quartus Prime:
-- Error (10482): VHDL error at file.vhdl(xx): object "minimum" is used but not declared
-- Error: Quartus Prime Analysis & Synthesis was unsuccessful. 1 error, 0 warnings
--
-- ModelSim: ** Error: file.vhdl(xx): Subprogram "minimum" is ambiguous.
--
-- Therefore qualification by full path name might be required,
-- e.g. 'local.math_pkg.minimum(..)'.
--
-- Usage:
-- constant min : positive := minimum(4, width_g);
--
function minimum(a, b : positive) return positive;
-- Return the ceil(log(n, base)), i.e. round up the result of log(n, b), log to
-- base b of v.
--
-- Made public as it can be used to configure the pipelined adder tree depth.
-- E.g.
-- adder_tree_pipe_i : entity work.adder_tree_pipe
-- generic map (
-- depth_g => ceil_log(coeffs'length, 2),
-- num_operands_g => coeffs'length,
-- input_width_g => mult_arr(0)'length
-- )...
--
-- Example results:
--
-- v b log_b(v) Return(r) b^r
-- ---------------------------------
-- 4 2 2.00 2 4
-- 7 2 2.81 3 8
-- 13 2 3.70 4 16
-- 8 3 1.89 2 9
-- 9 3 2.00 2 9
-- 26 3 2.97 3 27
-- 27 3 3.00 3 27
-- 28 3 3.03 4 81
-- 16 4 2.00 2 16
-- 17 4 2.04 3 64
--
function ceil_log(
constant v : positive;
constant b : positive
) return positive;
-- Return the ceil(root'th root of n), i.e. round up the result of taking the root.
-- This version only allows positive integer roots.
--
-- Made public as it can be used to verify construction in the test bench.
--
-- b'th root of v
--
-- Example results:
--
-- v b root Return(r) r^b
-- ------------------------------
-- 4 2 2.00 2 4
-- 8 3 2.00 2 8
-- 26 3 2.96 3 27
-- 27 3 3.00 3 27
-- 28 3 3.04 4 64
-- 40 3 3.42 4 64
-- 40 4 2.51 3 81
-- 40 5 2.09 3 243
-- 40 6 1.85 2 64
-- 80 5 2.40 3 243
--
function ceil_root(
constant v : positive;
constant b : positive
) return positive;
-- The amount to divide and recurse on now for a minimum depth of logic between registers.
--
-- Parameters:
-- * num_operands - The number of coefficients left to service.
-- * depth - The pipeline depth left in which to service the ocefficients.
--
-- The basic calculation provides a macro value for the remaining amount of work over all depths.
-- Then a test is done to see if doing less work now does not increase the adder depth later. If
-- not, do less now and ensure the tree is bottom heavy for the purposes of minimising logic
-- utilisation.
--
function recurse_divide(
constant num_operands : positive;
constant depth : positive
) return positive;
-- Given an adder with a pair of input operands, calculate the number of coefficients to recurse with on the first input, with the remainder recurse on for the second input. The first half will be the larger half for an odd number of coefficients.
function first_adder_operands(
constant num_operands : positive
) return positive;
function output_bits(
constant input_width : positive;
constant num_operands : positive
) return positive;
function to_input_arr_t (
i : integer_vector;
w : positive
) return input_arr_t;
function reverse(i : input_arr_t) return input_arr_t;
function calc_sum_width(
c : input_arr_t;
input_width : positive
) return integer_vector;
end package;
library ieee;
use ieee.math_real.all;
package body adder_tree_pkg is
-- Not been implemented for 'positive' in some tools! Where it has been implemented, the function's
-- presence causes ambiguity. Helpful...
--
-- Quartus Prime:
-- Error (10482): VHDL error at comparator.vhdl(85): object "minimum" is used but not declared
-- Error: Quartus Prime Analysis & Synthesis was unsuccessful. 1 error, 0 warnings
--
-- ModelSim: ** Error: A:/Philip/Work/VHDL/Comparator/comparator.vhdl(89): Subprogram "minimum" is ambiguous.
--
function minimum(a, b : positive) return positive is
begin
if a < b then
return a;
else
return b;
end if;
end function;
function ceil_log(
constant v : positive;
constant b : positive
) return natural is
begin
return natural(ceil(log(real(v), real(b))));
end function;
function ceil_root(
constant v : positive;
constant b : positive
) return positive is
variable ret : positive := 1;
begin
while ret**b < v loop
ret := ret + 1;
end loop;
return ret; -- ceil(v**(1/b))
end function;
function recurse_divide(
constant num_operands : positive;
constant depth : positive
) return positive is
-- Warning (10542): VHDL Variable Declaration warning at adder_tree_pkg.vhdl(145): used initial value expression
-- for variable "divide" because variable was never assigned a value
-- Variable assignment moved below to solve this Quartus Prime warning.
variable divide : positive; -- := ceil_root(num_operands, depth);
variable ncl : positive;
begin
divide := ceil_root(num_operands, depth);
if depth > 1 then
-- Try to divide by less the 'divide'
for i in 1 to divide-1 loop
-- Divide by 'i' now and do the remainder 'ncl' in 'depth-1' levels.
ncl := positive(ceil(real(num_operands) / real(i)));
if ceil_root(ncl, depth-1) = divide then
-- Trial division found a better answer
return i;
end if;
end loop;
-- No better answer
return divide;
else
-- No depth left to play with, must divide.
return divide;
end if;
end function;
-- Use ceil not floor otherwise not all bits will get used
function first_adder_operands(
constant num_operands : positive
) return positive is
begin
return positive(ceil(real(num_operands) / 2.0));
end function;
function output_bits(
constant input_width : positive;
constant num_operands : positive
) return positive is
begin
return input_width + ceil_log(num_operands, 2);
end function;
function to_input_arr_t (
i : integer_vector;
w : positive
) return input_arr_t is
variable ret : input_arr_t(i'range)(w-1 downto 0);
begin
for j in i'range loop
ret(j) := to_signed(i(j), w);
end loop;
return ret;
end function;
function reverse(i : input_arr_t) return input_arr_t is
variable ret : input_arr_t(i'range)(i(0)'range);
begin
for j in i'range loop
ret(j) := i(i'high - j);
end loop;
return ret;
end reverse;
function calc_sum_width(
c : input_arr_t;
input_width : positive
) return integer_vector is
variable ret : integer_vector(c'range);
begin
for i in c'range loop
ret(i) := output_bits(input_width + c(0)'length, c'length-i);
end loop;
return ret;
end function;
end package body;
Construction Results
Modelsim's hierarchy viewer provides a good illustration of how these components are constructed. There's an overall recursive structure for the pipelining, and then individual recursive non-pipelined adders sprinkled throughout. The first figure below shows a smaller example (DUT 2) fully expanded to include the recursion in the adders under an instantiation adder_i. Typically we are only dealing with a relatively small number of operands to each non-pipelined adder. As the depth becomes small relative to the number of operands, these non-pipelined adders get larger. The second figure shows a larger example (DUT 6) with the pipelined adder tree fully expanded but with the non-pipelined adders contracted.


The test results are awkward to tabulate, so the following table requires an explanation.
- The test parameters for DUT x are in bold, namely Pipeline Depth, Operands, and Input Width. These 3-tuples are held in an array of DUT tests, and passed to the generics of each generated DUT. The table below lists the results for 9 such DUTs
- When each DUT is instantiated, the VHDL-2008 version of signal spies are used to extract the divisors, number of operands (num_coeffs) and output width used at each level of hierarchy down the first and most packed path. These are the figures listed in a 'sub-table' in the last four columns under "Statistics for top path of recursion of the tree where logic is most densely packed.".
- The "Expected Division" is calculated from the initial DUT generics, and represents an easy calculation against which to compare "Maximum Division" from construction.
- The "Maximum Division" value is then quite literally the maximum value from the Divide column. This is the largest number of recursions along the examined path, related to the adder depth and hence the minimum clock period the design will clock at. If this value exceeds the "Expected Division" value then construction has failed.
Hence the test bench is also able to verify the construction is as expected by ensuring:
- no level of recursion divides by more than expected, and
- that a level of hierarchy divides the same or more than the level above.
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
---|---|---|---|---|---|
DUT | 0 | Depth | Operands | Divide | Output Width |
Pipeline Depth | 1 | 1 | 2 | 2 | 9 |
Operands | 2 | ||||
Input Width | 8 | ||||
Expected Division | 2 | ||||
Maximum Division | 2 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 1 | Depth | Operands | Divide | Output Width |
Pipeline Depth | 2 | 2 | 2 | 1 | 9 |
Operands | 2 | 1 | 2 | 2 | 9 |
Input Width | 8 | ||||
Expected Division | 2 | ||||
Maximum Division | 2 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 2 | Depth | Operands | Divide | Output Width |
Pipeline Depth | 2 | 2 | 3 | 2 | 11 |
Operands | 3 | 1 | 2 | 2 | 10 |
Input Width | 9 | ||||
Expected Division | 2 | ||||
Maximum Division | 2 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 3 | Depth | Operands | Divide | Output Width |
Pipeline Depth | 2 | 2 | 4 | 2 | 12 |
Operands | 4 | 1 | 2 | 2 | 11 |
Input Width | 10 | ||||
Expected Division | 2 | ||||
Maximum Division | 2 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 4 | Depth | Operands | Divide | Output Width |
Pipeline Depth | 5 | 5 | 5 | 1 | 14 |
Operands | 5 | 4 | 5 | 1 | 14 |
Input Width | 11 | 3 | 5 | 2 | 14 |
Expected Division | 2 | 2 | 3 | 2 | 13 |
Maximum Division | 2 | 1 | 2 | 2 | 12 |
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 5 | Depth | Operands | Divide | Output Width |
Pipeline Depth | 2 | 2 | 6 | 2 | 15 |
Operands | 6 | 1 | 3 | 3 | 14 |
Input Width | 12 | ||||
Expected Division | 3 | ||||
Maximum Division | 3 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 6 | Depth | Operands | Divide | Output Width |
Pipeline Depth | 3 | 3 | 7 | 2 | 16 |
Operands | 7 | 2 | 4 | 2 | 15 |
Input Width | 13 | 1 | 2 | 2 | 14 |
Expected Division | 2 | ||||
Maximum Division | 2 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 7 | Depth | Operands | Divide | Output Width |
Pipeline Depth | 4 | 4 | 40 | 2 | 14 |
Operands | 40 | 3 | 20 | 3 | 13 |
Input Width | 8 | 2 | 7 | 3 | 11 |
Expected Division | 3 | 1 | 3 | 3 | 10 |
Maximum Division | 3 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 8 | Depth | Operands | Divide | Output Width |
Pipeline Depth | 3 | 3 | 80 | 4 | 15 |
Operands | 80 | 2 | 20 | 4 | 13 |
Input Width | 8 | 1 | 5 | 5 | 11 |
Expected Division | 5 | ||||
Maximum Division | 5 |
Tool Support
Tool | Compatibility | Version Used Here |
---|---|---|
Synplify Pro | Yes | - |
ModelSim | Yes | INTEL FPGA STARTER EDITION 10.5b, Revision: 2016.10, Date: Oct 5 2016 |
Quartus Prime | Yes | Ver 18.1.1 Build 646 04/11/2019 |
Vivado | No | Version 2019.1.1 (64-bit) |
Fuller details about tool support are already provided on Large Comparators Pipelined Efficiently by Recursion, explaining Vivado's limits, and a Quartus Prime annoyance coded around here with the minimum() function.
Conclusions
So why not just use Vivado's IP Catalogue for FIR Filters? Because its no fun. This was intended as a toy problem to be solved in VHDL. It would need a variety of different implementations in order to exploit symmetric coefficient sets and half band filters fully. So ultimately I will stick to the IP Catalogue, not least because Vivado does not fully or sufficiently support recursive VHDL. Whilst Vivado does support tail recursion that is really just iterative. So whilst the answer here is limited in applicability by some tools, if anyone asks if it is possible to build an efficient and pipelined adder tree solely in VHDL, then the answer is "yes"!