Recursive structures in hardware description languages were not new when I first created this solution prior to 2004. This problem amused me because I wanted to create a structure in VHDL that Synplify Pro could not improve on to know that I had created something optimal. A simple example would be a recursive RAM, consuming a number of bits of address space with each level of recursion. But the point of recursion is to pipeline logic at each level to achieve a required clock speed. The example I have played with here is an \(n\)-bit comparator.
- Principles
- Basic Recursive Structure
- Balanced Tree Structures
- When LUTs have an odd number of inputs
- Coping with odd and even LUT inputs
- Preventing Logic Bloat
- Coping with odd and even LUT inputs
- Choosing The Divider
- Construction Results
- Tool Support
- Other Projects
- References
Principles
- The designer must be able to specify the delay through the instantiation in order to be able to match the delay in any parallel logic.
- The design should cope with the full range of realistic functional parameters, i.e. widths should not be constrained to a helpful subset of values like powers of 2.
- The design should be optimal in size for a given depth of pipeline requested.
- The design must be able to reach maximal clock speeds, such that the designer may trade speed for size.
- The design should behave well under non-ideal situations. For example, if more clock cycles are requested than are actually required for maximum clock speed, the pipeline should be padded efficiently and avoid pathological logic bloat.
Basic Recursive Structure
As for any recursive function, the following code covers two essential parts, terminating the recursion with RTL logic, and "calling itself" which here means instantiating multiple copies of itself but each one for a smaller part of the problem, i.e. divide and conquer the initial problem. "Smaller" here means decrementing the value of depth on each subsequent instantiation, representing a level lower in an \(n\)-ary tree. When depth = 1 we should be at a leaf node and hence the RTL is generated terminating the recursion. These recursive structures are synthesisable because their depth is known at compile time.

library ieee;
use ieee.std_logic_1164.all;
entity comparator is
generic(
depth_g : positive := 3; -- pipeline depth required
data_width_g : positive := 32; -- data bus width required for data_a and data_b.
-- number of inputs to a LUT in an FPGA, or for an ASIC, how many bits it is
-- reasonable to operate on in a single clock cycle. Can be odd number.
lutsize_g : positive range 2 to positive'high := 4
);
port(
clk : in std_ulogic;
reset : in std_ulogic;
data_a : in std_ulogic_vector(data_width_g-1 downto 0);
data_b : in std_ulogic_vector(data_width_g-1 downto 0);
equal : out std_ulogic
);
end entity;
use work.comp_pkg.all;
architecture rtl of comparator is
-- Not used in the building of the leaf node, more for visibility, and consistency with
-- the 'else' part. The testbench can extract the values from this constant.
constant rc_c : divide_item_t := recurse_divide(depth_g, data_width_g, lutsize_g);
begin
-- 'depth_g' is descending from its top level (initial) value down to 1. When depth_g = 1 we
-- terminate the recursion and implementing a 'leaf node' to implement the comparison
-- logic.
g : if depth_g = 1 generate
begin
leaf : process(clk)
begin
if rising_edge(clk) then
if reset = '1' then
equal <= '0';
else
if data_a = data_b then
equal <= '1';
else
equal <= '0';
end if;
end if;
end if;
end process;
else generate
subtype equal_t is std_ulogic_vector(0 to rc_c.divide-1);
signal equal_in : equal_t;
begin
-- Divide the bus up and recurse on smaller sections. 'recurse_divide' works out how much
-- to divide by at each level of recursion. This includes determining if there are more
-- pipeline stages than is actually required. If so, save the work until later as it saves
-- on 'area' by packing LUTs more efficiently, so just add a flop for delay and tail recurse
-- only (no division).
recurse : for i in equal_t'range generate
-- There are inconsistencies between the libraries used for simulation in ModelSim and
-- synthesis in Quartus Prime.
--
-- 'minimum' has not been implemented for 'positive' in some tools! Where it has been
-- implemented, the function's presence causes ambiguity. Helpful...
--
-- Quartus Prime:
-- Error (10482): VHDL error at comparator.vhdl(85): object "minimum" is used but not declared
-- Error: Quartus Prime Analysis & Synthesis was unsuccessful. 1 error, 0 warnings
--
-- ModelSim: ** Error: A:/Philip/Work/VHDL/Comparator/comparator.vhdl(89): Subprogram "minimum" is ambiguous.
--
-- Disambiguate with explicity call of 'work.comp_pkg.minimum'.
-- Divide up the data buses for the i'th sibling of recursion and setup the generic
-- values for the next level in the hierarchy.
constant upper_c : positive := work.comp_pkg.minimum((i+1)*rc_c.maxwidth, data_width_g);
constant buslow_c : natural := rc_c.maxwidth*i;
constant bushigh_c : natural := upper_c-1;
constant recwidth_c : positive := upper_c-buslow_c;
begin
comparator_c : entity work.comparator
generic map(
depth_g => depth_g-1,
data_width_g => recwidth_c,
lutsize_g => lutsize_g
)
port map(
clk => clk,
reset => reset,
data_a => data_a(bushigh_c downto buslow_c),
data_b => data_b(bushigh_c downto buslow_c),
equal => equal_in(i)
);
end generate;
combine : process(clk)
begin
if rising_edge(clk) then
if reset = '1' then
equal <= '0';
else
-- 'rc_c.divide' input AND gate.
if equal_in = equal_t'(others => '1') then
equal <= '1';
else
equal <= '0';
end if;
end if;
end if;
end process;
end generate;
end architecture;
The clever part of the code is hidden away in a package as it requires some explanation and is non-trivial to explain. There will also be some perceived duplication of the minimum function, but that will be covered in the section on tool support.
Balanced Tree Structures
At a macro view, our comparator can be broken down into an \(n\)-ary tree. For a tree with \(w\) bits to compare, you will immediately realise the tree depth is proportional to \(log(w)\), which will need to snap to an integer number of recursion levels or clock cycles to a pipeline depth of \(d\). As we're specifying the tree's depth we have to work out what \(n\) is, the amount by which we divide at each level of recursion. Remember that leaf nodes will compare two data buses, so will have \(2w\) bits of work to do.
\(d\) = tree depth, or number of clock cycles in pipeline
\(w\) = number of bits in each of two input vectors to compare
\(d = \big \lceil \log_n(2w) \big \rceil\)
\(n = \big \lceil \sqrt[d]{2w} \big \rceil\)
So that: \((n-1)^d < 2w \leq n^d\)

This \(n\) is larger than required as it has been rounded up, but it is at least sufficient for the job, and cannot be smaller. There's a logic size issue created by this value when spread evenly amongst the tree. Whilst this value is initially useful in creating a working recursive structure, it is not necessarily efficient in size for the function. Efficiency is achieve when the inputs to the Look Up Tables (LUTs) in an FPGA are all used, with a minimum of unused LUT inputs. As it happens the above illustration remains efficient when using 4-input LUTs, but it is not when using 6-input LUTs. The 6-input LUTs are more efficiently packed as illustrated below. This feels like a very FPGA specific problem, and more work would be required to understand what the equivalent is for an ASIC.

Now we must convert the problem to one in which LUTs are 'packed out'. Older Xilinx FPGAs pre-dating Virtex-5 had 4-input LUTs, since Virtex-5 LUTs have 6-input LUTs, and the number of inputs to a LUT now becomes a significant factor. Every few LUTs we'll insert a register to pipeline the comparison. So the maximum clock speed is achieved when there is no more than one LUT between any pair of registers when ascending the comparator tree.
Between pairs of registers we consider a "LUT tree" that can combine \(lut\_size\), or \(lut\_size^2\), or \(lut\_size^3\)… and so on inputs. Typically we're looking at a LUT tree depth of 1 or 2, so packing the number of inputs consumed into \(lut\_size\) or \(lut\_size^2\), but we never constrain our solution to these practical sizes. So now we calculate what minimum depth of LUTs is required between registers in order to cover the complete pair of buses to be compared. For \(d\) levels of pipelining of \(2w\) inputs, calculate the LUT depth for the whole tree and divide by the number of levels of pipelining that the comparison work will be spread over.
\[lut\_depth = \bigg \lceil {{log_{lut\_size}(2w)} \over d} \bigg \rceil\]This calculation leaves leaf nodes room to compare pairs of inputs. In my initial solution I failed to account for this and found unexpected results with doubling of the LUT depth in the leaf nodes, slowing the achievable maximum clock speed. Note we have not dealt with the case of an odd value of \(lut\_size\) which does add a complication for leaf nodes
When LUTs have an odd number of inputs
This is a moot point really since I am unaware of any FPGA technologies with an odd number of inputs to their LUTs. However, because an odd number can be specified by the generic and this post is mostly a toy problem anyway, I would rather understand how to make the construction work than assert a failure odd \(lut\_size\).
For an odd number of LUT inputs, at leaf nodes one LUT input cannot be used. For example, a fictitious 5-input LUT can only compare 2 pairs of bits from the input vector. We need to account for the unused input when calculating the depth of the LUT trees. The table below illustrates the adjustment ratio required.
LUT inputs, \(lut\_size\) | Pairs of input bits compared | Usage Ratio | Equation |
---|---|---|---|
4 | 2 | 2 | 2 |
5 | 2 | \(\frac {5} {2}\) | \(\frac {2 \times lut\_size} {lut\_size-1}\) |
6 | 3 | 2 | 2 |
7 | 3 | \(\frac {7} {3}\) | \(\frac {2 \times lut\_size} {lut\_size-1}\) |
The table shows us that for all even-input LUTs the ratio is bits compared to LUT inputs is 2. But for odd-input LUTs we get a variety of different ratios. For odd-input LUTs we need to use the following equation for \(lut\_depth\).
\[lut\_depth = \Bigg \lceil {{log_{lut\_size} \left( {{2w \, \times \, lut\_size} \over {lut\_size-1}} \right)} \over d} \Bigg \rceil\]Coping with odd and even LUT inputs
A simple condition can test which of the two equations to use, but we can combine them into one as follows, depending on which you consider most aesthetically pleasing. With even \(lut\_size\) then \((lut\_size \, mod \, 2) = 0\), hence the following equation reduces to using the original factor of '2'.
\[lut\_depth = \Bigg \lceil {{log_{lut\_size} \left( {{2w \, \times \, lut\_size} \over {lut\_size - (lut\_size \, \bmod \, 2)}} \right)} \over d} \Bigg \rceil\]Giving us the following VHDL function for the depth of a LUT tree:
function lut_depth(
constant depth : positive;
constant width : positive;
constant lutsize : positive := 4
) return natural is
begin
return natural(ceil(
log(
real(width * 2 * lutsize / (lutsize - (lutsize mod 2))),
real(lutsize)
) / real(depth)
));
end function;
We can then determine the amount of division at each level of recursion based on the remaining comparison width to be managed, \(w\), and the remaining number of levels, \(d\). The maximum number of inputs that can be consumed by this and remaining levels of hierarchy is given by:
\[lut\_inputs = \left(lut\_size ^ {lut\_depth}\right) ^ d\]So for the \((d-1)\) levels of hierarchy underneath we calculate their maximum comparison width (pairs of bits) as:
\[max\_width = {lut\_size^{(d-1)lut\_depth} \over 2}\]Now we know the \(max\_width\) of each level of recursion, we can calculate the division factor at this level simply by:
\[n = \Big \lceil {w \over {max\_width}} \Big \rceil\]This is only in ideal cases where the designer does not request more pipelining that is necessary. So now we must deal with another situation where the structure might realise an inefficient structure.
Preventing Logic Bloat
The LUT trees are the most important bloat prevention method that are required for any efficient implementation, but a second case exists when the number of pipeline stages requested is excessively large. We desire the padding registers to be applied to the comparator's single bit output and never to the comparator's many inputs.

The solution requires that we know before we recurse down the tree whether or not we are going to divide by anything greater than 1. Dividing by '1' just add a register for a pipeline stage without adding any logic (LUTs). The question we have to answer now is, can I realise the structure no less efficiently by doing no work now? If so, divide by 1, i.e. pad the pipeline, and recurse and perform the test again.
\[With \; ld = lut\_depth(d, w, lut\_size),\] \[ recurse\_divide = \begin{cases} w & d = 1 \\ 1 & ld = lut\_depth(d-1, w, lut\_size) \\ ld & otherwise \end{cases} \]Coping with odd and even LUT inputs
Again we need to make adjustments when we have an odd-input LUT.
Depth, \(d\) | \(max\_width\) |
---|---|
1 | \((lutsize-1) / 2\) |
2 | \(lutsize \times (lutsize-1) / 2\) |
3 | \(lutsize^2 \times (lutsize-1) / 2\) |
In order to adjust the existing equation for even-input LUTs to accommodate odd-input LUTs, the existing equation needs the subtraction of a quantity that provides the equivalent of the values in the above table. So for odd-input LUTs we restate the equation with the subtraction factor.
\[ max\_width = {{ lut\_size ^ {(d-1)lut\_depth} - lut\_size ^ {\left((d-1)lut\_depth-1\right)} } \over 2} \]Then in the VHDL we again employ the \(lut\_size \, mod \, 2\) multiplication factor to switch the subtraction factor on or off.
Choosing The Divider
For calculation efficiency, return multiple values from this function call. Otherwise the same calculations are made repeatedly and unnecessarily in the separate functions.
function recurse_divide(
constant depth : positive;
constant width : positive;
constant lutsize : positive := 4
) return divide_item_t is
variable maxwidth : positive;
variable lutdepth : natural;
begin
if depth = 1 then
return divide_item_t'(
width,
-- Leaf nodes have two input buses, so don't forget to double the logic work.
width * 2,
-- lut_depth() takes into account the doubling of width in leaf nodes.
lut_depth(1, width, lutsize)
);
else
lutdepth := lut_depth(depth, width, lutsize);
-- If delaying the work for one level of hierarchy does not adversely affect the LUT
-- depth, do no work this time.
if lutdepth = lut_depth(depth-1, width, lutsize) then
return divide_item_t'(1, width, 0);
else
maxwidth := (
-- (lutsize ** lutdepth) bits per level, for depth-1 levels
(lutsize ** (lutdepth * (depth-1))) -
-- For odd 'lutsize' only subtract the LUT inputs that can't be connected at leaf nodes.
( (lutsize mod 2) * (lutsize ** ((lutdepth * (depth-1))-1)) )
) / 2;
return divide_item_t'(
positive(ceil( real(width) / real(maxwidth) )),
maxwidth,
lutdepth
);
end if;
end if;
end function;
This concludes the derivation of the mathematical problem, and the construction of the most size efficient design for a given requested depth of pipelining. Naturally the code will be factored into a package like the one I indicate below. I'll now move on to some testing and results.
package comp_pkg is
type divide_item_t is record
divide : positive;
maxwidth : positive;
lutdepth : natural;
end record;
-- See tool support below
function minimum(a, b : positive) return positive;
function lut_depth(
constant depth : positive;
constant width : positive;
constant lutsize : positive := 4
) return natural;
function recurse_divide(
constant depth : positive;
constant width : positive;
constant lutsize : positive := 4
) return divide_item_t;
end package;
Construction Results
The test bench is able to verify the construction is as expected as well as throw walking '1' test vectors at the inputs to compare. The walking '1's are essential to prove that every bit is actually used and none are omitted. The tests use internal values from the construction to compare against expected values from quantities calculated externally to the DUT based on the generics supplied to each DUT. Multiple DUTs are constructed in parallel for testing. The test bench allows for easily adding a new DUT to an array of parameter values. Therefore you can add a test for precisely the values you need without having to test all values in all dimensions which is impractical. The following two figures partially show the results of construction for two DUTs. Full expansion of the structure is unweildy to view.



The test results are awkward to tabulate, so the following table requires an explanation.
- The test parameters for DUT x are in bold, namely Pipeline depth, Compare Width, and LUT Size. These 3-tuples are held in an array of DUT tests, and passed to the generics of each generated DUT. The table below lists the results for 11 such DUTs
- When each DUT is instantiated, the VHDL-2008 version of signal spies are used to extract the divisors and LUT depth used at each level of hierarchy down the first and most packed path. These are the figures listed in a 'sub-table' in the last four columns under "Statistics for top path of recursion of the tree where logic is most densely packed.".
- The "Max LUT Depth" value is then quite literally the maximum value from the Divide column. This is the largest LUT depth between two registers, related to the minimum clock period the design will clock at.
- The "Expected LUT depth" is calculated from the initial DUT generics, and represents an easy calculation against which to compare "Max LUT Depth" from construction. This comparison proved extremely worthwhile in verifying a variety of different generic values and found several issues of mathematical complexity.
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
---|---|---|---|---|---|
DUT | 0 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 2 | 2 | 3 | 8 | 2 |
Compare Width | 23 | 1 | 8 | 16 | 2 |
LUT Size | 4 | ||||
Max LUT Depth | 2 | ||||
Expected LUT depth | 2 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 1 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 5 | 5 | 1 | 49 | 0 |
Compare Width | 49 | 4 | 1 | 49 | 0 |
LUT Size | 6 | 3 | 3 | 18 | 1 |
Max LUT Depth | 1 | 2 | 6 | 3 | 1 |
Expected LUT depth | 1 | 1 | 3 | 6 | 1 |
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 2 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 3 | 3 | 6 | 18 | 1 |
Compare Width | 101 | 2 | 6 | 3 | 1 |
LUT Size | 6 | 1 | 3 | 6 | 1 |
Max LUT Depth | 1 | ||||
Expected LUT depth | 1 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 3 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 2 | 2 | 14 | 9 | 3 |
Compare Width | 125 | 1 | 9 | 18 | 3 |
LUT Size | 3 | ||||
Max LUT Depth | 3 | ||||
Expected LUT depth | 3 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 4 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 3 | 3 | 1 | 50 | 0 |
Compare Width | 50 | 2 | 5 | 10 | 2 |
LUT Size | 5 | 1 | 10 | 20 | 2 |
Max LUT Depth | 2 | ||||
Expected LUT depth | 2 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 5 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 2 | 2 | 8 | 32 | 3 |
Compare Width | 237 | 1 | 32 | 64 | 3 |
LUT Size | 4 | ||||
Max LUT Depth | 3 | ||||
Expected LUT depth | 3 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 6 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 3 | 3 | 3 | 648 | 2 |
Compare Width | 1445 | 2 | 36 | 18 | 2 |
LUT Size | 6 | 1 | 18 | 36 | 2 |
Max LUT Depth | 2 | ||||
Expected LUT depth | 2 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 7 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 3 | 3 | 6 | 250 | 2 |
Compare Width | 1445 | 2 | 25 | 10 | 2 |
LUT Size | 5 | 1 | 10 | 20 | 2 |
Max LUT Depth | 2 | ||||
Expected LUT depth | 2 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 8 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 6 | 6 | 3 | 512 | 1 |
Compare Width | 1445 | 5 | 4 | 128 | 1 |
LUT Size | 4 | 4 | 4 | 32 | 1 |
Max LUT Depth | 1 | 3 | 4 | 8 | 1 |
Expected LUT depth | 1 | 2 | 4 | 2 | 1 |
1 | 2 | 4 | 1 | ||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 9 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 3 | 3 | 10 | 2048 | 3 |
Compare Width | 20000 | 2 | 64 | 32 | 3 |
LUT Size | 4 | 1 | 32 | 64 | 3 |
Max LUT Depth | 3 | ||||
Expected LUT depth | 3 | ||||
Test Parameters | Statistics for top path of recursion of the tree where logic is most densely packed. | ||||
DUT | 10 | Depth | Divide | Max Width | LUT Depth |
Pipeline depth | 2 | 2 | 157 | 128 | 4 |
Compare Width | 20000 | 1 | 128 | 256 | 4 |
LUT Size | 4 | ||||
Max LUT Depth | 4 | ||||
Expected LUT depth | 4 |
For each of the 11 DUTs we can verify construction by "Expected LUT depth" equals "Max LUT Depth". Originally this was tested with Synplify Pro 7.x way back before 2004 (actually they fixed their software when I pointed out these were synthesisable RTL structures), and there were only highly specialist optimisations that could improve the structure that I could not readily predict.
Tool Support
Tool | Compatibility | Version Used Here |
---|---|---|
Synplify Pro | Yes | - |
ModelSim | Yes | INTEL FPGA STARTER EDITION 10.5b, Revision: 2016.10, Date: Oct 5 2016 |
Quartus Prime | Yes | Ver 18.1.1 Build 646 04/11/2019 |
Vivado | No | Version 2019.1.1 (64-bit) |
ERROR: [Synth 8-317] illegal recursive instantiation of design 'comparator' [.../comparator.vhdl:xxx] ERROR: [Synth 8-285] failed synthesizing module 'comparator' [.../comparator.vhdl:xx]
Xilinx Vivado only supports tail recursion, which is really just iteration, it does not support n-ary trees which is the more general type of recursion. Use Quartus Prime or Synplify Pro instead.
There are inconsistencies between the libraries used for simulation in ModelSim and synthesis in Quartus Prime. 'minimum' has not been implemented for 'positive' in some tools! Where it has been implemented, the function's presence causes ambiguity. Hence to avoid ambiguity I implemented one locally to make sure.
Error (10482): VHDL error at comparator.vhdl(85): object "minimum" is used but not declared Error: Quartus Prime Analysis & Synthesis was unsuccessful. 1 error, 0 warnings
ModelSim: ** Error: A:/Philip/Work/VHDL/Comparator/comparator.vhdl(89): Subprogram "minimum" is ambiguous. Disambiguate with explicity call of 'work.comp_pkg.minimum'.
Vivado's lack of support for recursion is a frustration, but I'm asked "exactly how often do you actually need it?" Well it spoils the fun.
Other Projects
- Large RAM, e.g. knock off n-bits of address at each level of hierarchy and pipeline the demultiplexing of data buses. There are likely to be better ways of doing this in the last few years.
- Large demultiplexer, e.g. knock off n-bits of selection at each level of hierarchy. It required a different approach, but I no longer have the code. Potential use in extracting consecutive characters from a long string at a random offset.
- Adder Tree, e.g. for a FIR filter in DSP.