There's no good reason for doing this really, other than:
- Its just interesting to play with the challenge. And as usual it presented an issue that needed to be solved.
- Its the notably absent component in the genre of problems solved by recursion.
- You never know when you might need something like this...
- Synthesied Structures
- Results
- Naïve Method
- Early Work Method
- Maximum LUT Efficiency
- Predicted Results
- Conclusions
- References
Synthesied Structures
The code is too involved to be usefully provided here, instead see the GitHub source tree. The following elaborated design demonstrates the recursive implementation dividing the problem at each stage and instantiating a smaller version of itself multiple times until all the selection bits are used. Each level of recursion ends with a register for a stage of pipelining. The division of work here is the same as the Barrel Shift component. As usual, the case of having more clock cycles than is really needed must also be catered for. The elaboration results are taken from Intel's Quartus Prime since Xilinx's Vivado cannot cope. Only now, the VHDL code has to be churned as the free edition of Quartus Prime does not really support VHDL-2008.

The challenge thrown up by this component was the construction of the delays on the selection bits. As each level of selection in the tree peels off just the bits it needs for the current stage of the pipeline, it passes unused bits down the tree. These selection bits need to be delayed to match the data path. If the delay was performed inside the multiplexer tree, the selection bits would be duplicated at each level of recursion. This means the selection bit delays are best done separately and hence only once. Hence, the top level of the multiplexer instantiates two recursive components. The first is the multiplexer tree performing the selection, which assumes the selection bits are now appropriately delayed. The second component is a tail recursive delay that has to match the functional timing in which bits are required to be used.

These delays are in the opposite order to the multiplexer stages. So, if the top level of the multiplexer tree selects on the top bit(s) of sel after all the remaining recursion, the most significant selection bits must be delayed the most, and the least significant selection bits delayed the least, or not at all. The indexes of the delayed selection bits are now taken in the order of work from the bottom of the tree to the top. This means knowing how the tree will be constructed in full. Functions can be constructed to provide these bit indexes as an array, which is then shortened by one element with each tail recursive instantiation of the selection delays (mux_sel).

Here are two examples of the full multiplexer component construction by recursion.

Based on a left to right traverse across the design, the selection bits are used in the order sel(0), sel(1) and sel(3:2), but the recursion actually goes right to left and hence rips the top two bits first, the recurses 4 times. In the top path the next level of recursion uses 1-bit of selection (sel(1)) and recurses twice for the last bit of selection on the least significant bit that selects one of two inputs without recursion. This means the selection bit are ripped off and delayed in the sequence 1, 1, 2. We need to know the first level of recursion will choose 2 bits, then 1 and then 1 bit and reverse the sequence for the selection bit delays.

Another example with twice as many inputs and hence one more bit of selection.
Results
Synthesis Tool | Quartus Prime 23.1std.0 Build 991 11/28/2023 SC Lite Edition |
Project Part | Cyclone V 5CGXFC9E7F35C8 |
Timing Model | Slow 1100mV 0C Model |
I have found Quartus Prime less familiar and slightly more awkward for scripting as the tools seem to be segregated. So the results are gathered manually and hence there are fewer. For the purpose of getting an accurate fmax, all inputs and outputs were registered and false paths set on the external connections. These registers were removed to get the ALM and register primitive counts.
Naïve Method
The first set of results is derived from a function that ensures the maximum work in each pipeline stage never exceeds the worst case work. This tends to pick off more work in the early stages of recursion which equates to later stages of the pipeline.
function num_bits(
sel_len : natural;
num_clks : positive
) return natural is
begin
if num_clks > sel_len then
return 0;
else
return local.math_pkg.int_ceil_div(sel_len, num_clks);
end if;
end function;
Pipeline stage | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Numer of Pipeline Stages | Max bits per pipeline stage | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Sum |
1 | 8 | 8 | 8 | ||||||||
2 | 4 | 4 | 4 | 8 | |||||||
3 | 3 | 3 | 3 | 2 | 8 | ||||||
4 | 2 | 2 | 2 | 2 | 2 | 8 | |||||
5 | 2 | 2 | 2 | 2 | 1 | 1 | 8 | ||||
6 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 8 | |||
7 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 8 | ||
8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 | |
9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 8 |

Clock fequency improves with a reduction in the number of selection bits work per clock cycle until 2-bits of work is reached (8 bits over 4 clock cycles), and then flattens out as all the solutions have precisely one ALM between registers from that point onwards. Hence the clock speed hovers around the maximum achievable. Also from that point onwards, an increase in pipelining clock cycles (5+ pipeline stages) just increases the register count for no gain in clock speed which suggests a less efficient implementation based on the number of primitives, and the ALMs become less efficiently packed, hence slowly but steadily increase. This also suggests the optimum speed is achieved with 2 selection bits of work per clock cycle, i.e. num_clks_g = sel_bits_g / 2. Any larger and registers are simply wasted.
Early Work Method
The second set of results addresses the growth in registers once the maximum clock frequency has already been attained.
function num_bits(
sel_len : natural;
num_clks : positive
) return natural is
constant max : natural := local.math_pkg.int_ceil_div(sel_len, num_clks);
variable ret : natural := max;
begin
if num_clks > 1 then
-- Don't do now what you can leave for later
while ret > 0 and local.math_pkg.int_ceil_div(sel_len-ret+1, num_clks-1) = max loop
ret := ret - 1;
end loop;
end if;
return ret;
end function;
Pipeline stage | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Numer of Pipeline Stages | Max bits per pipeline stage | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Sum |
1 | 8 | 8 | 8 | ||||||||
2 | 4 | 4 | 4 | 8 | |||||||
3 | 3 | 2 | 3 | 3 | 8 | ||||||
4 | 2 | 2 | 2 | 2 | 2 | 8 | |||||
5 | 2 | 0 | 2 | 2 | 2 | 2 | 8 | ||||
6 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 8 | |||
7 | 2 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 8 | ||
8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 | |
9 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 |

A partially successful attempt since register growth is held off until pipelining over 8 cycles when the number of selection bits per clock cycle is 1.
Maximum LUT Efficiency
The third set of results addresses the optimum number of selection bits per clock cycle. More recent Xilinx devices have 6 input LUTs and a standrad one bit MUX requires 3 inputs. This suggests that to maximise LUT usage, 2 bits of selection must be done per clock cycle. i.e. if doing any selection, try to make it a multiple of 2 bits for maximum LUT efficiancy. Intel (was Altera) have claimed to be clever with an "8-input fracturable look-up table (LUT) with the Stratix® II family in 2004" (Ref: FPGA Architecture White Paper). These results were gained using a Cyclone V device dated from 2010. The LOGIC_CELL_COMB primitive clearly has 6 inputs when inspecting the technology viewer. Nevertheless, a fudge factor can be introduced to allow for different numbers of LUT inputs, where the developer can specify the most efficient packing of LUTs as a number of selection bits per LUT. Here we have used 2, so that an integer division is now rounded up to a multiple of 2, so that the division of work will favour 2, 4, 6 etc selection bits.
-- A single bit multiplexer has 3 inputs. A 6 input LUT can achieve two bits of selection at
-- no extra cost over 1-bit. Set 'sel_min' to (lutsize / 3) where 'lutsize' = number of
-- inputs in a device's LUT primitive.
--
function num_bits(
sel_len : natural;
num_clks : positive;
sel_min : positive := 2 -- Typically LUT inputs / 3
) return natural is
constant max : natural := local.math_pkg.int_ceil_div(sel_len, num_clks, sel_min);
variable ret : natural := max;
begin
if num_clks > 1 then
-- Don't do now what you can leave for later
while ret > 0 and local.math_pkg.int_ceil_div(sel_len-ret+1, num_clks-1, sel_min) = max loop
ret := ret - 1;
end loop;
return ret;
else
return local.math_pkg.int_ceil_div(sel_len, num_clks);
end if;
end function;
Pipeline stage | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Numer of Pipeline Stages | Max bits per pipeline stage | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Sum |
1 | 8 | 8 | 8 | ||||||||
2 | 4 | 4 | 4 | 8 | |||||||
3 | 3 | 0 | 4 | 4 | 8 | ||||||
4 | 2 | 2 | 2 | 2 | 2 | 8 | |||||
5 | 2 | 0 | 2 | 2 | 2 | 2 | 8 | ||||
6 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 8 | |||
7 | 2 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 8 | ||
8 | 1 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 8 | |
9 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 8 |

The LUT and register growth now flattens out at the same point in pipeline depth as the maximum clock frequency flattens. This splitting function achieves both user selectable speed as well as the most efficient implementation.
Predicted Results
The number of ALMs (or LUTs) to expect can be calculated from a tree of ALMs where each ALM performs 2 bits of work as illustarted below.

Here the pipeline registers can be inserted between any column of ALMs as determined by the num_bits() function. This example shows how 8-bit of selection can be picked off 2-bits at a time to perform the work in a single ALM, so the result is a 4-ary tree. The sum of the ALMs is 85 per bit of data. The example results are for 2-bit data, hence we would expect 170 ALMs in each results as the most efficient solution. The results here are coming in at ~176 ± 3.
Predicting the number of registers is also possible by summing the registers in the tail recursive selection bit delays with those expected for the recursive tree. This is more involved and not presented here, but in short the number of registers required should be one per recursively instantiated block per data bit. This includes the instantiated tail recursive padding stages which do no selection work. The synthesis results were no more than 4 registers larger than predicted.
The netlist viewer for the technology map appears to be perfect in construction. When viewing the netlist for the technology map post fitting there has been register duplication in the selection bit delays which will be to manage the fanout of the selection bits to large numbers of ALMs. The differences between expected and measured can be explained.
Conclusions
As always, the automatically scaling and pipelining of these structure allows the designer to consider the solution space without the grief of re-implementing the component optimally for the new scenario. Each recursive component has proven tricky to construct for some reason, and solving the problem on work time would probably be considered a technical vanity project! Here the difficulty was caused by the need to delay the selection bits in parallel with the divide and conquer of the multiplexer logic, as the construction of the parallel pipelines was quite different.
References
- Github Source Code
- Other blogs on recursive components