The whole point of creating the automatically self-constructing pipelined adder tree component was to be able to instantiate FIR filters by just specifying VHDL generic parameters. In order to decide whether the component has merit, it needs to be compared with other implementations. In this blog post I'll compare three different implementations of a FIR filter: the "traditional" form built around the adder tree, and the "transpose" and "systolic" variants, which are simple iterative pipelined arrays.
- Three FIR Filter Implementations
  - Traditional
  - Transpose
    - Advantages:
    - Disadvantages:
  - Systolic
    - Advantages:
    - Disadvantages:
- Synthesis
- Results
- Conclusions
- References
Three FIR Filter Implementations
All three implementations share the entity declaration below; only the architectures differ.

entity fir_filter_var_coeffs is
  generic (
    input_width_g : positive;
    num_coeffs_g  : positive
  );
  port (
    clk      : in  std_logic;
    reset    : in  std_logic;
    coeffs   : in  input_arr_t(num_coeffs_g-1 downto 0)(input_width_g-1 downto 0);
    data_in  : in  signed(input_width_g-1 downto 0);
    data_out : out signed(output_bits(2*input_width_g, num_coeffs_g)-1 downto 0)
  );
end entity;
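The entity leans on an input_arr_t array type and an output_bits bit-growth function from a support package that is not reproduced here. As a rough guide only, a minimal sketch of what they might look like (my assumption, including the package name, rather than the original code) is:

library ieee;
  use ieee.numeric_std.all;
  use ieee.math_real.all;

package fir_filter_support_pkg is

  -- Unconstrained array of signed words, constrained at the point of use (VHDL-2008).
  type input_arr_t is array (natural range <>) of signed;

  -- Bits needed to sum 'num_operands' values that are each 'operand_width' bits wide.
  function output_bits(operand_width, num_operands : positive) return positive;

end package;

package body fir_filter_support_pkg is

  function output_bits(operand_width, num_operands : positive) return positive is
  begin
    -- Each doubling of the number of operands adds one bit of growth.
    return operand_width + integer(ceil(log2(real(num_operands))));
  end function;

end package body;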
I then dug out some course notes on implementing FIR filters in FPGAs. The course I attended was "Essential DSP Implementation Techniques for Xilinx FPGAs" given by Doulos, but many other accredited suppliers are also available. There is also the option of generating an IP Core. Having reviewed the options required to generate a FIR filter from the IP Catalogue, I decided I could not be sure what I was getting from the generated code for the purposes of a comparison, and the core would be too inconvenient to re-parameterise for each test.
Each architecture aims to fully pipeline the design for maximum speed. That is, the adder tree is fully binary, adding at most two operands between register stages. The designs will be synthesised solely in Quartus Prime, as Vivado cannot properly implement recursive components. The designs will then be compared for speed, with the number of registers used to verify their construction.
Traditional

My Xilinx course notes offer no advantages or disadvantages for the traditional (adder tree) FIR filter. I expect that the structure is simply not realised by their IP catalogue, possibly due to the lack of support for fully recursive VHDL. (It supports tail recursion only, which is really iterative.)
architecture traditional of fir_filter_var_coeffs is

  signal data_reg : input_arr_t(num_coeffs_g-2 downto 0)(data_in'range);
  signal mult_arr : input_arr_t(coeffs'range)((2*input_width_g)-1 downto 0);

begin

  process(clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        data_reg <= (others => (others => '0'));
        mult_arr <= (others => (others => '0'));
      else
        data_reg <= data_in & data_reg(data_reg'high downto 1);
        for i in coeffs'range loop
          if i = coeffs'high then -- num_coeffs_g-1
            mult_arr(i) <= data_in * coeffs(i);
          else
            mult_arr(i) <= data_reg(i) * coeffs(i);
          end if;
        end loop;
      end if;
    end if;
  end process;

  adder_tree_pipe_i : entity work.adder_tree_pipe
    generic map (
      depth_g       => ceil_log(coeffs'length, 2), -- Best attempt at a binary tree with single adder between register stages
      num_coeffs_g  => coeffs'length,
      input_width_g => mult_arr(0)'length
    )
    port map (
      clk   => clk,
      reset => reset,
      i     => mult_arr,
      o     => data_out
    );

end architecture;
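The ceil_log function used to size the adder tree depth also comes from that support package. A possible implementation, sitting alongside output_bits in the package body sketched earlier (again my guess rather than the original code), is:

  -- Ceiling of log base 'b' of 'n', e.g. ceil_log(6, 2) = 3; used as the number of
  -- pipeline stages required for a binary adder tree over 'n' operands.
  function ceil_log(n, b : positive) return natural is
    variable v : positive := 1;
    variable r : natural  := 0;
  begin
    while v < n loop
      v := v * b;
      r := r + 1;
    end loop;
    return r;
  end function;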
Transpose

Cribbing some of the course notes from Xilinx, the training suggests the following features of the Transpose FIR filter.
Advantages:
- A smooth data flow from left to right, removing the layout "battle".
- Supports any number of taps without issues of non-symmetric adder trees. [There are none if and only if your synthesis tool supports recursion fully!]
- The input-to-output latency is reduced in a pipelined implementation. The maximum latency will never exceed the pipelining through the slice containing the first coefficient. This is typically three clock cycles from data input until the result appears at the output.
- Efficient mapping to the DSP48 slice enabled by the adder chain structure of the transpose FIR filter. This adder chain is extendable to support large or small FIR filters.
- No slice logic external to the DSP48 is required, enabling the highest possible performance to be achieved.
- Simplified design methodology (tap-slice approach).
- [Mine] Realisable using Vivado, which does not support the recursion required for pipelined adder trees!
Disadvantages:
- Performance can be limited by a high fanout input signal if there are a large number of taps.
- Total number of bits within the adder tree structure is larger. [Not sure about this: it is the same number of adders when drawn correctly, and the chain is deeper than the tree, but it's all pipelined so no matter. Is this about the growth in bits as the sum progresses, i.e. the number of bits required to represent values without overflow?]
- [Mine] n-1 fewer registers (for n coefficients), as the data shift registers effectively get combined with the adder pipelining.
architecture transpose of fir_filter_var_coeffs is

  -- Just need to pass a defined array of arrays, with a guaranteed range for coeffs_c'range.
  constant coeffs_c    : input_arr_t(coeffs'length-1 downto 0)(input_width_g-1 downto 0) := (others => (others => '0'));
  constant sum_width_c : integer_vector(coeffs'range) := calc_sum_width(coeffs_c, input_width_g);

  signal mult_arr : input_arr_t(coeffs_c'range)(input_width_g + coeffs(0)'length - 1 downto 0);
  signal sum      : input_arr_t(coeffs_c'range)(output_bits(input_width_g + coeffs(0)'length, coeffs'length)-1 downto 0);

begin

  process(clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        mult_arr <= (others => (others => '0'));
        -- Only reset the bits of 'sum' that are used; the remaining bits are never touched,
        -- so they do not just evaporate but provably never get used in the filter logic.
        for i in coeffs'range loop
          sum(i)(sum_width_c(i)-1 downto 0) <= (others => '0');
        end loop;
      else
        for i in coeffs'range loop
          mult_arr(i) <= data_in * coeffs(coeffs'high-i);
          if i = coeffs'high then -- num_coeffs_g-1
            sum(i)(sum_width_c(i)-1 downto 0) <= resize(mult_arr(i), sum_width_c(i)); -- Add zero sum in
          else
            sum(i)(sum_width_c(i)-1 downto 0) <= resize(mult_arr(i), sum_width_c(i)) + resize(sum(i+1)(sum_width_c(i+1)-1 downto 0), sum_width_c(i));
          end if;
        end loop;
      end if;
    end if;
  end process;

  data_out <= sum(0);

end architecture;
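calc_sum_width is another support function not reproduced in the post. Matching how it is used above, a plausible sketch (my assumption, relying on ieee.math_real at elaboration time) returns one partial-sum width per tap, growing from a single product at the last tap to the full output width at tap 0:

  -- Width of the partial sum held at each tap of the adder chain. Tap 'high' holds a
  -- single product; tap 0 has accumulated all of them, so needs the full output width.
  function calc_sum_width(
    coeffs      : input_arr_t;
    input_width : positive
  ) return integer_vector is
    variable ret : integer_vector(coeffs'range);
  begin
    for i in coeffs'range loop
      ret(i) := input_width + coeffs(coeffs'high)'length
                + integer(ceil(log2(real(coeffs'high - i + 1))));
    end loop;
    return ret;
  end function;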
Systolic

Cribbing some of the course notes from Xilinx, the training suggests the following features of the Systolic FIR filter.
Advantages:
- Highest Performance: Maximum performance can be achieved with this structure because there is no high-fanout input signal. Larger filters will be limited by routing if the number of coefficients exceeds the number of DSP slices in a column on the device.
- Efficient mapping to the DSP primitives: Enabled by the adder chain structure of the systolic FIR filter. This adder chain is extendable to support large or small FIR filters.
- No external logic: No external FPGA fabric is required, enabling the highest possible performance to be achieved.
- [Mine] Realisable using Vivado, which does not support the recursion required for pipelined adder trees!
Disadvantages:
- High latency: The latency of the filter is a function of the number of coefficients in the filter. The larger the filter, the higher the latency.
architecture systolic of fir_filter_var_coeffs is

  -- Just need to pass a defined array of arrays, with a guaranteed range for coeffs_c'range.
  -- Compiler error: 'calc_sum_width' can't be dependent on coeffs.
  constant coeffs_c    : input_arr_t(num_coeffs_g-1 downto 0)(input_width_g-1 downto 0) := (others => (others => '0'));
  constant sum_width_c : integer_vector(coeffs_c'range) := calc_sum_width(coeffs_c, input_width_g);

  signal data_reg  : input_arr_t(num_coeffs_g-2 downto 0)(data_in'range);
  signal data_pipe : input_arr_t(num_coeffs_g-2 downto 0)(data_in'range);
  signal mult_arr  : input_arr_t(coeffs'range)(input_width_g + coeffs_c(0)'length - 1 downto 0);
  signal sum       : input_arr_t(coeffs'range)(output_bits(input_width_g + coeffs(0)'length, coeffs'length)-1 downto 0);

begin

  process(clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        data_reg  <= (others => (others => '0'));
        data_pipe <= (others => (others => '0'));
        mult_arr  <= (others => (others => '0'));
        -- Only reset the bits of 'sum' that are used; the remaining bits are never touched,
        -- so they do not just evaporate but provably never get used in the filter logic.
        for i in coeffs'range loop
          sum(i)(sum_width_c(i)-1 downto 0) <= (others => '0');
        end loop;
      else
        data_pipe <= data_in & data_reg(data_pipe'high downto 1);
        data_reg  <= data_pipe;
        for i in coeffs'range loop
          if i = coeffs'high then -- num_coeffs_g-1
            mult_arr(i) <= data_in * coeffs(i);
            sum(i)(sum_width_c(i)-1 downto 0) <= resize(mult_arr(i), sum_width_c(i)); -- Add zero sum in
          else
            mult_arr(i) <= data_reg(i) * coeffs(i);
            sum(i)(sum_width_c(i)-1 downto 0) <= resize(mult_arr(i), sum_width_c(i)) + resize(sum(i+1)(sum_width_c(i+1)-1 downto 0), sum_width_c(i));
          end if;
        end loop;
      end if;
    end if;
  end process;

  data_out <= sum(0);

end architecture;
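With all three architectures compiled against the one entity, a test instance can pick between them simply by naming the required architecture in a direct entity instantiation. The generic values and surrounding signals below are purely illustrative:

  fir_i : entity work.fir_filter_var_coeffs(transpose) -- or (traditional), (systolic)
    generic map (
      input_width_g => 8,
      num_coeffs_g  => 21
    )
    port map (
      clk      => clk,
      reset    => reset,
      coeffs   => coeffs,
      data_in  => data_in,
      data_out => data_out
    );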
Synthesis
The following global assignments were used in Quartus Prime to turn off clever optimisation features in order to establish the underlying achievable clock speed of the described logic circuits.
set_global_assignment -name OPTIMIZATION_MODE BALANCED
set_global_assignment -name ALLOW_REGISTER_MERGING OFF
set_global_assignment -name ALLOW_REGISTER_DUPLICATION OFF
set_global_assignment -name ALLOW_REGISTER_RETIMING OFF
set_global_assignment -name REMOVE_DUPLICATE_REGISTERS OFF
set_global_assignment -name SYNTHESIS_EFFORT FAST
set_global_assignment -name DISABLE_REGISTER_MERGING_ACROSS_HIERARCHIES OFF
set_global_assignment -name ROUTER_LCELL_INSERTION_AND_LOGIC_DUPLICATION OFF
set_global_assignment -name ROUTER_REGISTER_DUPLICATION OFF
set_global_assignment -name ADVANCED_PHYSICAL_OPTIMIZATION OFF
The other settings required to reproduce the following results are:
Tool: | Quartus Prime Ver 18.1.1 |
---|---|
Device: | Cyclone V 5CGXFC7C7F23C8 |
Timing Model: | Slow 1100mV 85C Model Only |
Results
The following graph was prompted by an observation that for all three architectures there is a positive jump in maximum clock frequency between 9 and 10 bits per operand (data_in and coefficients). I was unable to identify any clues from the technology logic view as to why the design performs better at 10 bits or above than at 9 bits or below. Suggestions please.

Coefficients | Input Width | Traditional Fmax (MHz) | Traditional Registers | Transpose Fmax (MHz) | Transpose Registers | Systolic Fmax (MHz) | Systolic Registers |
---|---|---|---|---|---|---|---|
6 | 8 | 124.89 | 249 | 124.72 | 203 | 126.39 | 275 |
6 | 11 | 157.68 | 339 | 150.51 | 275 | 154.8 | 374 |
6 | 14 | 153.66 | 429 | 150.92 | 347 | 153.73 | 473 |
21 | 8 | 125.28 | 982 | 123.35 | 746 | 124.49 | 1058 |
21 | 11 | 151.93 | 1333 | 153.63 | 998 | 150.33 | 1427 |
21 | 14 | 150.31 | 1684 | 154.25 | 1250 | 154.87 | 1796 |
41 | 14 | 150.51 | 3359 | 148.48 | 2479 | 150.69 | 3585 |
An observation made in passing with an earlier version of the VHDL code is that, when using fixed coefficients, Quartus was able to immediately optimise away any multipliers with duplicate coefficient values in the Transpose architecture (only). This was not possible in the other two architectures due to the shift register for data_in. A rather neat benefit of the Transpose FIR filter.
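For illustration only, "fixed coefficients" here simply means tying the coeffs port to a constant rather than a signal. The duplicated values in an invented symmetric set like the one below are what Quartus could merge away in the Transpose architecture:

  -- Invented symmetric coefficient set; the repeated values are the candidates for
  -- multiplier sharing when driven onto the coeffs port of the Transpose architecture.
  constant fixed_coeffs_c : input_arr_t(3 downto 0)(7 downto 0) := (
    3 => to_signed(25, 8),
    2 => to_signed(100, 8),
    1 => to_signed(100, 8),
    0 => to_signed(25, 8)
  );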
Conclusions
Now I cribbed Xilinx course notes in order to build these structures in Intel's Quartus Prime, because I cannot realise the Traditional FIR filter in Vivado. This does open up the possibility that the claims made by Xilinx about the advantages and disadvantages do not necessarily map across to Intel devices.
- On clock speed there is no clear winner, but Transpose looks to be edging ahead slightly.
- On size, Transpose is the outright winner, which we would have predicted by looking at the diagrams of how they are constructed. Systolic is the outright loser. I note that the adder tree uses additional registers over the Transpose architecture in two places:
- The data shift register: here the Transpose architecture manages to merge those registers with the addition logic pipelining to save space. This can be simply verified by inspection of the architectures, but less obvious is the following:
- Un-balanced adder trees: when no adder is implemented at a node in the tree, the single input must still be delayed to keep the pipeline stages aligned. The Transpose and Systolic architectures avoid this waste by being purely iterative.
When I look at Xilinx's suggested advantages and disadvantages of the different FIR filter architectures I observe:
- No evidence of a "layout battle" with the adder tree in the form of a lower Fmax, but then the chosen device is hardly congested.
- There are no issues associated with non-symmetric adder trees. These issues were not actually stated, but I assume they allude to the problem of constructing an adder tree with a non-power-of-2 number of inputs, which I have already solved.
- Latency of the results being emitted is self-evident.
- Efficient packing of logic into ALMs (CLBs in Xilinx) seems to be the same for all.
- The high fanout associated with the Transpose FIR filter seems to cause little in the way of practical timing issues.
With maximum clock speed not being a decider, and the fanout of data_in for the Transpose FIR filter having little practical effect on timing, always use a Transpose FIR filter. Keep it simple! If any of them had to be chosen as the "worst" performer it would be the Systolic FIR filter, being slightly worse for timing in general and the worst for logic utilisation. There is no advantage to being able to automatically create a pipelined adder tree for a FIR filter; it serves as nothing more than an intellectual exercise, and one that is not routinely supported by synthesis tools.