FIR Filter Experiment with Vitis HLS

Posted on 03 May 2026 in FPGA, VHDL, C

Having completed my "First Attempt at High Level Synthesis", I thought I would explore the Vitis HLS tool with something slightly larger: a FIR filter. There is a C++ library for realising FIR filters, but it is wrapped in so many #ifndef directives that I found it too obfuscated, and I wanted something simple to play with. I created a 64-coefficient FIR filter (#define ARR_SIZE 64) using the following code.

#include "fir.h"

int fir (int a[], int coeffs[], unsigned int len) {
    int b = 0;

    for(int i = 0; i < len; i++) {
        b += a[i] * coeffs[i];
    }

    return b;
}

int fir_fixed (int a[ARR_SIZE], int c[ARR_SIZE]) {
    return fir(a, c, ARR_SIZE);
}

C code for a FIR filter.
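Before synthesis, the C itself can be sanity-checked with a small self-checking harness. The sketch below is a minimal example and not the testbench used for this post: it repeats the kernel so that it compiles on its own, assumes ARR_SIZE is 64 as stated above, and the helper names impulse_response and all_ones_response are made up for illustration.

```cpp
#include <cassert>

#define ARR_SIZE 64  // assumed to match the #define in fir.h

// Copy of the kernel from the listing above, with the loop index made
// unsigned to match the type of len.
int fir(int a[], int coeffs[], unsigned int len) {
    int b = 0;
    for (unsigned int i = 0; i < len; i++) {
        b += a[i] * coeffs[i];
    }
    return b;
}

int fir_fixed(int a[ARR_SIZE], int c[ARR_SIZE]) {
    return fir(a, c, ARR_SIZE);
}

// Hypothetical helper: a unit impulse at position tap should pick out
// coefficient tap, which is tap + 1 with the ramp used here.
int impulse_response(unsigned int tap) {
    assert(tap < ARR_SIZE);
    int a[ARR_SIZE] = {0};
    int c[ARR_SIZE];
    for (int i = 0; i < ARR_SIZE; i++) {
        c[i] = i + 1;  // arbitrary, easily recognised coefficients
    }
    a[tap] = 1;
    return fir_fixed(a, c);
}

// Hypothetical helper: all-ones data and coefficients should sum to ARR_SIZE.
int all_ones_response() {
    int a[ARR_SIZE];
    int c[ARR_SIZE];
    for (int i = 0; i < ARR_SIZE; i++) {
        a[i] = 1;
        c[i] = 1;
    }
    return fir_fixed(a, c);
}
```

A main() that compares impulse_response(5) against 6 and all_ones_response() against 64, returning non-zero on any mismatch, would be one way to turn this into a C simulation testbench.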

Default Synthesis

Running this through Vitis HLS with the default settings gives a memory interface to an external RAM, so the design sequentially reads each coefficient and data pair, then multiplies and accumulates.

Elaborated design in Vivado.

It uses one multiplier (red), an adder (pink) for the accumulator and another adder (green) for the address increment.

The data path is two registers deep, but the output is unregistered.

I am confused by the report's implication that each loop iteration takes 4 cycles when the data path is only 2 registers deep. The Initiation Interval is 1, so data is fed in every clock cycle without stalling.
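The iteration latency and the Initiation Interval measure different things, and the usual pipelined-loop relationship makes that concrete. The sketch below uses the textbook formula under a hypothetical function name; it ignores any loop entry and exit overhead the tool may add, so the results are estimates rather than figures taken from the Vitis report.

```cpp
#include <cassert>

// Estimated cycles for a pipelined loop: the first iteration needs `depth`
// cycles to emerge, and each of the remaining (trips - 1) iterations starts
// `ii` cycles after the previous one.
unsigned pipelined_loop_cycles(unsigned trips, unsigned ii, unsigned depth) {
    assert(trips > 0 && ii > 0);
    return (trips - 1) * ii + depth;
}
```

For the default design, pipelined_loop_cycles(64, 1, 4) gives 67 cycles using the reported iteration latency of 4, against 65 cycles if the depth really were 2. Either way the total is dominated by the 64 iterations, not by the depth of the pipeline.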

Exploring the Design Space

Let's amend the implementation to read 8 coefficient and data pairs in parallel. I would expect a good implementation of the code to look something like the following:

The code was amended with pragmas as follows.

#include "fir.h"

int fir (int a[], int coeffs[], unsigned int len) {
    int b = 0;

    #pragma HLS ARRAY_PARTITION variable=a dim=1 factor=8 type=block
    #pragma HLS ARRAY_PARTITION variable=coeffs dim=1 factor=8 type=block
    for(int i = 0; i < len; i++) {
        #pragma HLS UNROLL
        b += a[i] * coeffs[i];
    }

    return b;
}

int fir_fixed (int a[ARR_SIZE], int c[ARR_SIZE]) {
    return fir(a, c, ARR_SIZE);
}

C code for a FIR filter with unrolled loop.
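The type=block setting in the pragmas decides which array elements share a bank. The sketch below models the index-to-bank mapping in plain C++, following the description of block and cyclic partitioning in UG1399. The constant and function names are hypothetical, and the model says nothing about what the tool actually did with the interface arrays in this design.

```cpp
#include <cassert>

constexpr unsigned kSize = 64;   // ARR_SIZE in the post
constexpr unsigned kFactor = 8;  // factor=8 in the pragmas

// type=block: consecutive elements stay together in one bank.
unsigned block_bank(unsigned i) {
    assert(i < kSize);
    return i / (kSize / kFactor);
}

// type=cyclic: elements are dealt out to the banks round-robin.
unsigned cyclic_bank(unsigned i) {
    assert(i < kSize);
    return i % kFactor;
}

// Count the distinct banks touched by the eight consecutive indices
// starting at `first`.
unsigned banks_touched(unsigned first, bool cyclic) {
    assert(first + kFactor <= kSize);
    bool seen[kFactor] = {false};
    unsigned count = 0;
    for (unsigned i = first; i < first + kFactor; i++) {
        unsigned bank = cyclic ? cyclic_bank(i) : block_bank(i);
        if (!seen[bank]) {
            seen[bank] = true;
            count++;
        }
    }
    return count;
}
```

In this model banks_touched(0, false) is 1 and banks_touched(0, true) is 8, so eight consecutive samples sit in a single bank under block partitioning but are spread across all eight under cyclic partitioning. Whether that difference affected the results here would need a separate experiment.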

The unrolled loop has 8 multipliers (red) feeding an adder tree (pink) which has been pipelined. The data path below shows 5 registers.
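For comparison, the structure of 8 multipliers feeding a balanced adder tree can be written out by hand. The following is a software-only sketch with hypothetical names (mac8, fir_8way, fir_reference); it is not the code that was synthesised, and it carries no HLS pragmas. It needs 8 multiplications, 7 tree additions and 1 accumulator addition per group of 8 taps.

```cpp
#include <cassert>

constexpr unsigned kTaps = 64;  // ARR_SIZE in the post
constexpr unsigned kLanes = 8;  // pairs processed together
static_assert(kTaps % kLanes == 0, "taps must divide into whole groups");

// Eight products summed by a balanced three-level tree (4 + 2 + 1 adders).
int mac8(const int a[kLanes], const int c[kLanes]) {
    int p[kLanes];
    for (unsigned k = 0; k < kLanes; k++) {
        p[k] = a[k] * c[k];
    }
    int s0 = p[0] + p[1];
    int s1 = p[2] + p[3];
    int s2 = p[4] + p[5];
    int s3 = p[6] + p[7];
    return (s0 + s1) + (s2 + s3);
}

// Eight passes over the data, each adding one group of 8 to the accumulator.
int fir_8way(const int a[kTaps], const int c[kTaps]) {
    int acc = 0;
    for (unsigned base = 0; base < kTaps; base += kLanes) {
        acc += mac8(a + base, c + base);
    }
    return acc;
}

// Straightforward multiply-accumulate, as in the original listing.
int fir_reference(const int a[], const int c[], unsigned len) {
    assert(len <= kTaps);
    int b = 0;
    for (unsigned i = 0; i < len; i++) {
        b += a[i] * c[i];
    }
    return b;
}

// Compare both versions on arbitrary ramp data.
bool agrees_with_reference() {
    int a[kTaps];
    int c[kTaps];
    for (unsigned i = 0; i < kTaps; i++) {
        a[i] = static_cast<int>(i) - 32;
        c[i] = 3 * static_cast<int>(i) + 1;
    }
    return fir_8way(a, c) == fir_reference(a, c, kTaps);
}
```

Because integer addition is associative when nothing overflows, regrouping the sum into a tree leaves the result unchanged, which is what agrees_with_reference() checks.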

The schedule viewer shows work happening in parallel, but also shows a timing violation.

The summary results report both that timing has been met (critical path of 8.2 ns) and that there is a timing violation with a slack of -0.65 ns. Which is it? The loop latency is 11 cycles, but the data path is only 5 registers deep.
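The frequencies and periods quoted in this post can be cross-checked with simple arithmetic. The sketch below uses hypothetical helper names and only converts between MHz and ns; it does not model how Vitis HLS derives its own slack figure.

```cpp
#include <cassert>

// Convert a clock frequency in MHz to a period in ns.
double period_ns(double f_mhz) {
    assert(f_mhz > 0.0);
    return 1000.0 / f_mhz;
}

// Slack of an achieved F_max against a requested clock, in ns.
// Positive means the requested clock period is met.
double slack_ns(double requested_mhz, double fmax_mhz) {
    return period_ns(requested_mhz) - period_ns(fmax_mhz);
}
```

With the estimated F_max of 122.68 MHz for this experiment, period_ns gives about 8.15 ns, which agrees with the reported 8.2 ns critical path, and slack_ns(100, 122.68) is about +1.85 ns against the full 10 ns period. The -0.65 ns figure is therefore presumably measured against a tighter budget than the full period, though that is an assumption on the basis of this arithmetic alone.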

"A timing violation is a path of operations requiring more time than the available clock cycle. To visualize this, the problematic operation is represented in the Schedule Viewer in a red box."

Vitis High-Level Synthesis User Guide (UG1399)

I'm finding this hard to believe at present. The tooling does provide the means to identify the offending line of source code, but I only have one line of real interest.

"Solving timing violations in Vitis HLS using the Schedule Viewer involves identifying critical paths that exceed the clock period and implementing code-level or directive-based optimizations to break up long logic chains."

Google Artificial General Intelligence (AGI)

Some old documentation hinted at a "guidance window", but that appears to be for the old Eclipse-based tool, and I could not find any context-sensitive help in Vitis version 2025.2.1. I decided to trial an additional directive to fix the timing violation.

Pipelining

I added a PIPELINE directive to the C code as follows. Note that the PIPELINE directive also implies UNROLL for any loops inside the pipelined region, so the explicit UNROLL pragma has been dropped.

#include "fir.h"

int fir (int a[], int coeffs[], unsigned int len) {
    int b = 0;

    #pragma HLS ARRAY_PARTITION variable=a dim=1 factor=8 type=block
    #pragma HLS ARRAY_PARTITION variable=coeffs dim=1 factor=8 type=block
    #pragma HLS PIPELINE
    for(int i = 0; i < len; i++) {
        b += a[i] * coeffs[i];
    }

    return b;
}

int fir_fixed (int a[ARR_SIZE], int c[ARR_SIZE]) {
    return fir(a, c, ARR_SIZE);
}

The pipelined loop has 8 multipliers (red) feeding an adder tree (pink) as before. The data path below also shows 5 registers. The schedule viewer shows that the timing violation has been resolved.

Results

NB. Nothing has been verified in RTL simulation as that would require coding for the control logic added to the function parameters and return values.

The adder trees created by Vitis are incomplete. This indicates an inefficient use of logic as it incurs a greater use of registers in the final design.

Results for comparison.
Tool	Settings	Expt. 1	Expt. 2	Expt. 3	Expected	Units
Vitis HLS	ARRAY_PARTITION factor	1	8	8	8	-
Vitis HLS	UNROLL	0	1	Inferred	Inferred	-
Vitis HLS	PIPELINE	0	0	1	1	-
Vitis HLS	Requested Clock Frequency	100	100	100	-	MHz

Tool	Results	Expt. 1	Expt. 2	Expt. 3	Expected	Units
Vitis HLS	Estimated F_max	144.68	122.68	144.45	-	MHz
Vivado - Elaboration	RTL_MULT	1	16	16	8	-
Vivado - Elaboration	RTL_ADD (inc. address logic)	2	47	44	8	-
Vivado	Critical Path Post Synthesis (as F_max)	328.95	119.93	115.33	-	MHz
Vivado	Critical Path Post Implementation (as F_max)	288.18	106.70	104.24	-	MHz
Vivado	SLICE	33	800	542	-	-
Vivado	LUT	93	1824	1215	-	-
Vivado	FF	95	2278	2113	-	-
Vivado	DSP	3	48	48	-	-
Me	Loops	64	8	8	8	-
Vitis HLS	Per loop (Initiation Interval)	1	1	1	1	Cycles
Vitis HLS	Pipeline Delay?	4	11	10	2	Cycles
Vitis HLS	Pipeline Path Depth	2	5	5	2	Registers
Me	Assumed Calculation Time	64 + 2	8 + 5	8 + 5	8 + 2	Cycles

Simply comparing the multiplier and adder counts against what I expected shows that the generated design is inefficient. We do not appear to have reached a level of maturity where die-hard logic designers will be persuaded to adopt this technology. If you are going to silicon with the permanence of an ASIC, or even the effort of an FPGA, surely it still makes sense to code for efficiency? A sufficient design can be sketched and coded in VHDL, with generics to manage features that might be considered tradable. This form of design entry might be touted to software developers without FPGA experience, except that there is still the wider design that presents data to and consumes data from this component to consider, e.g. RF input, I/O pins and constraints files.
