Having completed First Attempt at High Level Synthesis, I thought I would explore the Vitis HLS tool with something slightly larger, a FIR filter. There is a C++ library for realising FIR filters, but it is wrapped in so many #ifndef directives that I found it too obfuscated, and I wanted something simple to play with. I created a 64-coefficient FIR filter (#define ARR_SIZE 64) using the following code.
#include "fir.h"
int fir (int a[], int coeffs[], unsigned int len) {
    int b = 0;
    for(int i = 0; i < len; i++) {
        b += a[i] * coeffs[i];
    }
    return b;
}
int fir_fixed (int a[ARR_SIZE], int c[ARR_SIZE]) {
    return fir(a, c, ARR_SIZE);
}
Default Synthesis
Running this through Vitis with the default settings gives a memory interface to an external RAM such that the design would sequentially read each FIR filter coefficient and data pair, multiply and accumulate.

It uses one multiplier (red), an adder (pink) for the accumulator and another adder (green) for the address increment.

The data path is two registers deep, but the output is unregistered.

I am confused by the implication that each loop iteration requires 4 cycles when the data path is only 2 registers deep. The Initiation Interval is 1, so data is fed every clock cycle without stalling.
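One way to relate these numbers, offered as an assumption rather than something read from the Vitis reports: the total cycle count of a pipelined loop is commonly modelled as (trip count - 1) x II + iteration latency. The iteration latency counts every scheduled cycle of one iteration (possibly including the memory read), not only the registers visible in the data path diagram. A minimal sketch of that arithmetic, with `pipelined_loop_cycles` as a hypothetical helper name:

```cpp
#include <cassert>

// Commonly used model for a pipelined loop: a new iteration starts every
// `ii` cycles, and the final iteration needs `depth` cycles to complete.
// The function name and parameters are illustrative, not a Vitis HLS API.
unsigned pipelined_loop_cycles(unsigned trip_count, unsigned ii, unsigned depth) {
    assert(trip_count >= 1 && ii >= 1 && depth >= 1);
    return (trip_count - 1) * ii + depth;
}
```

With 64 iterations, an II of 1 and the reported 4 cycles per iteration, this model gives 63 + 4 = 67 cycles.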
Exploring the Design Space
Let's amend the implementation to read 8 pairs of coefficient and data in parallel. I would expect a good implementation of the code to look something like the following:
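As a behavioural sketch of that expected structure (plain C++, not something that has been synthesised here), the loop can be written as 8 passes over 8 multiplies, a balanced adder tree and an accumulator. The names `LANES`, `fir_reference`, `fir_8wide_model` and `fir_8wide_matches_reference` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

constexpr unsigned ARR_SIZE = 64;  // same value as the #define in the post
constexpr unsigned LANES = 8;      // assumed: pairs consumed per pass

// Plain sequential multiply-accumulate, used as the reference result.
int fir_reference(const int a[ARR_SIZE], const int c[ARR_SIZE]) {
    int b = 0;
    for (unsigned i = 0; i < ARR_SIZE; i++) b += a[i] * c[i];
    return b;
}

// Model of the expected hardware: 8 multipliers, 7 tree adders, 1 accumulator adder.
int fir_8wide_model(const int a[ARR_SIZE], const int c[ARR_SIZE]) {
    static_assert(ARR_SIZE % LANES == 0, "array must divide evenly into lanes");
    assert(a != nullptr && c != nullptr);
    int acc = 0;
    for (unsigned base = 0; base < ARR_SIZE; base += LANES) {
        int p[LANES];
        for (unsigned k = 0; k < LANES; k++) {
            p[k] = a[base + k] * c[base + k];   // 8 parallel multiplies
        }
        // Balanced adder tree: 4 adders, then 2, then 1.
        const int s0 = p[0] + p[1];
        const int s1 = p[2] + p[3];
        const int s2 = p[4] + p[5];
        const int s3 = p[6] + p[7];
        const int t0 = s0 + s1;
        const int t1 = s2 + s3;
        acc += t0 + t1;                         // tree root, then accumulator
    }
    return acc;
}

// Self-check with a simple ramp input; returns true when both versions agree.
bool fir_8wide_matches_reference() {
    int a[ARR_SIZE], c[ARR_SIZE];
    for (unsigned i = 0; i < ARR_SIZE; i++) {
        a[i] = static_cast<int>(i) + 1;
        c[i] = 2;
    }
    return fir_8wide_model(a, c) == fir_reference(a, c);
}
```

The outer loop runs 8 times, which is the loop count an 8-wide implementation would be expected to need.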

The code was amended with pragmas as follows.
#include "fir.h"
int fir (int a[], int coeffs[], unsigned int len) {
    int b = 0;
    #pragma HLS ARRAY_PARTITION variable=a dim=1 factor=8 type=block
    #pragma HLS ARRAY_PARTITION variable=coeffs dim=1 factor=8 type=block
    for(int i = 0; i < len; i++) {
        #pragma HLS UNROLL
        b += a[i] * coeffs[i];
    }
    return b;
}
int fir_fixed (int a[ARR_SIZE], int c[ARR_SIZE]) {
    return fir(a, c, ARR_SIZE);
}

The unrolled loop has 8 multipliers (red) feeding an adder tree (pink) which has been pipelined. The data path below shows 5 registers.
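The ARRAY_PARTITION pragmas above request `type=block`; the directive also accepts `type=cyclic`, and the two distribute elements across banks differently. The sketch below models only the two index mappings, with `block_bank` and `cyclic_bank` as hypothetical helper names. Whether switching the type would change the synthesis results reported here has not been checked.

```cpp
#include <cassert>

// Bank that holds element i when an n-element array is split into `factor`
// equal blocks of consecutive elements (block partitioning).
unsigned block_bank(unsigned i, unsigned n, unsigned factor) {
    assert(factor >= 1 && n % factor == 0 && i < n);
    return i / (n / factor);
}

// Bank that holds element i when elements are interleaved across `factor`
// banks (cyclic partitioning).
unsigned cyclic_bank(unsigned i, unsigned factor) {
    assert(factor >= 1);
    return i % factor;
}
```

For a 64-element array with factor=8, block partitioning places elements 0 to 7 in the same bank, whereas cyclic partitioning spreads them over banks 0 to 7.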

The schedule viewer shows work is happening in parallel, but also shows a timing violation.

The summary results report that timing has been met (critical path of 8.2 ns), yet a timing violation with negative slack of -0.65 ns is also reported. Which is it? The loop latency is 11 cycles, but the data path is only 5 registers deep.

"A timing violation is a path of operations requiring more time than the available clock cycle. To visualize this, the problematic operation is represented in the Schedule Viewer in a red box."
I'm finding this hard to believe at present. The tooling does provide the means to identify the line of source code, but I only have one of real interest.

"Solving timing violations in Vitis HLS using the Schedule Viewer involves identifying critical paths that exceed the clock period and implementing code-level or directive-based optimizations to break up long logic chains."
Some old documentation hinted at a "guidance window", but that appears to be for the old Eclipse-based tool and I could not find any context-sensitive help in Vitis version 2025.2.1. I decided to trial an additional directive to fix the timing violation.
Pipelining
I added a PIPELINE directive to the C code as follows. Note that the PIPELINE directive also implies UNROLL for any loops nested beneath it, so the explicit UNROLL pragma has been dropped.
#include "fir.h"
int fir (int a[], int coeffs[], unsigned int len) {
    int b = 0;
    #pragma HLS ARRAY_PARTITION variable=a dim=1 factor=8 type=block
    #pragma HLS ARRAY_PARTITION variable=coeffs dim=1 factor=8 type=block
    #pragma HLS PIPELINE
    for(int i = 0; i < len; i++) {
        b += a[i] * coeffs[i];
    }
    return b;
}
int fir_fixed (int a[ARR_SIZE], int c[ARR_SIZE]) {
    return fir(a, c, ARR_SIZE);
}

The pipelined loop has 8 multipliers (red) feeding an adder tree (pink) as before. The data path below also shows 5 registers. The schedule viewer shows the timing violation has been resolved.

Results
NB. Nothing has been verified in RTL simulation as that would require coding for the control logic added to the function parameters and return values.
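Vitis HLS provides a C/RTL co-simulation flow that is intended to reuse a C++ test bench to drive the generated RTL, so the control logic may not need to be hand-coded. A minimal sketch of such a test bench body follows; it has not been run through the tool here. `fir_testbench` is a hypothetical name, the stimulus values are arbitrary, and the local `#define` and function copies stand in for fir.h and the source file so that the sketch is self-contained.

```cpp
#include <cassert>

#define ARR_SIZE 64  // stands in for fir.h, whose full contents are not shown

// Copies of the functions under test, as listed in the post.
int fir(int a[], int coeffs[], unsigned int len) {
    int b = 0;
    for (unsigned int i = 0; i < len; i++) {
        b += a[i] * coeffs[i];
    }
    return b;
}

int fir_fixed(int a[ARR_SIZE], int c[ARR_SIZE]) {
    return fir(a, c, ARR_SIZE);
}

// Hypothetical test bench body: returns 0 on a match, 1 on a mismatch.
int fir_testbench() {
    int a[ARR_SIZE], c[ARR_SIZE];
    int expected = 0;
    for (int i = 0; i < ARR_SIZE; i++) {
        a[i] = i - 32;        // arbitrary signed data samples
        c[i] = (i % 7) + 1;   // arbitrary small coefficients
        expected += a[i] * c[i];
    }
    assert(expected != 0);    // guard against a stimulus that passes trivially
    const int got = fir_fixed(a, c);
    return (got == expected) ? 0 : 1;
}
```

In a real project the body would sit in `main()`, since the tool treats a return value of 0 as a pass.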
The adder trees created by Vitis are incomplete. This indicates an inefficient use of logic as it incurs a greater use of registers in the final design.

| Tool | Settings | Expt. 1 | Expt. 2 | Expt. 3 | Expected | Units |
|---|---|---|---|---|---|---|
| Vitis HLS | ARRAY_PARTITION | 1 | 8 | 8 | 8 | |
| Vitis HLS | UNROLL | 0 | 1 | Inferred | Inferred | |
| Vitis HLS | PIPELINE | 0 | 0 | 1 | 1 | |
| Vitis HLS | Requested Clock Frequency | 100 | 100 | 100 | | MHz |
| Results | ||||||
| Vitis HLS | Estimated Fmax | 144.68 | 122.68 | 144.45 | | MHz |
| Vivado - Elaboration | RTL_MULT | 1 | 16 | 16 | 8 | |
| Vivado - Elaboration | RTL_ADD (inc address logic) | 2 | 47 | 44 | 8 | |
| Vivado | Critical Path Post Synthesis | 328.95 | 119.93 | 115.33 | | MHz |
| Vivado | Critical Path Post Implementation | 288.18 | 106.70 | 104.24 | | MHz |
| Vivado | SLICE | 33 | 800 | 542 | ||
| Vivado | LUT | 93 | 1824 | 1215 | ||
| Vivado | FF | 95 | 2278 | 2113 | ||
| Vivado | DSP | 3 | 48 | 48 | ||
| Me | Loops | 64 | 8 | 8 | 8 | |
| Vitis HLS | Per loop (Initiation Interval) | 1 | 1 | 1 | 1 | Cycles |
| Vitis HLS | Pipeline Delay? | 4 | 11 | 10 | 2 | |
| Vitis HLS | Pipeline Path Depth | 2 | 5 | 5 | 2 | |
| Me | Assumed Calculation Time | 64 + 2 | 8 + 5 | 8 + 5 | 8 + 2 | Cycles |
Simply comparing the number of multipliers and adders used against those expected shows that the generated logic is inefficient. We do not appear to be at a level of maturity yet where die-hard logic designers will be persuaded to adopt this technology. If you are going to silicon with the permanence of an ASIC, or even the effort of an FPGA, surely it still makes sense to code for efficiency? A sufficient design can be sketched and coded in VHDL with generics to manage features that might be considered tradable. This design entry might be touted to software developers without FPGA experience, except there is still the wider design that presents and consumes data to this component to consider, e.g. RF input, I/O pins and constraints files.