Practical Control Set Reduction
A previous post, Taking Xilinx's Advice on Reducing Routing Congestion, covered Xilinx's advice that reducing control sets improves the efficiency of slice packing and frees up routing resources. This post explores how to get Vivado to deliver on that promise.
The same VHDL code is used throughout this post, and only the constraints are varied to achieve the desired effect of control set reduction.
The generic depth is set to 2 for several reasons:
- To ensure that there is a register-to-register path for static timing analysis of the out-of-context synthesis results.
- The ratio of registers to LUTs in a slice is 8:4, and a depth of two has the two LUT3s packed into a single LUT6, hence one LUT for two registers.
- A greater depth, e.g. >= 4, would allow the slices to be packed without needing to combine registers from different control sets.
Direct Connections
Given a design with direct connections to both the reset and chip enable pins on the registers, as illustrated below, the implementation stage in Vivado is constrained in its placement by multiple control sets. Each pair of registers with the same chip enable can be placed in the same slice. This means that four slices are required to place all the logic in this component.
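The slice count follows from a simple packing rule, sketched here in Python as a rough model rather than a description of Vivado's placer. It assumes eight registers per slice (the 8:4 ratio mentioned earlier), that every register in a slice must share one control set, and a hypothetical design of eight registers split across four chip enables. The function name `slices_needed` and the `ce0` to `ce3` labels are illustrative only.

```python
from collections import Counter
from math import ceil

REGISTERS_PER_SLICE = 8  # assumption: 8 flip-flops per slice, as in the 8:4 ratio


def slices_needed(control_sets):
    """Lower bound on slices when a slice may only hold one control set.

    `control_sets` holds one (clock, enable, reset) tuple per register.
    """
    counts = Counter(control_sets)
    return sum(ceil(n / REGISTERS_PER_SLICE) for n in counts.values())


# Hypothetical direct-connection design: four chip enables, two registers each.
direct = [("clk", f"ce{i}", "rst") for i in range(4) for _ in range(2)]

# If the enable and reset were instead absorbed into D-pin logic, every
# register would share the same (clock, no enable, no reset) control set.
extracted = [("clk", None, None)] * 8

print(slices_needed(direct))     # 4
print(slices_needed(extracted))  # 1
```

The model only counts registers; it ignores LUT capacity and routing, so it gives a best case for the packing rather than a prediction of what the placer will do.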
It is worth noting that I had to wrestle with Vivado's pblocks to floor plan this for ease of illustration. The original four slice pblock definition had to be shifted down one row in the matrix of slices in order for the logic to be contained within the requested pblock. Otherwise it was half in and half out, occupying the next row below. There were no warnings, and overall this feels like Vivado is a bit careless about respecting floor planning constraints.
Now we should be able to reduce the number of slices used to one. The first step is to convert the reset and chip enable connections to logic on the D pin.
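To make the conversion concrete, here is a minimal behavioural sketch in Python, not the author's VHDL. It assumes a synchronous reset that takes priority over the chip enable, and checks that a register using dedicated control pins behaves the same as a plain register whose D input is driven by equivalent logic. Both function names are hypothetical.

```python
from itertools import product


def ff_with_control_pins(q, d, ce, rst):
    """Next state of a register using dedicated CE and synchronous reset pins."""
    if rst:
        return 0
    return d if ce else q


def ff_with_d_pin_logic(q, d, ce, rst):
    """Same behaviour with the control signals folded into a LUT driving D."""
    lut_out = (not rst) and ((ce and d) or ((not ce) and q))
    return int(lut_out)  # the register itself now has no CE or reset


# Exhaustively compare both forms over every input combination.
equivalent = all(
    ff_with_control_pins(q, d, ce, rst) == ff_with_d_pin_logic(q, d, ce, rst)
    for q, d, ce, rst in product((0, 1), repeat=4)
)
print(equivalent)  # True
```

Because the second form needs no CE or reset pin, registers that previously differed only in their enable can share a control set.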
Extracted Logic
There are several ways this can be achieved:
- control_set_opt_threshold option on synth_design, which extracts resets too, or
- EXTRACT_ENABLE properties on the respective nets. (NB. setting the EXTRACT_RESET property did not have the expected effect here.)
Note that the EXTRACT_RESET constraint had no effect unless -control_set_opt_threshold was also set separately on the synth_design command line. This is an annoyance to investigate another time.
My initial attempt to constrain the implementation to a single slice using a pblock constraint was unsuccessful. Firstly, the minimum pblock size is two slices. Secondly, the placement was scattered outside the pblock even though there was supposed to be sufficient logic for the required solution.
Xilinx says it should be possible to place all the logic in a single slice. The assumption here is that as the design is so small by comparison with the device, the placer is just not trying. So I constrained the design placement entirely manually to check that the registers previously with different control sets could now be combined into a single slice.
So now we have achieved the desired efficient packing, but with little help from the synthesis tool. As this method is not automated it does not scale.
Timing
The following figures are derived using a 100 MHz clock (10 ns period) for out of context synthesis and implementation.
Control Sets | Worst negative slack (setup), WNS (ns) | Worst hold slack, WHS (ns) | Maximum slack (ns) |
---|---|---|---|
Directed | 7.985 | 0.031 | 7.954 |
Extracted | 8.008 | 0.018 | 7.990 |
Difference | 0.023 | -0.013 | 0.036 |
The table shows that, for this example VHDL code, extracting the control set logic to the LUT provides an overall improvement in timing. The increase in setup slack is larger than the reduction in hold slack, giving 0.036 ns more time for signal propagation. Vivado now has two implementations to choose from depending on the setup and hold slacks of any adjacent logic.
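The arithmetic behind the Difference row can be checked with a few lines of Python. This sketch assumes the last column is setup slack minus hold slack, which is consistent with the tabulated values but is not stated explicitly. The dictionary keys and the `margin` helper are illustrative names.

```python
# Slack figures (ns) copied from the table above.
slack = {
    "directed":  {"setup": 7.985, "hold": 0.031},
    "extracted": {"setup": 8.008, "hold": 0.018},
}


def margin(row):
    """Setup slack minus hold slack (assumed meaning of the last column)."""
    return round(row["setup"] - row["hold"], 3)


setup_gain = round(slack["extracted"]["setup"] - slack["directed"]["setup"], 3)
hold_change = round(slack["extracted"]["hold"] - slack["directed"]["hold"], 3)
margin_gain = round(margin(slack["extracted"]) - margin(slack["directed"]), 3)

print(setup_gain, hold_change, margin_gain)  # 0.023 -0.013 0.036
```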
Scaling up the size
The next experiment is to see if it is possible to ramp up the congestion and see if the placer can achieve the more dense packing when required. Also, for this simple design the timing should not degrade.
Unfortunately it looks like this simple example has reached the limits of its usefulness. Setting the depth generic to 128 gives the placement above, which cannot be constrained more tightly by floor planning. Many of the shift registers have hold time violations after implementation, with a slack of -0.021 ns. This is caused by a change in Hold_FDRE_C_D between synthesis and implementation, with the value changing from 0.218 ns to a range of 0.216 - 0.255 ns. As the timing violation is marginal, the solution here is to amend the out of context (OOC) constraints to the worst case slow process minimum hold delay for Hold_FDRE_C_D of 0.255 ns, after which the warnings evaporate. This is justified since it is consistent with the original intent of setting up OOC synthesis. The inability to automatically control placement at scale means no further experimentation is deemed useful.
The two images above show alternative results under different conditions. The TCL source code to achieve this regular layout is on GitHub.
Conclusions
It is possible to reduce control sets and achieve a more efficient packing of primitives into slices. However, the difficulty of getting Vivado to take advantage of the reduced control sets when packing logic into slices may significantly limit how much control set reduction actually contributes to reducing routing congestion.