Practical Control Set Reduction
A previous post, Taking Xilinx's Advice on Reducing Routing Congestion, covered Xilinx's advice that reducing control sets improves the efficiency of slice packing and frees up routing resources. This post explores how to get Vivado to deliver on that promise.
The same VHDL code is used throughout this post, and only the constraints are varied to achieve the desired effect of control set reduction.
library ieee;
  use ieee.std_logic_1164.all;

entity control_set_array is
  generic(
    width : positive := 4;
    depth : positive := 2
  );
  port(
    clk   : in  std_logic;
    reset : in  std_logic;
    d     : in  std_logic_vector(width-1 downto 0);
    ces   : in  std_logic_vector(width-1 downto 0);
    q     : out std_logic_vector(width-1 downto 0)
  );
end entity;

architecture rtl of control_set_array is
begin

  -- The branch is selected on depth: a depth of 1 needs only the output register.
  shift_g : if depth = 1 generate

    process(clk)
    begin
      if rising_edge(clk) then
        if reset = '1' then
          q <= (others => '0');
        else
          for i in ces'range loop
            if ces(i) = '1' then
              q(i) <= d(i);
            end if;
          end loop;
        end if;
      end if;
    end process;

  else generate -- depth >= 2

    type slv_arr_t is array (natural range <>) of std_logic_vector;
    signal dd : slv_arr_t(width-1 downto 0)(depth-2 downto 0);

  begin

    process(clk)
    begin
      if rising_edge(clk) then
        if reset = '1' then
          dd <= (others => (others => '0'));
          q  <= (others => '0');
        else
          for i in ces'range loop
            if ces(i) = '1' then
              -- Cannot use aggregate assignment "(dd(i), q(i)) <= d(i) & dd(i);" here as the compiler complains
              -- "Error: aggregate targets in an assignment must all be locally static names"
              -- Does not work with unwrapping a generate loop either.
              dd(i) <= d(i) & dd(i)(depth-2 downto 1);
              q(i)  <= dd(i)(0);
            end if;
          end loop;
        end if;
      end if;
    end process;

  end generate;

end architecture;
The generic depth is set to 2 for several reasons:
- To ensure that there is a register-to-register path for static timing analysis of the out-of-context synthesis results.
- The ratio of registers to LUTs in a slice is 8:4, and a depth of two has the two LUT3s packed into a single LUT6, hence one LUT for two registers.
- Any more, e.g. >= 4, would allow packing of the slices without needing to work with different control sets.
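The generic values and the out-of-context mode described above can be set on the synthesis command line. The following is a minimal sketch, assuming the VHDL source is already part of an open Vivado project with a part selected; the 10 ns period matches the 100 MHz clock used for the timing figures later in this post.

```tcl
# Out-of-context synthesis of the entity above with explicit generic values.
# Assumption: the source file is already in the open project and a part is selected.
synth_design -top control_set_array -mode out_of_context \
    -generic width=4 -generic depth=2

# 100 MHz clock so that the register-to-register paths are timed.
create_clock -name clk -period 10.000 [get_ports clk]
```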
Direct Connections
Given a design with direct connections to both the reset and chip enable pins on the registers, as illustrated below, the implementation stage in Vivado is constrained in its placement by multiple control sets. Only the pair of registers sharing a chip enable can be placed in the same slice. This means that four slices are required to place all the logic in this component.
1. Summary
----------

+---------------------------------------------------+-------+
| Status                                            | Count |
+---------------------------------------------------+-------+
| Total control sets                                |     4 |
| Minimum number of control sets                    |     4 |
| Addition due to synthesis replication             |     0 |
| Addition due to physical synthesis replication    |     0 |
+---------------------------------------------------+-------+
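A summary like the one above comes from Vivado's control set report. The following is a small sketch of how it might be produced from the Tcl console, assuming the synthesised or implemented design is open; the register count is included only as a cross-check against the expected width x depth.

```tcl
# Print the control set summary, plus the detailed per-control-set listing.
report_control_sets -verbose

# Cross-check: count the registers in the design (8 for width=4, depth=2).
set regs [get_cells -hierarchical -filter {IS_SEQUENTIAL}]
puts "Registers: [llength $regs]"
```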


The pblock contains the registers packed with only two registers per slice. This is as expected given the deliberate choice of value for depth above.
create_pblock pblock_1
add_cells_to_pblock -clear_locs [get_pblocks pblock_1] [get_cells {{shift_g.dd_reg[*][0]} {shift_g.q_reg[*]}}]
resize_pblock [get_pblocks pblock_1] -replace -add {SLICE_X36Y99:SLICE_X37Y100}
Now we should be able to reduce the number of slices used in the above example to one. The first step is to convert the chip enables to logic on the D pin.
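The slice counts quoted here follow from simple arithmetic. The sketch below is plain Tcl (no Vivado commands) and assumes that every register in a slice must share one control set and that a slice holds 8 registers, as stated earlier. The procedure name min_slices is hypothetical, and the result is only a lower bound on the slices needed for the registers.

```tcl
# Lower bound on the slices needed when registers from different control sets
# cannot share a slice. Argument: a list with the register count of each control set.
proc min_slices {regs_per_control_set {ffs_per_slice 8}} {
    set total 0
    foreach n $regs_per_control_set {
        # Integer ceiling of n / ffs_per_slice
        incr total [expr {($n + $ffs_per_slice - 1) / $ffs_per_slice}]
    }
    return $total
}

puts [min_slices {2 2 2 2}]   ;# direct connections: four control sets of two registers -> 4
puts [min_slices {8}]         ;# extracted enables: one control set of eight registers -> 1
```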
Extracted Logic
There are several ways this can be achieved:
- the -control_set_opt_threshold option on synth_design, which extracts resets too,
- EXTRACT_ENABLE properties on the respective nets (NB. setting the EXTRACT_RESET property did not have the expected effect here), or
- the CONTROL_SET_REMAP property on the respective registers.
# Not working on its own, needs -control_set_opt_threshold setting to be changed.
#set_property EXTRACT_RESET true [get_nets -of_objects [get_ports {reset}]]
#set_property EXTRACT_ENABLE true [get_nets -of_objects [get_ports {ces[*]}]]
# Works well
set_property CONTROL_SET_REMAP ENABLE [get_cells shift_g.* -filter {IS_SEQUENTIAL}]
It is worth noting that the EXTRACT_RESET constraint was ineffective unless -control_set_opt_threshold was also set separately on the synth_design command line, an annoyance to investigate another time. CONTROL_SET_REMAP ENABLE, however, does seem to be effective at migrating logic from the CE pin to the D pin.
1. Summary
----------

+---------------------------------------------------+-------+
| Status                                            | Count |
+---------------------------------------------------+-------+
| Total control sets                                |     1 |
| Minimum number of control sets                    |     1 |
| Addition due to synthesis replication             |     0 |
| Addition due to physical synthesis replication    |     0 |
+---------------------------------------------------+-------+
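For comparison, the synthesis-wide alternative listed above is the -control_set_opt_threshold option. The following is a sketch only: the threshold of 16 is an illustrative assumption rather than a value used in this post. As described in the Vivado synthesis documentation, control signals with a fanout below the threshold become candidates for being moved into the D-input logic, which in this design would cover both the chip enables (fanout of 2) and the reset (fanout of 8).

```tcl
# Assumption: 16 is an example threshold, chosen to exceed the fanout of
# every control signal in this small design.
synth_design -top control_set_array -mode out_of_context \
    -control_set_opt_threshold 16

# Confirm the effect on the number of control sets.
report_control_sets -verbose
```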

# PBlock placement - has sufficient capacity but still scatters logic outside the requested area
create_pblock pblock_1
add_cells_to_pblock -clear_locs [get_pblocks pblock_1] [get_cells {{shift_g.dd_reg[*][0]} {shift_g.q_reg[*]}}]
resize_pblock [get_pblocks pblock_1] -replace -add {SLICE_X36Y99:SLICE_X37Y100}
My initial attempt to constrain the implementation to a single slice using a pblock constraint was unsuccessful. Firstly, the minimum pblock size is two slices. Secondly, the placement was scattered outside the pblock even though there was supposed to be sufficient capacity for the required solution.

Xilinx says it should be possible to place all the logic in a single slice. The assumption here is that as the design is so small by comparison with the device, the placer is just not trying. So I constrained the design placement entirely manually to check that the registers previously with different control sets could now be combined into a single slice.
# Manual placement into a single SliceL
set_property BEL D5LUT [get_cells {shift_g.q[0]_i_1}]
set_property BEL D6LUT [get_cells {shift_g.dd[0][0]_i_1}]
set_property BEL DFF [get_cells {shift_g.dd_reg[0][0]}]
set_property BEL D5FF [get_cells {shift_g.q_reg[0]}]
set_property BEL C5LUT [get_cells {shift_g.q[1]_i_1}]
set_property BEL C6LUT [get_cells {shift_g.dd[1][0]_i_1}]
set_property BEL CFF [get_cells {shift_g.dd_reg[1][0]}]
set_property BEL C5FF [get_cells {shift_g.q_reg[1]}]
set_property BEL B5LUT [get_cells {shift_g.q[2]_i_1}]
set_property BEL B6LUT [get_cells {shift_g.dd[2][0]_i_1}]
set_property BEL BFF [get_cells {shift_g.dd_reg[2][0]}]
set_property BEL B5FF [get_cells {shift_g.q_reg[2]}]
set_property BEL A5LUT [get_cells {shift_g.q[3]_i_1}]
set_property BEL A6LUT [get_cells {shift_g.dd[3][0]_i_1}]
set_property BEL AFF [get_cells {shift_g.dd_reg[3][0]}]
set_property BEL A5FF [get_cells {shift_g.q_reg[3]}]
# This must come after the BEL property settings
set_property LOC SLICE_X36Y97 [get_cells * -filter {PRIMITIVE_GROUP == FLOP_LATCH || PRIMITIVE_GROUP == LUT}]
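The sixteen BEL assignments above follow a regular pattern, so they could be generated by a loop. The following is a sketch under the assumption that the cell names are exactly those in the listing above; it is not the script from the author's repository.

```tcl
# Bit 0 uses the D LUT/FF pair, bit 1 uses C, and so on, as in the listing above.
set letters {D C B A}
for {set i 0} {$i < 4} {incr i} {
    set l [lindex $letters $i]
    # Square brackets are escaped so that Tcl does not attempt command substitution.
    set_property BEL ${l}5LUT [get_cells "shift_g.q\[$i\]_i_1"]
    set_property BEL ${l}6LUT [get_cells "shift_g.dd\[$i\]\[0\]_i_1"]
    set_property BEL ${l}FF   [get_cells "shift_g.dd_reg\[$i\]\[0\]"]
    set_property BEL ${l}5FF  [get_cells "shift_g.q_reg\[$i\]"]
}
# As before, the LOC property is applied after the BEL properties.
set_property LOC SLICE_X36Y97 [get_cells * -filter {PRIMITIVE_GROUP == FLOP_LATCH || PRIMITIVE_GROUP == LUT}]
```

A loop like this only removes the typing; the cell-to-BEL mapping still has to be known in advance, so it does not make the approach any more automatic.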

So now we have achieved the desired efficient packing, but with little help from the synthesis tool. As this method is not automated it does not scale.
Scaling up the size
The next experiment is to ramp up the congestion and see whether the placer can achieve the denser packing when required. Also, for this simple design the timing should not degrade. Initially it looks like this simple example has reached the limits of its usefulness: setting the width generic to 128 gives a scattered placement that floor planning does not constrain more tightly. However, it turns out that by default pblocks have the attribute IS_SOFT set to true.




The images above show alternative results under different conditions. The Tcl source code to achieve the manual regular layout is on GitHub.
IS_SOFT
This is a Pblock property that indicates whether the Pblock must strictly be obeyed.
When the IS_SOFT property is set to TRUE, Pblocks are ignored starting with physical synthesis in placer through the end of the implementation flow. This approach is particularly helpful for preserving the overall placement while giving additional flexibility to placement algorithms that reduce congestion, move logic closer to optimal locations, and increase the efficiency of physical optimizations.
I suggest the IS_SOFT property is poorly thought out by Xilinx. Surely the expected behaviour would be 'hard' and hence constrain the area used. If placement fails, then amend the pblock. Starting off with this as 'soft' is the unexpected behaviour leading developers to think pblocks do not work as I initially did. I'm not impressed.
set_property IS_SOFT false [get_pblocks pblock_1]
Having fixed the containment issue, Vivado now manages to match my manually placed result.
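Containment can also be checked from the Tcl console rather than by eye. The following is a rough sketch, assuming the design has been placed and that pblock_1 is the pblock created earlier; it compares the sites holding registers with the sites covered by the pblock.

```tcl
# Show whether the pblock is currently a soft or a hard constraint.
puts "IS_SOFT = [get_property IS_SOFT [get_pblocks pblock_1]]"

# Sites occupied by registers versus sites covered by the pblock.
set used    [lsort -unique [get_sites -of_objects [get_cells -hierarchical -filter {IS_SEQUENTIAL}]]]
set allowed [get_sites -of_objects [get_pblocks pblock_1]]

foreach site $used {
    if {[lsearch -exact $allowed $site] < 0} {
        puts "Register placed outside pblock_1: $site"
    }
}
puts "[llength $used] sites hold registers"
```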
Timing
These results are for a width generic value of 128, so 256 registers in total. The figures are derived using a 100 MHz clock (10 ns period) for out of context synthesis and implementation.
Control Sets | Worst negative slack (setup), WNS (ns) | Worst hold slack, WHS (ns) | Maximum slack (ns) |
---|---|---|---|
Direct | 7.946 | 0.040 | 7.986 |
Extracted | 7.972 | 0.016 | 7.988 |
Difference (Extracted - Direct) | 0.026 | -0.024 | 0.002 |
The table shows that for this example VHDL code, the extraction of control set logic to the LUT provides a different timing profile with a slight improvement in setup slack at the expense of a reduction in hold slack. Vivado now has two implementations to choose from depending on the setup and hold slacks of any adjacent logic, or the power saving requirements aided by using the chip enable pin.
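Figures like these can be read from the timing summary report, or pulled out directly in Tcl. The following is a minimal sketch, assuming the implemented design is open and the 10 ns clock constraint has been applied.

```tcl
# Worst setup path and worst hold path of the implemented design.
set setup_path [get_timing_paths -setup -max_paths 1 -nworst 1]
set hold_path  [get_timing_paths -hold  -max_paths 1 -nworst 1]

set wns [get_property SLACK $setup_path]
set whs [get_property SLACK $hold_path]
puts [format "WNS = %.3f ns, WHS = %.3f ns" $wns $whs]
```

Running this once for each variant of the constraints gives the two rows to compare.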
Conclusions
It is possible to reduce control sets and achieve a more efficient packing of primitives into slices. Getting Vivado to take advantage of the reduced control sets may require constraining placement using pblocks, which need the default IS_SOFT property changed to false to be effective. The need to constrain the design with pblocks may partly explain why control set reduction on its own is ineffective at reducing routing congestion.