Having placed the primitives, the implementation tools then need to connect them during the routing stage. Only after routing do we know the actual net delays. (I won't suggest the two stages are entirely separate; some placement decisions are presumably based on routability, at least at some stage.) In a congested design, wires may take detours to reach their connections, generating longer delays and occupying more routing resources. Routing congestion degrades design performance and can even lead to implementation failures, such as failure to meet timing requirements. So what do you do when the tools complain about congested routing and start making multiple iterations to see if they can improve on the negative slack?
- Congestion Reporting
- Xilinx "QuickTake" Video
- Control Sets
- Defaults Illustrated for Kintex-7 Device
- Control Signals
- Clock Enables
- Reducing the Number of Control Sets
- Control Set Remapping
- Resets
- Comparison of Resets
- Hybrid
- Effect of Reset Removal on Control Sets
- Coding Partial Resets
- Full Reset
- Bad Partial Reset
- Good Partial Reset
- Coding Style
- Original Full Reset
- Partial Reset With Duplication
- Partial Reset Without Duplication
- Shift Register LUT (SRL)
- Conclusions
- References
As the number of unique control sets increases, it becomes more difficult to fit a device.
This situation can also result in routing congestion which can cause timing degradation or nets that cannot route completely.
So the above is a little dated now, but still seems to apply. From the various literature sources I glean a basic plan:
- Increase flexibility in primitive placement
- Reduce routing demand
The former is very much related to the concept of "Control Sets" and the latter to reduction in fanout. This approach feels very much in line with some of the "Artificial Intelligence" papers that aim to predict routing congestion in order to better guide placement.
- Congestion Prediction in FPGA Using Regression Based Learning Methods
- Faster FPGA Routing by Forecasting and Pre-Loading Congestion Information
- Machine Learning Based Routing Congestion Prediction in FPGA High-Level Synthesis
Congestion Reporting
As with many important design problems in FPGAs, there's a report for everything. For congestion, it's an optional part of the report_design_analysis command.
report_design_analysis generates a congestion table that provides details about the nature of congestion and region(s) associated with the highest congestion in a particular direction and type.
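As a rough sketch, the congestion table can be requested from the Tcl console once a placed or routed design is open. The checkpoint and report file names below are placeholders, not taken from a real project.

```tcl
# Open a placed (or routed) checkpoint; the file name is hypothetical.
open_checkpoint post_place.dcp

# Print the congestion tables to the Tcl console.
report_design_analysis -congestion

# Or write them to a file so that runs can be compared side by side.
report_design_analysis -congestion -file congestion.rpt
```

Writing the report to a file after each experiment makes it easier to see whether a change moved the congested region or merely reshaped it.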
I struggle to glean any useful information from the above report that allows me to decide what changes need to be made. This means I'm diving into manuals to find more general clues about how to avoid routing congestion.
There's a detailed blog on the Xilinx Forums, 66314 - Vivado Congestion, that covers multiple approaches or perhaps 'experiments' to try. I'll try to unpick a subset of that below.
Xilinx "QuickTake" Video
Points to note from the video are:
- High fanout control signals
- Determine if the signals that have a fanout > 1000 are resets or clock enables
- There is less flexibility in how the design gets implemented when the device utilization is high (usually over 80%)
- Avoid asynchronous resets
- They prevent logic from being merged into the block RAM and DSP slice resources
- SRLs cannot be inferred with any reset behaviour (NB. I qualify this loosely worded advice later, with a more precise answer based on experiments with Vivado.)
- Disable KEEP_HIERARCHY options and/or attributes during synthesis to ensure all possible optimizations can be done by your synthesis tool
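The first points in the list can be checked with a couple of standard reports. This is only a sketch; the threshold of 1000 comes from the video, while the limit of 50 nets is an arbitrary choice for illustration.

```tcl
# List nets with more than 1000 loads. Breaking the loads down by type
# helps to show whether a net is acting as a reset or a clock enable.
report_high_fanout_nets -fanout_greater_than 1000 -max_nets 50 -load_types

# Check overall device utilisation against the "usually over 80%" guideline.
report_utilization
```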
Control Sets
When reading up on congestion, flexibility and routability, a recurring topic is that of "control sets" and how they affect packing efficiency.
A control set is the grouping of control signals (set/reset, clock enable and clock) that drives any given SRL, LUTRAM, or register. For any unique combination of control signals, a unique control set is formed. This is important, because registers within a 7 series slice all share common control signals, and thus, only registers with a common control set can be packed into the same slice. For example, if a register with a given control set has just one register as a load, the other seven registers in the slice it occupies will be unusable.
Control Signals and Control Sets, Xilinx UG949
What's included in a Control Set:
- Clock
- Synchronous
  - Set (SET)
  - Reset (RESET)
- Asynchronous
  - Set (PRE)
  - Reset (CLR)
- Clock Enable (CE)
Designs with too many unique control sets might have many wasted resources as well as fewer options for placement, resulting in higher power and lower achievable clock frequency. Designs with fewer control sets have more options and flexibility in terms of placement, generally resulting in improved results.
Control Signals and Control Sets, Xilinx UG949
You can force control set mapping by applying the DIRECT_RESET / DIRECT_ENABLE / EXTRACT_RESET / EXTRACT_ENABLE attributes as needed to handle the mapping of control sets for a given structure.
Controlling Enable/Reset Extraction with Synthesis Attributes, Xilinx UG949
To use control set mapping you can apply attributes to the nets connected to enable/reset signals, which will force synthesis to use the CE/R pin.
Using DIRECT_ENABLE and DIRECT_RESET, Xilinx UG949
I'll illustrate the effects of these attributes (properties) below.
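A

```tcl
# Force the reset and enable nets onto the registers' R and CE pins.
set_property DIRECT_RESET true \
  [get_nets -of_objects [get_ports {ra}]]
set_property DIRECT_ENABLE true \
  [get_nets -of_objects [get_ports {ena}]]
```

B

```tcl
# Ask synthesis to extract the reset and enable into the control set.
set_property EXTRACT_RESET true \
  [get_nets -of_objects [get_ports {rb}]]
set_property EXTRACT_ENABLE true \
  [get_nets -of_objects [get_ports {enb}]]
```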
C - Defaults, the dependencies here are covered later.
When the design includes a synchronous reset/enable, synthesis creates a logic cone mapped through the CE/R/S pins when the load is equal to or above the threshold set by the -control_set_opt_threshold synthesis switch, or creates a logic cone that maps through the D pin if below the threshold. The default thresholds are:
- 7 series devices: 4
- UltraScale devices: 2
Controlling Enable/Reset Extraction with Synthesis Attributes, Xilinx UG949
Defaults Illustrated for Kintex-7 Device
For the 7 series devices the default control_set_opt_threshold is 4. This means when the control set fanout increases from 3 to 4, emulation in the data path is replaced by direct connections as illustrated below.
It is not immediately obvious to me why these default values of control_set_opt_threshold are chosen, and why they vary between device families.
When a logic path ends at a fabric register (FD) clock enable, or synchronous set/reset, the property on the register instructs Vivado logic optimization to map the enable or reset signal to the data pin (D), which has a dedicated LUT connection and can be faster. If possible, the logic is combined with an existing LUT driving the D-input to prevent the insertion of extra levels of logic.
CONTROL_SET_REMAP, Xilinx UG912
Experimentation shows consistently that setting control_set_opt_threshold high, which causes the LUT to be employed more readily, can easily be overridden with DIRECT_* properties. Setting control_set_opt_threshold to zero, which causes direct control set connections, cannot be so readily overridden by EXTRACT_* properties. There must be timing-driven decisions preventing the request, and this is supported by the more successful attempt to remap the control sets below.
Control Signals
This section pulls together the relevant parts of Xilinx's User Guides. In that respect it is a bit lame and un-insightful. However, at the end of this section you will see a more reliable way of affecting control set remapping.
Tips for Control Signals
- Check whether a global reset is really needed - The lead I followed
- Avoid asynchronous control signals
- Keep clock, enable, and reset polarities consistent
- Do not code a set and reset into the same register element
- If an asynchronous reset is absolutely needed, remember to synchronize its de-assertion.
Tips for Control Signals, Xilinx UG949
Clock Enables
Clock Enables
When used properly, clock enables can significantly reduce design power with little impact on area or maximum clock frequency. However, when clock enables are used improperly, they can lead to:
- Increased resource utilization
- Decreased placement density
- Increased power
- Reduced achievable clock frequency
In most cases, low fanout clock enables are the main contributor to the high number of control sets.
Clock Enables, Xilinx UG949
Creating Clock Enables
Clock enables are created when an incomplete conditional statement is coded into a synchronous block. A clock enable is inferred to retain the last value when the prior conditions are not met. When this is the desired functionality, it is valid to code in this manner. However, in some cases when the prior conditional values are not met, the output is a don't care. In that case, Xilinx recommends closing off the conditional (that is, use an else clause), with a defined constant (that is, assign the signal to a one or a zero).
In most implementations, this does not result in added logic, and avoids the need for a clock enable. The exception to this rule is in the case of a large bus when inferring a clock enable in which the value is held can help in power reduction. The basic premise is that when small numbers of registers are inferred, a clock enable can be detrimental because it increases control set count. However, in larger groups, it can become more beneficial and is recommended.
Creating Clock Enables, Xilinx UG949
Reducing the Number of Control Sets
If the number of control sets is high, use one of the following strategies to reduce their number:
- Remove the MAX_FANOUT attributes that are set on control signals in the HDL sources or constraint files. Replication on control signals dramatically increases the number of unique control sets. Xilinx recommends relying on place_design to perform coarse replication and using phys_opt_design -directive Explore for finer replication after placer. This prevents unnecessary replication and equivalent control sets from crossing each other, which can lead to routing congestion.
- Increase the control set threshold of Vivado synthesis (or other synthesis tool). Review the control sets fanout distribution table in report_control_sets -verbose to determine a more appropriate control sets threshold to use during synthesis. Note that increasing control_set_opt_threshold can have negative impacts on power by eliminating clock enables that can actively reduce power. For example:
  - synth_design -control_set_opt_threshold 16
  Tip: Use the BLOCK_SYNTH synthesis constraints to change the control sets threshold on modules that are the most impacted by placement spreading or congestion.
- Use opt_design -control_set_merge or opt_design -merge_equivalent_drivers to merge equivalent control sets after synthesis.
- Use the CONTROL_SET_REMAP property to map low-fanout control signals driving the synchronous set/reset and/or CE pin of a register to the D-input. For more information, see Control Set Reduction in the Vivado Design Suite User Guide: Implementation (UG904).
- Avoid low fanout asynchronous set/reset (preset/clear), because they can only be connected to dedicated asynchronous pins and cannot be moved to the datapath by synthesis. For this reason, the synthesis control set threshold option does not apply to asynchronous set/reset.
- Avoid using both active-High and active-Low of a control signal for different sequential cells.
- Only use clock enable and set/reset when necessary. Often data paths contain many registers that automatically flush uninitialized values, and where set/reset or enable signals are only needed on the first and last stages.
Reduce the Number of Control Sets, Xilinx UG949
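Pulling the commands named in that list together gives a sketch along the following lines. The top-level name and part number are hypothetical, and 16 is simply the threshold value from the quoted example rather than a recommendation.

```tcl
# Synthesise with a raised control set threshold.
# "my_top" and the part number are placeholders for your own design.
synth_design -top my_top -part xc7k70tfbv676-1 -control_set_opt_threshold 16

# Review the control set count and the fanout distribution table.
report_control_sets -verbose -file control_sets_synth.rpt

# Merge equivalent control sets during logic optimisation.
# Assumption to verify in UG904: naming a specific optimisation may restrict
# opt_design to just that step instead of its default set.
opt_design -control_set_merge

# Re-check the control sets after optimisation.
report_control_sets -verbose -file control_sets_opt.rpt
```

Comparing the two report files shows whether the merge step actually changed anything for a given design.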
Looking for the more general design principles affecting VHDL and constraints, it seems removing clock enables by recoding for a complete if-else clause and CONTROL_SET_REMAP properties in constraints are the most likely leads. I've already shown the limits of control set remapping with EXTRACT_* properties prior to synthesis. So now I'll experiment with the CONTROL_SET_REMAP property which takes effect during implementation (i.e. not during synthesis).
Control Set Remapping
Control Set Reduction
Designs with several unique control sets can have fewer options for placement, resulting in higher power and lower performance. Designs with fewer control sets have more options and flexibility in terms of placement, generally resulting in improved results. The number of unique control sets can be reduced by applying the CONTROL_SET_REMAP property to a register that has a control signal driving the synchronous set/reset pin or CE pin. This triggers the optional control set reduction phase and maps the set/reset and/or CE logic to the D-input of the register. If possible, the logic is combined with an existing LUT driving the D-input, which prevents extra levels of logic.
Control Set Reduction, Xilinx UG904
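A minimal sketch of applying the property in constraints follows. The hierarchical name pattern is hypothetical, and selecting only FDRE cells is my assumption about which registers are of interest.

```tcl
# Select the registers to remap; "*u_ctrl*" is a hypothetical instance pattern.
set regs [get_cells -hierarchical -filter {REF_NAME == FDRE && NAME =~ "*u_ctrl*"}]

# ENABLE remaps the CE, RESET remaps the synchronous set/reset, ALL remaps both.
set_property CONTROL_SET_REMAP ALL $regs

# The remapping is attempted during logic optimisation.
opt_design

# Confirm whether the number of unique control sets actually changed.
report_control_sets -verbose
```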
From experimentation on small designs, this optimisation is not always applied during opt_design or implementation, probably because it was entirely unnecessary. However the images below illustrate the changes between synthesis and implementation when it is successfully applied. I'll return to control sets briefly again later when considering the reset input.
Both the CE and R pins of the FDREs get tied off (CE to logic 1, R to logic 0), and the reset (rst) and the clock enables are applied through the local LUT instead. A second consequence of my forcing the implementation to extract the resets and enables was to fail hold timing after implementation.
While the EXTRACT_* properties might not readily be applied, the CONTROL_SET_REMAP properties are applied later, during implementation, and they can worsen the static timing results. Remapping control sets feels like it is of limited benefit. Unless you can recode the VHDL to remove the need for the clock enable, there's not much you can do to remap them without reducing timing slack or making it negative. So you are left playing with global settings for the place and route commands and throwing more algorithms at your design.
Having reduced the number of control sets you then expect the tools will take advantage. A separate post on practical control set reduction explores the ease with which the tools can achieve a more efficient packing and hence reduce routing usage.
Resets
So if clock enable nets now have limited scope for reduction in fanout, we are left looking at the reset signal(s). Can we reduce the fanout in order to reduce routing congestion?
When and Where to Use a Reset
Xilinx devices have a dedicated global set/reset signal (GSR). This signal sets the initial value of all sequential cells in hardware at the end of device configuration.
If an initial state is not specified, sequential primitives are assigned a default value. In most cases, the default value is zero. Exceptions are the FDSE and FDPE primitives that default to a logic one. Every register will be at a known state at the end of configuration. Therefore, it is not necessary to code a global reset for the sole purpose of initializing a device on power up.
Xilinx highly recommends that you take special care in deciding when the design requires a reset, and when it does not. In many situations, resets might be required on the control path logic for proper operation. However, resets are generally less necessary on the data path logic. Limiting the use of resets:
- Limits the overall fanout of the reset net.
- Reduces the amount of interconnect necessary to route the reset.
- Simplifies the timing of the reset paths.
- Results in many cases in overall improvement in clock frequency, area, and power.
When and Where to Use a Reset, Xilinx UG949
To understand the difference between control and data path logic I refer you to the section above on Control Sets. Essentially "control path logic" terminates in a reset or clock enable pin on a sequential primitive. (For completeness, you wouldn't put logic in the clock path now would you!)
Why Remove Resets?
- Massive reduction in fanout of a generally high fanout net (or nets)
- It's a synchronous signal so needs to meet timing too
- Reduce routing requirements
- A 64-bit bus requires routing reset to 64 locations
- Overall: reduces tension in place & route
Best to start with a quick revision on why Xilinx recommends synchronous resets.
Comparison of Resets
[Figures: synchronous reset and asynchronous reset, shown side by side]
Crucially, a synchronous reset ensures that you do not come out of the reset condition in the period between the setup and hold time, potentially causing metastability.
Hybrid
Intel's documentation covers this hybrid reset, going into the reset condition asynchronously and coming out synchronously.
2.3.1.3. Use Synchronized Asynchronous Reset
To avoid potential problems associated with purely synchronous resets and purely asynchronous resets, you can use synchronized asynchronous resets. Synchronized asynchronous resets combine the advantages of synchronous and asynchronous resets.
These resets are asynchronously asserted and synchronously deasserted. This takes effect almost instantaneously, and ensures that no datapath for speed is involved. Also, the circuit is synchronous for timing analysis and is resistant to noise.
Use Synchronized Asynchronous Reset, Intel Quartus Prime Pro Edition User Guide: Design Recommendations
Effect of Reset Removal on Control Sets
Perhaps an obvious statement here, reset removal does not reduce the number of control sets. In a rather basic design I measured the control set usage, then removed the reset keeping the functionality, and then remeasured the control sets. The before and after pictures differ only by the highlighted rst. If you just take out the resets, you still have separation by clock enable, and the same number of control sets.
Coding Partial Resets
What follows is perhaps not as obvious as those who understand it might think.
Full Reset
This is the example from which we will remove the reset on 'b'. The results from synthesis in Vivado are very much unremarkable.
Bad Partial Reset
Here we demonstrate the problem with the naïve way of removing the reset from a signal. The register b is only assigned when not in reset, which is aptly illustrated by the small LUT providing the inverter on bo_reg's CE pin. NB. You will need to set the synthesis option control_set_opt_threshold to 0 to replicate this schematic, as otherwise a LUT is placed on the register's D input. Actually that's a perfectly good solution; it's just less clear what's going on for the purposes of this illustration.
Good Partial Reset
To do partial resets properly, code the reset clause last so that it overrides any of the logic assignments above.
Coding Style
Here I want to caution against the suggestion that separation of the assignments is a better solution. Consider the case where a condition for assignment is shared.
Original Full Reset
Partial Reset With Duplication
Now you have to duplicate the condition, which feels unnecessary because it is avoidable. Also note the option to code a global set/reset (GSR) value by initialising the signal's value in its declaration, which removes 'X's from the simulation and rids ModelSim of those red coloured traces. Some might prefer not to initialise signals, in order to verify that unknown values do not propagate through the design.
Partial Reset Without Duplication
The condition can remain factored out when coding the reset condition last. This style feels preferable.
Shift Register LUT (SRL)
Just a final point about the SRL inferencing mentioned in the Xilinx QuickTake video.
Approximately two-thirds of the slices are SLICEL logic slices and the rest are SLICEM, which can also use their LUTs as distributed 64-bit RAM or as 32-bit shift registers (SRL32) or as two SRL16s.
7 Series FPGAs Configurable Logic Block, Xilinx UG474
This means consecutive FDREs can be converted to a single SRL32 or two SRL16s when they have no reset condition. This is helpful when you consider that each 7 series slice (M or L) provides only 4 LUTs and 8 registers, so a single SLICEM LUT used as an SRL32 holds more shift register stages than all of the registers in a slice. When coding without reset clauses you get this efficiency for free. Using the address inputs to the SRLs (A[4:0] on an SRLC32E), the delay can be shortened (or even made dynamic).
Note that on conversion the first and last FDREs are retained as standard registers and not converted. This is the default SRL style; see the explanation of the SRL_STYLE attribute in order to change this behaviour.
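As a hedged sketch, the style can be requested from the constraints file instead of the HDL. This assumes the attribute is accepted as an XDC property on the shift register cells before synthesis, which is my reading of the synthesis user guide; the cell name pattern is hypothetical.

```tcl
# Ask synthesis to map the whole chain into SRLs, with no leading or trailing
# register. "u_delay/shift_reg[*]" is a hypothetical register name pattern.
# Other values I believe are documented: register, srl_reg, reg_srl, reg_srl_reg.
set_property SRL_STYLE srl [get_cells {u_delay/shift_reg[*]}]
```

Removing the leading and trailing registers saves flip-flops but may lengthen the paths into and out of the SRL, so it is worth re-checking timing afterwards.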
When using a reset clause, it is possible to get the SRL conversion too, but additional logic is added to "fake" the reset by gating the SRL output. A previous blog provides a more precise analysis.
Conclusions
Control sets feel largely like a bit of a distraction when trying to reduce routing congestion. The biggest bang for buck is removing resets. I don't like this reset coding style because I prefer to see the reset clause up front. I will gladly suffer it because of the beneficial effect on meeting timing of not resetting all registers, and the ease of using this new style.
References
- UltraFast Design Methodology Guide for Xilinx FPGAs and SoCs, Xilinx UG949
- Vivado Design Suite Properties Reference Guide, Xilinx UG912
- 66314 - Vivado Congestion, Xilinx Support Forum
- Practical Control Set Reduction