Power Reduction using Vivado
Power Basics
The section "2.3.4 Power Dissipation in CMOS Circuits" in VLSI Circuit Technologies by Lars Wanhammar explains the power consumption of CMOS circuits. CMOS circuits have negligible static power dissipation: the leakage current is in the nanoampere range, and the power consumed through leakage is typically less than 1% of the total. Significant power is dissipated when the output switches from one state to the other, as over a complete switching cycle the stray and load capacitances are charged through the p-transistor and discharged through the n-transistor.
\[ P = f \times C \times V^2 \]
Where:
- P = power
- f = frequency
- C = capacitance
- V = voltage
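As a purely illustrative calculation (the numbers are not taken from any particular device, and the formula as written assumes the node switches every clock cycle), 10 pF of switched capacitance toggling at 100 MHz from a 1.0 V supply dissipates:
\[ P = (100 \times 10^{6}) \times (10 \times 10^{-12}) \times 1.0^2 = 1 \times 10^{-3} \text{ W} = 1 \text{ mW} \]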
Using Clock Enables (CEs)
The Vivado power optimizer takes advantage of the abundant supply of Clock Enables (CEs). Power optimization creates gating logic to drive register clock enables such that registers only capture data on relevant clock cycles.
Note that in the actual silicon, the CE pin gates the clock rather than selecting between the D input and the fed-back Q output of the flip-flop. This improves the performance of the CE input and also reduces clock power.
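To make the transformation concrete, here is a minimal hand-written sketch (the entity and signal names are mine, this is not the optimiser's output) of the same register described two ways. In the first process the enable is written as feedback logic on the D input, leaving CE tied high; in the second it is written so that synthesis maps it onto the dedicated CE pin, which is the form power optimisation aims to create.
library ieee;
use ieee.std_logic_1164.all;

entity ce_sketch is
  port (
    clk : in  std_logic;
    en  : in  std_logic;
    d   : in  std_logic;
    q1  : out std_logic := '0';
    q2  : out std_logic := '0'
  );
end entity;

architecture rtl of ce_sketch is
begin

  -- Enable written as a multiplexer feeding the D input (reads the output port, so VHDL-2008).
  mux_form : process(clk)
  begin
    if rising_edge(clk) then
      q1 <= (d and en) or (q1 and not en);
    end if;
  end process;

  -- Enable written so it maps naturally onto the flip-flop's dedicated CE pin.
  ce_form : process(clk)
  begin
    if rising_edge(clk) then
      if en = '1' then
        q2 <= d;
      end if;
    end if;
  end process;

end architecture;
Note that synthesis may well infer a CE from either description; the point of the experiment below is to start from a design where the CE is genuinely unused.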
This optimisation is achieved using a Vivado TCL command, power_opt_design, described as follows:
power_opt_design
The power_opt_design command analyzes and optimizes the design for power. By default it analyzes and optimizes the entire design, and it performs intelligent clock gating to reduce dynamic power.
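As a minimal sketch of interactive use on an opened synthesised design (the report file names are placeholders of mine), the optimisation can be followed by report_power_opt, which lists the clock-enable gating that was inserted:
open_run synth_1
power_opt_design
report_power_opt -file power_opt.rpt
report_power -file power.rpt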
Experiment
I wanted to verify that I can apply this technique correctly and understand the design trade-offs when it is used. The first issue I had was choosing a simple design to apply the command to. One needs to start with a design that would naturally not use the CE pin, so that it is available for migrating logic. Starting with a design that does use it, and then trying to persuade Vivado to move the logic to the D input just so you can watch it migrate back with power optimisation, does not really work. Here I borrow the design I used in Determining A Device's Maximum Clock Speed.
library ieee;
use ieee.std_logic_1164.all;

entity clock_speed is
  generic (
    clk_wiz_g : natural;
    length_g  : positive := 255
  );
  port (
    clk_ext : in  std_logic;
    input   : in  std_logic;
    output  : out std_logic := '0'
  );
end entity;

architecture rtl of clock_speed is

  signal clk     : std_logic := '0';
  signal input_r : std_logic := '0';
  signal vector  : std_logic_vector(length_g-1 downto 0) := (others => '0');

begin

  -- Artix-7 and Spartan-7 families
  cw : if clk_wiz_g = 1 generate

    -- MMCME2_ADV primitive
    mmcm_i : entity work.clk_wiz_1
      port map (
        clk_in1  => clk_ext,
        reset    => '0',
        clk_out1 => clk
      );

  else generate

    -- MMCME3_ADV primitive
    mmcm_i : entity work.clk_wiz_0
      port map (
        clk_in1  => clk_ext,
        reset    => '0',
        clk_out1 => clk
      );

  end generate;

  process(clk)
  begin
    if rising_edge(clk) then
      input_r          <= input;
      (output, vector) <= (vector & input_r) XOR ('1' & vector);
    end if;
  end process;

end architecture;
There are two choices for when to perform additional power optimisations over and above the standard optimisations.
Running Power Optimization
Power optimization works on the entire design or on portions of the design (when set_power_opt is used) to minimize power consumption. Power optimization can be run pre-place or post-place in the design flow, but not in both places. The pre-place power optimization step focuses on maximizing power saving. This could result in timing degradation in rare cases. If preserving timing is the primary goal, the post-place power optimization step is the recommended option. This step performs only those power optimizations that preserve timing.
Vivado Design Suite User Guide: Power Analysis and Optimization (UG907)
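To make the two options concrete, here is a rough non-project flow sketch (the part number is a placeholder and not necessarily the device I used); option 1 runs power_opt_design before place_design for maximum saving, while option 2 runs it after placement to preserve timing. The set_power_opt command mentioned in the quote can be used beforehand to restrict the optimisation to portions of the design.
synth_design -top clock_speed -part xc7a200tsbg484-1
opt_design
# Optionally restrict the scope of the optimisation to parts of the design:
# set_power_opt -exclude_cells [get_cells mmcm_i]

# Option 1: pre-place, maximum power saving (may degrade timing).
power_opt_design
place_design
route_design

# Option 2: post-place, timing preserving (run instead of the pre-place call above).
# place_design
# power_opt_design
# route_design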
In my example, I wish to maximise the power savings at the expense of timing in order to see the differences made. The process I'm using will include both synthesis and implementation.

This process allows the comparison of two synthesis and two implementation results, before and after power optimisation. It also allows analysis of the differences between the synthesis and implementation results. report_power states that the accuracy of the tool is not optimal until the design is fully placed and routed (under "1.3 Confidence Level").
Results
# Work out the shortest clock period, assuming synthesis did not just stop when meeting the timing goal. This
# is the shortest clock period for the current design.
#
proc period_min_local {} {
  set tp    [get_timing_paths -from [get_cells {vector_reg[*]}] -to [get_cells {vector_reg[*]}] -nworst 1 -setup]
  set setup [get_property SLACK $tp]
  return [expr {[get_property REQUIREMENT $tp] - $setup}]
}

# Convert the shortest clock period into a maximum clock frequency.
#
proc fmax_local {} {
  # MHz divide by ns * MHz => (1e-9 * 1e6) = 1e-3
  return [expr {1e3 / [period_min_local]}]
}

# Perform a test run on the candidate design for the effects of power optimisation. Run in project mode as we
# only want the results not a scriptable method.
#
# Parameters:
# * popt : boolean. Should power optimisation be performed?
# * impl : boolean. Should synthesis be followed by implementation too?
#
proc test_run {popt {impl true}} {
  reset_run synth_1
  puts "Running synthesis"
  launch_runs synth_1 -jobs 14
  wait_on_runs synth_1
  open_run synth_1
  if {$popt} {
    puts "Running power optimisation"
    set rpt_name {With Power Optimisation}
    power_opt_design
  } else {
    puts "Not running power optimisation"
    set rpt_name {Without Power Optimisation}
  }
  # CE tied high with 2 input LUTs
  report_power
  puts "Synthesis $rpt_name"
  puts "Period min (local) = [format "%.3f" [period_min_local]] ns"
  puts "Fmax (local) = [format "%.2f" [fmax_local]] MHz"
  report_timing -from [get_cells {vector_reg[*]}] -to [get_cells {vector_reg[*]}] -delay_type max -max_paths 1 -sort_by group -input_pins -routable_nets
  report_timing -from [get_cells {vector_reg[*]}] -to [get_cells {vector_reg[*]}] -delay_type min -max_paths 1 -sort_by group -input_pins -routable_nets
  show_schematic [list [get_ports *] [get_cells -hier *]] -name $rpt_name
  report_timing -from [get_cells {vector_reg[*]}] -to [get_cells {vector_reg[*]}] -delay_type min_max -max_paths 10 -sort_by group -input_pins -routable_nets -name $rpt_name
  report_power -name $rpt_name
  if {$impl} {
    puts "Running implementation"
    launch_runs impl_1 -jobs 14
    wait_on_runs impl_1
    open_run impl_1
    # CE tied high with 2 input LUTs
    report_power
    puts "Implementation $rpt_name"
    puts "Period min (local) = [format "%.3f" [period_min_local]] ns"
    puts "Fmax (local) = [format "%.2f" [fmax_local]] MHz"
    report_timing -from [get_cells {vector_reg[*]}] -to [get_cells {vector_reg[*]}] -delay_type max -max_paths 1 -sort_by group -input_pins -routable_nets
    report_timing -from [get_cells {vector_reg[*]}] -to [get_cells {vector_reg[*]}] -delay_type min -max_paths 1 -sort_by group -input_pins -routable_nets
    show_schematic [list [get_ports *] [get_cells -hier *]] -name $rpt_name
    report_timing -from [get_cells {vector_reg[*]}] -to [get_cells {vector_reg[*]}] -delay_type min_max -max_paths 10 -sort_by group -input_pins -routable_nets -name $rpt_name
    report_power -name $rpt_name
  }
}

test_run false
#test_run true
Prior to power optimisation the CE pin is tied to logic '1'.

Running power_opt_design has the desired effect of migrating logic to drive the CE input pin of each register as seen below.

The pertinent results of each of the four runs are tabulated next. The static timings are taken across the shift register only and exclude timing information from the I/O.
Power Optimisation | Synthesis Power (W) | Implementation Power (W) | Synthesis Fmax (MHz) | Implementation Fmax (MHz) | Synthesis Period (ns) | Implementation Period (ns) |
---|---|---|---|---|---|---|
No | 1.606 | 1.169 | 1748.25 | 732.60 | 0.572 | 1.365 |
Yes | 1.072 | 1.138 | 1647.45 | 654.88 | 0.607 | 1.527 |
Power optimisation would initially appear to deliver a 33% reduction, looking at the synthesis results. Comparing the implementation results, however, this optimistic figure collapses to a mere 2.7% reduction. The static timing analysis results are also interesting in that they show power optimisation has increased the minimum clock period between registers (by 6% after synthesis and nearly 12% after implementation). This is the trade-off I expected, but without the power savings I had hoped for. Part of the timing difference can be explained by the post-synthesis static timing results having negative slack on the hold time. This typically means that implementation will add delays to compensate (for more details see Notes on Fixing Hold Time Violations), thus reducing setup slack. Hence we should anticipate an increase in the minimum clock period, but perhaps limited to more like 0.051 ns, the hold time violation value after synthesis here.
On-Chip | No Power Opt (W) | Power Opt (W) | Saved (W) | Saved (%) |
---|---|---|---|---|
Clocks | 0.102 | 0.209 | -0.107 | -104.9 |
CLB Logic | 0.317 | 0.163 | 0.154 | 48.6 |
> LUT as Logic | 0.264 | 0.136 | 0.128 | 48.5 |
> Register | 0.054 | 0.027 | 0.027 | 50.0 |
Signals | 0.134 | 0.153 | -0.019 | -14.2 |
MMCM | 0.126 | 0.126 | 0 | 0.0 |
I/O | 0.002 | 0.001 | 0.001 | 50.0 |
Static Power | 0.487 | 0.487 | 0 | 0.0 |
Total | 1.169 | 1.138 | 0.031 | 2.7 |
Results Analysis
The first point I note is that the static power consumption is not as negligible as the opening section led us to believe. Static power does increase with the number of gates in the design; this can be verified by altering the generic values used. Static power is required not just for the logic, but also for the RAM layer used to hold the FPGA bitstream once downloaded. Without a realistic design it is not possible to comment further.
These results show that much of the saving made in the logic was offset by the clocks. The textual power reports do not include the following figures, which can be seen in the GUI only (as far as I can tell).


These figures show that the BELs have been distributed over a greater number of sites: 755 before and 2631 after optimisation. Examining the device layout, the density of packing logic into sites has been reduced, with gates spread over a wider area.
This is due to the change in control sets. report_control_sets (see the example invocation below) reports that the number of control sets has increased from 1 to 10001. This means more sites need to be used to fit the logic, and the clock signal now needs to drive more disparate logic across more sites. This effect would in all probability be less pronounced with a wider data bus (this design is 1 bit wide), where the clock enable is used on multiples of 4 registers. This analysis does indicate why the power savings might not be as great as hoped; without the increase in clocked slices the power saving would have been nearer 12%. These pictures also appear to cast doubt on previous work (Practical Control Set Reduction) looking at Vivado taking advantage of control sets. There is a modicum of slice packing occurring before optimisation, and without a PBLOCK constraint, but it is not at all dense packing when zoomed out.
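For reference, a minimal invocation of the command used for that count looks like the following; the -verbose option additionally lists each control set with the registers it drives, and the report file names are simply my choice.
report_control_sets -file control_sets.rpt
report_control_sets -verbose -file control_sets_verbose.rpt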
Looking at the differences in the static timing results for implementation, before and after optimisation, shows that many of the differences are simply down to variation in the timing properties of gates in different sites, e.g. Setup_xFF_SLICEM_C_D and Prop_x6LUT_SLICEM_I1_O, or net delays from a different placement layout. These cannot practically be influenced. A big difference also came from "clock pessimism": the amount changed from 0.185 ns to 0.116 ns with power optimisation.
Clock Pessimism Removal
A typical timing path report shows the delay details of both source and destination clock paths, from their root to the sequential cell clock pins. […]
In many cases, the CPR accuracy changes before and after routing. For example, let's consider a timing path where the source and destination clocks are the same clock, and the startpoint and endpoint clock pins are driven by the same clock buffer.
Before routing, the common point is the clock net driver, that is, the clock buffer output pin. CPR compensates only for the pessimism from the clock root to the clock buffer output pin.
After routing, the common point is the last routing resource shared by the source and destination clock paths in the device architecture. This common point is not represented in the netlist, so the corresponding CPR cannot be directly retrieved by subtracting common clock circuitry delay difference from the timing report. The timing engine computes the CPR value based on device information not directly exposed to the user.
Vivado Design Suite User Guide: Design Analysis and Closure Techniques (UG906)
In these results the CPR is added to the required time. This means that less is added to the required time after power optimisation, while at the same time the changes in logic and net delays mean the arrival time has increased. This is a two-fold setback for the slack. User Guide 906 provides no way to derive the clock pessimism by hand, as the timing engine computes the CPR value based on device information not directly exposed to the user.
A final note on switching logic between CE and D pins. We can extract timing information from the library with the following commands post synthesis:
get_property DELAY_SLOW_MAX_RISE [get_timing_arcs -to [get_pins {vector_reg[9424]/CE}] -filter {TYPE == "setup"}]
0.047
get_property DELAY_SLOW_MAX_RISE [get_timing_arcs -to [get_pins {vector_reg[9424]/D}] -filter {TYPE == "setup"}]
-0.059
These values are device dependent, but they still illustrate a source of difference in the static timing results. Typically a LUT feeds the D pin, adding a primitive delay, but the following quote explains...
Pushing the Logic from the Control Pin to the Data Pin
During analysis of critical paths, you might find multiple paths ending at control pins. You must analyze these paths to determine if there is a way to push the logic into the datapath without incurring penalties, such as extra logic levels. There is less delay in a path to the D pin than CE/R/S pins given the same levels of logic because there is a direct connection from the output of the last LUT to the D input of the FF. The following coding examples show how to push the logic from the control pin to the data pin of a register.
UltraFast Design Methodology Guide for FPGAs and SoCs (UG949)
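The coding examples referred to in the quote are not reproduced here; the sketch below is my own illustration of the idea (the entity and signal names are mine). The first process leaves a two-input condition on the register's control, which synthesis routes to the CE pin through an extra LUT; the second folds the same condition into the data expression so the final LUT drives the D pin directly. This is the mirror image of the transformation power_opt_design performs.
library ieee;
use ieee.std_logic_1164.all;

entity ce_vs_d is
  port (
    clk : in  std_logic;
    a   : in  std_logic;
    b   : in  std_logic;
    d   : in  std_logic;
    q1  : out std_logic := '0';
    q2  : out std_logic := '0'
  );
end entity;

architecture rtl of ce_vs_d is
begin

  -- Condition on the control pin: 'a and b' becomes gating logic feeding CE.
  ctrl_pin : process(clk)
  begin
    if rising_edge(clk) then
      if a = '1' and b = '1' then
        q1 <= d;
      end if;
    end if;
  end process;

  -- Same behaviour with the condition pushed into the datapath; the register
  -- holds its value via explicit feedback (reads the output port, so VHDL-2008).
  data_pin : process(clk)
  begin
    if rising_edge(clk) then
      q2 <= (d and a and b) or (q2 and not (a and b));
    end if;
  end process;

end architecture;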
Conclusions
- Power optimisation moves logic from register D inputs to CE inputs to prevent unnecessary switching. I know the manual says it does, but confirming it can be applied correctly builds confidence.
- The number of control sets can be increased as a result (design dependent).
- Increasing the number of control sets spreads the logic, increasing the power spent driving the clock to more sites, hence the power savings can be reduced.
- Timing is adversely affected by the use of CE pins over D pins, and by changes to clock pessimism.