- Xilinx's Claims Under Test
- RAM Inferencing
- Measuring RAM Performance
- Limiting The Cascade
- Automating Results Production
- Conclusions
- Pointers on RAM Inferencing
- References
Xilinx's Claims Under Test
"UltraScale architecture-based devices provide the capability to cascade data out from one RAMB36 to the next RAMB36 serially to make a deeper block RAM in a bottom-up fashion. The data out cascading feature is supported for all RAMB36 port widths. The block RAM cascade supports all the features supported by the RAMB36E2 module."
"The data outputs of the lower to upper adjacent block RAMs can be cascaded to build large block RAM blocks. Optional pipeline registers are available to support maximum performance."
"RECOMMENDED: The output data path has an optional internal pipeline register. Using the register mode is strongly recommended. This allows a higher clock rate. However, it adds a clock cycle latency of one."
Extracts from "UltraScale Architecture Memory Resources" (UG573, v1.12), March 17, 2021.
RAM Inferencing
To evaluate the performance of cascaded Block RAMs I need an easy way to create large memories. Vivado already provides a template for RAM inference, and inferencing is by far the easiest way to create larger memories: synthesis builds quite sensible structures out of smaller Block RAMs without any hand-crafted structural VHDL. That's a lot of work one does not need to do. The following is based on the template suggested by Vivado, but I like to think my version is an improvement; for example, why provide a ceil(log2(..)) function to derive the address width from a depth when you can simply specify the number of address bits directly?
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity my_ram is
  generic (
    -- Note: If the chosen data and address width values are low, synthesis will infer Distributed RAM.
    ram_width_g       : integer := 36;  -- Specify RAM data width
    ram_addr_g        : integer := 15;  -- Specify RAM address width (number of entries = 2**ram_addr_g)
    output_register_g : boolean := true -- True for higher clock speed or false for lower latency
  );
  port (
    addra  : in  std_logic_vector(ram_addr_g-1 downto 0);  -- Port A Address bus, width set by ram_addr_g
    addrb  : in  std_logic_vector(ram_addr_g-1 downto 0);  -- Port B Address bus, width set by ram_addr_g
    dina   : in  std_logic_vector(ram_width_g-1 downto 0); -- Port A RAM input data
    dinb   : in  std_logic_vector(ram_width_g-1 downto 0); -- Port B RAM input data
    clka   : in  std_logic;                                -- Port A Clock
    clkb   : in  std_logic;                                -- Port B Clock
    wea    : in  std_logic;                                -- Port A Write enable
    web    : in  std_logic;                                -- Port B Write enable
    ena    : in  std_logic;                                -- Port A RAM Enable; for additional power savings, disable port when not in use
    enb    : in  std_logic;                                -- Port B RAM Enable; for additional power savings, disable port when not in use
    rsta   : in  std_logic;                                -- Port A Output reset (does not affect memory contents)
    rstb   : in  std_logic;                                -- Port B Output reset (does not affect memory contents)
    regcea : in  std_logic;                                -- Port A Output register enable
    regceb : in  std_logic;                                -- Port B Output register enable
    douta  : out std_logic_vector(ram_width_g-1 downto 0); -- Port A RAM output data
    doutb  : out std_logic_vector(ram_width_g-1 downto 0)  -- Port B RAM output data
  );
end entity;
architecture inferred of my_ram is

  -- Xilinx True Dual Port RAM, No Change, Dual Clock
  -- This code implements a parameterisable true dual port memory (both ports can read and write).
  -- This is a "no change" RAM which retains the last read value on the output during writes,
  -- which is the most power efficient mode.
  -- If a reset or enable is not necessary, it may be tied off or removed from the code.

  signal ram_data_a : std_logic_vector(ram_width_g-1 downto 0);
  signal ram_data_b : std_logic_vector(ram_width_g-1 downto 0);

  -- 2D Array Declaration for RAM
  type ram_type is array((2**ram_addr_g)-1 downto 0) of std_logic_vector(ram_width_g-1 downto 0);

  -- Define RAM - Do not use VHDL protected types for synthesis!
  -- ERROR: [Synth 8-6750] Unsupported VHDL type protected. This is not suited for Synthesis [.../inferred_ram.vhdl:xx]
  -- WARNING: [Synth 8-4747] shared variables must be of a protected type [.../inferred_ram.vhdl:xx]
  shared variable ram : ram_type := (others => (others => '0'));

begin

  process(clka)
  begin
    if rising_edge(clka) then
      if ena = '1' then
        if wea = '1' then
          ram(to_integer(unsigned(addra))) := dina;
        else
          ram_data_a <= ram(to_integer(unsigned(addra)));
        end if;
      end if;
    end if;
  end process;

  process(clkb)
  begin
    if rising_edge(clkb) then
      if enb = '1' then
        if web = '1' then
          ram(to_integer(unsigned(addrb))) := dinb;
        else
          ram_data_b <= ram(to_integer(unsigned(addrb)));
        end if;
      end if;
    end if;
  end process;

  output_register : if output_register_g generate

    -- The following code generates HIGH_PERFORMANCE (use output register):
    -- a 2 clock cycle read latency with improved clock-to-out timing.

    process(clka)
    begin
      if rising_edge(clka) then
        if rsta = '1' then
          douta <= (others => '0');
        elsif regcea = '1' then
          douta <= ram_data_a;
        end if;
      end if;
    end process;

    process(clkb)
    begin
      if rising_edge(clkb) then
        if rstb = '1' then
          doutb <= (others => '0');
        elsif regceb = '1' then
          doutb <= ram_data_b;
        end if;
      end if;
    end process;

  else generate

    -- The following code generates LOW_LATENCY (no output register):
    -- a 1 clock cycle read latency at the cost of longer clock-to-out timing.

    douta <= ram_data_a;
    doutb <= ram_data_b;

  end generate;

end architecture;
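As an aside on the address-width generic: the arithmetic behind preferring a bit-count generic over a depth generic can be sketched in Python (the helper name here is mine, purely for illustration, not part of the Vivado template):

```python
from math import ceil, log2

def addr_bits(depth):
    """Address bits needed to index 'depth' entries - the ceil(log2(..))
    helper the Vivado template carries around."""
    return ceil(log2(depth))

# Specifying the address width directly keeps the depth a power of two
# by construction: ram_addr_g = 15 gives 2**15 = 32768 entries.
print(2 ** 15)            # 32768
print(addr_bits(32768))   # 15
print(addr_bits(40000))   # 16 - a non-power-of-two depth rounds up to 65536 entries anyway
```

Since a Block RAM implementation rounds any requested depth up to the underlying primitive sizes regardless, stating the address width directly loses nothing and drops a helper function.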
This is what a large cascaded group of Block RAMs looks like when creating larger memories (address width of 15 bits).

There are two columns of registers on the outputs that did not get packed into the Block RAMs by Vivado. This is not always the case. The main point to note about this RTL code is the mismatch with the Xilinx documentation in UG573.
Measuring RAM Performance
We'll calculate a maximum clock speed from the worst-case setup slack using the following TCL in Vivado.
set maxSetup [get_property SLACK \
[get_timing_paths -max_paths 1 -nworst 1 -setup]]
set maxClkPeriod [expr [get_property REQUIREMENT \
[get_timing_paths -max_paths 1 -nworst 1 -setup]] - $maxSetup]
# MHz
set maxClkFreq [expr 1/($maxClkPeriod * 1e-9) / 1e6]
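For clarity, the same arithmetic expressed in Python (a sketch only; the function name and the example figures are mine, chosen for illustration): the achievable period is the constraint (REQUIREMENT) minus the worst-case setup slack, then nanoseconds convert to megahertz.

```python
def max_clk_freq_mhz(requirement_ns, worst_setup_slack_ns):
    """Mirror of the Tcl above: achievable clock period is the timing
    constraint minus the worst-case setup slack; convert ns to MHz."""
    max_clk_period_ns = requirement_ns - worst_setup_slack_ns
    return 1 / (max_clk_period_ns * 1e-9) / 1e6  # equivalently 1000 / period_ns

# e.g. a 2.0 ns constraint met with 0.028 ns of positive slack:
print(round(max_clk_freq_mhz(2.0, 0.028), 1))  # 507.1
```

Note that negative slack works too: the subtraction then lengthens the period, reporting the slower clock the design would actually meet.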
We'll use the RAM inferencing VHDL above and take a smaller test case that uses just 4 Block RAMs. We'll initially assume we're aiming for performance, so we want the second optional pipeline stage at the cost of an extra clock cycle of latency. So, given the VHDL for the RAM inferencing, we have requested the output register for performance, an address width of 12 bits, and a data width of 36 bits (the RAMB36E2 primitives are naturally 36 bits wide; using fewer bits means the synthesis tool gets clever with using fewer Block RAMs, stuffing bits in funny places and messing up the picture of the simple cascade).
set_property generic \
{ram_width_g=36 ram_addr_g=12 output_register_g=true} \
[current_fileset]
The following two pictures illustrate the two different structures synthesised by Vivado under different conditions. The tables that follow indicate when each structure is synthesised.

Note: An additional 24 LUTs are required for this design (above) over the non-cascaded version (below) to manage all the read and write enables.

The following tables provide the results obtained from Vivado version 2019.1.1 using FPGA part xcku035-sfva784-3-e, just as an example. The row "Data Out Cascade" in each of the following tables indicates which structure was synthesised. The only factor altered in each case was the requested clock period (or clock frequency if you prefer). The tables have column headings by clock frequency as that's more human readable.
| Requested clock frequency (MHz) | 250.0 | 400.0 | 454.5 | 476.2 | 500.0 |
|---|---|---|---|---|---|
| Requested clock period (ns) | 4.0 | 2.5 | 2.2 | 2.1 | 2.0 |
| Actual clock period (ns) | 1.972 | 1.972 | 1.972 | 1.972 | 1.027 |
| Actual clock frequency (MHz) | 507.1 | 507.1 | 507.1 | 507.1 | 973.7 |
| Output Register | TRUE | TRUE | TRUE | TRUE | TRUE |
| Data Out Cascade | Y | Y | Y | Y | N |
These results tell us that RAM inferencing prefers to cascade the Block RAMs when timing is relaxed; when timing is tight, it simply wires them together at the inputs and outputs.
| Requested clock frequency (MHz) | 250.0 | 400.0 | 454.5 | 476.2 | 500.0 |
|---|---|---|---|---|---|
| Requested clock period (ns) | 4.0 | 2.5 | 2.2 | 2.1 | 2.0 |
| Actual clock period (ns) | 2.495 | 1.635 | 1.635 | 1.635 | 1.635 |
| Actual clock frequency (MHz) | 400.8 | 611.6 | 611.6 | 611.6 | 611.6 |
| Output Register | FALSE | FALSE | FALSE | FALSE | FALSE |
| Data Out Cascade | Y | N | N | N | N |
When the second stage of pipelining is removed from the output registers, there's a not entirely unexpected reduction in clock speed, and the threshold for the switch in structural configuration occurs at a lower clock speed.
Limiting The Cascade
A colleague pointed out an advertised way of limiting the length of any Block RAM cascade. This can be achieved using the following options in Vivado.
set_property STEPS.SYNTH_DESIGN.ARGS.MAX_BRAM_CASCADE_HEIGHT 2 [get_runs synth_1]
# OR
synth_design -max_bram_cascade_height 2 <options>...
# OR better in XDC where the application can be precisely selected
set_property cascade_height 2 [get_cells ram_reg]
The cascade_height attribute can be applied either via VHDL attributes or XDC constraints; the documentation can be found in UG901 (Vivado Design Suite User Guide: Synthesis) under "Synthesis Attributes". I very much favour the fine-grained, precisely selected application within the XDC constraints file here rather than a blanket apply-to-all approach. I also favour constraints over attributes so one project's specific requirements are not mixed up with re-usable code.

Here the use of a synthesis option limits the length of the cascade to 2 Block RAMs. Setting the option to 0 gives the non-cascaded implementation; given my timing results above, this may well be the preferable default. For the record, the partially cascaded design runs at 537.3 MHz, nicely between the slowest for the full cascade at 507.1 MHz and the fastest for no cascade at 973.7 MHz.
Automating Results Production
# source -notrace {/path/this_script.tcl}
proc reportClkSpeed {} {
  # Check for setup violations (-delay_type max)
  set maxSetup [get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]]
  puts "Setup: Get max delay timing path (ns): $maxSetup"
  # Check for hold violations (-delay_type min)
  puts -nonewline "Hold: Get min delay timing path (ns): "
  puts [get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -hold]]
  set maxClkPeriod [expr [get_property REQUIREMENT [get_timing_paths -max_paths 1 -nworst 1 -setup]] - $maxSetup]
  set maxClkFreq [expr 1/($maxClkPeriod * 1e-9) / 1e6]
  # Round to 1 decimal place
  set maxClkFreq [expr {double(round($maxClkFreq * 10)) / 10}]
  puts "Maximum clock period (ns): [format "%0.3f" $maxClkPeriod]"
  puts "Maximum clock frequency (MHz): $maxClkFreq"
}

proc calcClkSpeed {} {
  # Check for setup violations (-delay_type max)
  set maxSetup [get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]]
  set maxClkPeriod [expr [get_property REQUIREMENT [get_timing_paths -max_paths 1 -nworst 1 -setup]] - $maxSetup]
  set maxClkFreq [expr 1/($maxClkPeriod * 1e-9) / 1e6]
  # Round to 1 decimal place
  set maxClkFreq [expr {double(round($maxClkFreq * 10)) / 10}]
  return $maxClkFreq
}

set_property STEPS.SYNTH_DESIGN.ARGS.MAX_BRAM_CASCADE_HEIGHT 2 [get_runs synth_1]

if {[get_property PROGRESS [get_runs synth_1]] == "100%"} {
  reset_run synth_1
}
set_property generic \
{ram_width_g=36 ram_addr_g=12 output_register_g=true} \
[current_fileset]
# RTL Elaboration
synth_design \
-rtl \
-name rtl_1 \
-top my_ram \
-mode out_of_context
# Synthesis
launch_runs synth_1 -jobs 6
wait_on_run synth_1
open_run synth_1 -name synth_1
# Get timing information
report_timing_summary \
-delay_type min_max \
-report_unconstrained \
-check_timing_verbose \
-max_paths 10 \
-input_pins \
-routable_nets \
-name timing_1
reportClkSpeed
Above is the TCL script used to produce the results for the tables in this blog; just amend the generic values as required and source it.
Conclusions
In short:
- The RAM inferencing works well.
- The cascading of signals through the internal pipeline logic is counterproductive.
- The results can be improved by demanding a faster clock speed, causing Vivado to rearrange the design away from cascading and into a faster configuration with slightly less logic.
More detail:
- Vivado does not deliver on its promise. You get faster structures without using any internal paths made available via the cascade pins.
- You would not expect any pipelining within a series of multiplexers anyway; that would upset the timing of the values being returned.
- Decoding serially will always give a delay proportional to the number of Block RAMs in series. Other options are prevented by one multiplexer input being tied to the Block RAM data outputs.
- A tree would be a better structure for decoding as it has a delay proportional to the depth of the tree, i.e. log_p(number of Block RAMs), where p is some power of two for the number of inputs to each multiplexer (2, 4, 8…). Serial decoding was always a poor plan.
- Cascading the Block RAMs is a distraction as that configuration only works for lower clock speeds.
- For the fastest configurations use VHDL RAM inferencing and let the synthesis tool choose the faster implementation.
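To illustrate the serial-versus-tree point above, the number of multiplexer levels on the read path for each scheme can be counted (a back-of-envelope sketch, nothing Vivado-specific; the function names are mine):

```python
def serial_mux_delay(num_brams):
    """2:1 mux levels traversed by the deepest Block RAM in a serial cascade:
    one mux per additional BRAM, so delay grows linearly with depth."""
    return num_brams - 1

def tree_mux_delay(num_brams, p=2):
    """Levels in a balanced tree of p-input multiplexers: ceil(log_p(N)),
    computed with integer arithmetic to avoid floating-point log."""
    levels = 0
    capacity = 1
    while capacity < num_brams:
        capacity *= p
        levels += 1
    return levels

for n in (4, 8, 16):
    print(n, serial_mux_delay(n), tree_mux_delay(n), tree_mux_delay(n, p=4))
# e.g. 16 Block RAMs: 15 serial mux levels vs 4 levels (p=2) or 2 levels (p=4)
```

The gap widens quickly: for the 32-BRAM cascade pictured earlier, serial decoding traverses 31 mux levels where a 4-input tree needs only 3.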
Pointers on RAM Inferencing
- Vivado will warn about the use of shared variables and remind us to use protected types.
- Don't use a protected type as Vivado now issues an error.
- Xilinx Forums suggest using the file type VHDL instead of VHDL 2008 as protected types were introduced in VHDL-2002.
- This did not work either; synthesis still fails to understand the protected type (which was defined in the architecture rather than a package, if that makes any difference).
- You will also have to revert to the old-style generate statements for the optional output register rather than using the VHDL-2008 style if..else..end generate.
- Just stick with a shared variable and ignore the warning like we always do!