There are several parts of working with AXI streaming data handshakes that have caused me problems, starting from when I was completely naïve, and they still cause me grief today. Initially the difficulty was setting up a realistic and stressful simulation. Then, once the functional simulation had been stressed, there was always a failure mode, and fixing one problem caused another. Early, simple problems included pipelining a multiplexer to select between two AXI streams, although this was quickly resolved once the AXI Register Slice in the Xilinx IP catalogue had been pointed out. Later, the need to pause a data stream when the decision to lower the ready signal for back pressure was made late just meant running the data into an AXI FIFO from the IP catalogue, so the data would buffer up and be read out in an AXI-compliant way. But what about more complicated examples where you have to get dirty with all the signalling in VHDL? Here are a few pointers, mainly for my own benefit to be honest.
- AXI Shift Register
- Simple Shift Register
- Enhanced Shift Register
- Comparison of Simple and Enhanced AXI Streaming Shift Registers
- Unregistered 'ready'
- Pipelined Hybrid
- Notes on ModelSim Generics
- Pausing an AXI Data Stream
- FSM Application
- Conclusions
- References
AXI Shift Register
There's a basic unit of logic for a single clock cycle delay, or AXI register. Xilinx also have their own "AXI Register Slice" IP core, which is in effect a FIFO. This gives you two options for how to pipeline AXI handshake logic, e.g. a stream multiplexer for a selectable path. The AXI Register Slice is the more fully fledged solution with registered outputs, whereas the design used here does not register the back pressure on the ready line but is cheaper in resources. As always there's a trade-off; this is about presenting a cheaper alternative to the AXI Register Slice when you need to work at the RTL code level.
AXI4-Stream Register Slice
The register slice is a multipurpose pipeline register that is able to isolate timing paths between master and slave. The register slice is designed to trade-off timing improvement with area and latency necessary to form a protocol compliant pipeline stage. Implemented as a two-deep FIFO buffer by default, the register slice supports throttling by the master (channel source) and/or slave (channel destination) as well as back-to-back transfers without incurring unnecessary idle cycles. The module can be independently instantiated at all port boundaries. A configuration parameter allows for the trade off of performance vs. area efficiency, including a mode that adds extra pipeline stages to optimally cross super logic regions (SLR) boundaries in stacked silicon interconnect (SSI) devices.
AXI4-Stream Infrastructure IP Suite v3.0, PG085 November 17, 2021

This is the basic unit of delay we'll work with. Multiple stages will be added in series below to build up a shift register.
Simple Shift Register

Note in this first version the ready line has a single OR gate (or LUT) to control all chip enables.
library ieee;
  use ieee.std_logic_1164.all;
entity axi_delay is
  generic(
    delay_g      : positive;
    data_width_g : positive
  );
  port(
    clk         : in  std_logic;
    s_axi_data  : in  std_logic_vector(data_width_g-1 downto 0);
    s_axi_valid : in  std_logic;
    s_axi_ready : out std_logic;
    m_axi_data  : out std_logic_vector(data_width_g-1 downto 0);
    m_axi_valid : out std_logic;
    m_axi_ready : in  std_logic
  );
end entity;
architecture simple of axi_delay is
  type delay_reg_t is array(delay_g-1 downto 0) of std_logic_vector(data_width_g-1 downto 0);
  signal data_reg  : delay_reg_t := (others => (others => '0'));
  signal valid_reg : std_logic_vector(delay_g-1 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if s_axi_ready = '1' then
        data_reg  <= s_axi_data & data_reg(delay_g-1 downto 1);
        valid_reg <= s_axi_valid & valid_reg(delay_g-1 downto 1);
      end if;
    end if;
  end process;
  m_axi_data  <= data_reg(0);
  m_axi_valid <= valid_reg(0);
  s_axi_ready <= m_axi_ready or not valid_reg(0);
end architecture;
Enhanced Shift Register
The code in this section has been shamelessly plagiarised from another source. An ITDev blog post called "Register ready signals in low latency, zero bubble pipeline" provides a neat solution to pipelining an AXI delay stage. The source code is provided on GitHub by the author Tom Jackson and used here for an experiment.

This time there is an OR gate per delay stage in the ready line, with individually crafted chip enable lines.
architecture itdev of axi_delay is
  type delay_reg_t is array(delay_g-1 downto 0) of std_logic_vector(data_width_g-1 downto 0);
  signal data_reg    : delay_reg_t := (others => (others => '0'));
  signal valid_reg   : std_logic_vector(delay_g-1 downto 0) := (others => '0');
  signal ready_stage : std_logic_vector(delay_g-1 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      for i in delay_g-2 downto 0 loop
        if ready_stage(i) = '1' then
          data_reg(i)  <= data_reg(i+1);
          valid_reg(i) <= valid_reg(i+1);
        end if;
      end loop;
      if ready_stage(delay_g-1) = '1' then
        data_reg(delay_g-1)  <= s_axi_data;
        valid_reg(delay_g-1) <= s_axi_valid;
      end if;
    end if;
  end process;
  m_axi_data  <= data_reg(0);
  m_axi_valid <= valid_reg(0);
  ready_gen : for i in delay_g-1 downto 1 generate
    ready_stage(i) <= ready_stage(i-1) or not valid_reg(i);
  end generate;
  ready_stage(0) <= m_axi_ready or not valid_reg(0);
  s_axi_ready    <= ready_stage(ready_stage'high);
end architecture;
Firstly, the unpipelined version has different ready logic from the simple version above, so I wanted to see how the two compared. Secondly, a pipelined version is provided, which raises the question of how frequently such a pipeline stage needs to be included in the AXI shift register.
Comparison of Simple and Enhanced AXI Streaming Shift Registers

The measured delay is as requested, or one less, because the handshakes at each end are not synchronised. The enhanced version blogged on the ITDev website has a higher throughput by virtue of not exerting back pressure so often. Some might argue the benefit is only a minor improvement based on the simulation waveform shown above. It is worth noting that with constantly valid data and a sink that is always ready, the throughput of both is the same. The differences only show when valid and ready get de-asserted, and hence the difference in throughput between the two designs depends on the burstiness of the handshakes.
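The burstiness described above can be produced in a testbench by randomising both handshakes. What follows is only a minimal sketch, assuming VHDL-2008 and the axi_delay entity listed earlier; the testbench name, the constants, the seeds and the probabilities are all hypothetical choices of mine rather than the stimulus used for the waveform. Swapping "simple" for "itdev" in the instantiation and comparing the reported cycle counts gives a rough throughput comparison.

```vhdl
library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;
  use ieee.math_real.all;

entity tb_axi_delay is
end entity;

architecture sim of tb_axi_delay is
  -- All values below are illustrative assumptions, not taken from the article.
  constant data_width_c : positive := 8;
  constant num_words_c  : positive := 200;
  constant p_valid_c    : real     := 0.7; -- chance the source offers a word
  constant p_ready_c    : real     := 0.5; -- chance the sink is ready
  signal clk         : std_logic := '0';
  signal done        : boolean   := false;
  signal s_axi_data  : std_logic_vector(data_width_c-1 downto 0) := (others => '0');
  signal s_axi_valid : std_logic := '0';
  signal s_axi_ready : std_logic;
  signal m_axi_data  : std_logic_vector(data_width_c-1 downto 0);
  signal m_axi_valid : std_logic;
  signal m_axi_ready : std_logic := '0';
  -- Counter value as a data word, wrapping at the data width
  function to_word(n : natural) return std_logic_vector is
  begin
    return std_logic_vector(to_unsigned(n mod 2**data_width_c, data_width_c));
  end function;
begin
  -- 100 MHz clock that stops once every word has been checked
  clk <= not clk after 5 ns when not done else '0';

  dut : entity work.axi_delay(simple) -- or (itdev)
    generic map (delay_g => 6, data_width_g => data_width_c)
    port map (
      clk         => clk,
      s_axi_data  => s_axi_data,
      s_axi_valid => s_axi_valid,
      s_axi_ready => s_axi_ready,
      m_axi_data  => m_axi_data,
      m_axi_valid => m_axi_valid,
      m_axi_ready => m_axi_ready
    );

  source : process(clk)
    variable seed1 : positive := 11;
    variable seed2 : positive := 23;
    variable rnd   : real;
    variable sent  : natural := 0;
  begin
    if rising_edge(clk) then
      -- valid and data only change when idle or once the word has been taken
      if s_axi_valid = '0' or s_axi_ready = '1' then
        if s_axi_valid = '1' then
          sent := sent + 1;
        end if;
        uniform(seed1, seed2, rnd);
        if rnd < p_valid_c and sent < num_words_c then
          s_axi_valid <= '1';
          s_axi_data  <= to_word(sent);
        else
          s_axi_valid <= '0';
        end if;
      end if;
    end if;
  end process;

  sink : process(clk)
    variable seed1    : positive := 5;
    variable seed2    : positive := 7;
    variable rnd      : real;
    variable expected : natural := 0;
    variable cycles   : natural := 0;
  begin
    if rising_edge(clk) then
      cycles := cycles + 1;
      if m_axi_valid = '1' and m_axi_ready = '1' then
        assert m_axi_data = to_word(expected)
          report "Data lost, duplicated or out of order" severity error;
        expected := expected + 1;
        if expected = num_words_c then
          report "Transferred " & integer'image(expected) & " words in " &
                 integer'image(cycles) & " clock cycles";
          done <= true;
        end if;
      end if;
      -- ready is allowed to toggle freely, so randomise it every cycle
      uniform(seed1, seed2, rnd);
      if rnd < p_ready_c then
        m_axi_ready <= '1';
      else
        m_axi_ready <= '0';
      end if;
    end if;
  end process;
end architecture;
```

Lowering p_valid_c and p_ready_c makes the handshakes burstier, which is where any difference between the two architectures should become visible.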
Unregistered 'ready'

The figure above shows how the ready logic steadily increases in depth; in this example the LUT depth is 7 for a 36-bit AXI shift register. Because each stage requires a unique tap on the ready logic for its chip enable pin, the options for packing logic into the LUTs are more limited. Hence, to maintain the higher throughput, it may be desirable to pipeline the ready line.
Pipelined Hybrid
The thought here is that not every stage of the AXI shift register needs to be pipelined, and there is resource efficiency to be gained by judicious placement of the pipelined stages, titrated against the timing requirement. The plan is to provide a vector such as "100100", which defines a 6-bit shift register with two pipelined stages. The '1' bits are placed so that the ready line, which runs right to left, leaves the component (almost) registered, and the pipeline spacing is 3, i.e. every 3rd stage is pipelined. In reality there is a small amount of residual logic (a depth of 1 LUT) on the ready output.
library ieee;
  use ieee.std_logic_1164.all;
entity axi_delay_stage is
  generic(
    data_width_g : positive
  );
  port(
    clk      : in  std_logic;
    -- Upstream interface
    us_valid : in  std_logic;
    us_data  : in  std_logic_vector(data_width_g-1 downto 0);
    us_ready : out std_logic := '0';
    -- Downstream interface
    ds_valid : out std_logic := '0';
    ds_data  : out std_logic_vector(data_width_g-1 downto 0) := (others => '0');
    ds_ready : in  std_logic
  );
end entity;
architecture rtl_basic of axi_delay_stage is
begin
  process(clk) is
  begin
    if rising_edge(clk) then
      -- Accept data if ready is high
      if us_ready = '1' then
        ds_valid <= us_valid;
        ds_data  <= us_data;
      end if;
    end if;
  end process;
  -- Ready signal with registered ready or primary data register is not valid
  us_ready <= ds_ready or not ds_valid;
end architecture;
architecture rtl_pipe of axi_delay_stage is
  -- Expansion registers
  signal expansion_data_reg  : std_logic_vector(data_width_g-1 downto 0) := (others => '0');
  signal expansion_valid_reg : std_logic := '0';
  -- Standard registers
  signal primary_data_reg  : std_logic_vector(data_width_g-1 downto 0) := (others => '0');
  signal primary_valid_reg : std_logic := '0';
begin
  process(clk) is
  begin
    if rising_edge(clk) then
      -- Accept data if ready is high
      if us_ready = '1' then
        primary_valid_reg <= us_valid;
        primary_data_reg  <= us_data;
        -- when ds is not ready, accept data into expansion reg until it is valid
        if ds_ready = '0' then
          expansion_valid_reg <= primary_valid_reg;
          expansion_data_reg  <= primary_data_reg;
        end if;
      end if;
      -- When ds becomes ready the expansion reg data is accepted and we must clear the valid register
      if ds_ready = '1' then
        expansion_valid_reg <= '0';
      end if;
    end if;
  end process;
  -- Ready as long as there is nothing in the expansion register
  us_ready <= not expansion_valid_reg;
  -- Selecting the expansion register if it has valid data
  ds_valid <= expansion_valid_reg or primary_valid_reg;
  ds_data  <= expansion_data_reg when expansion_valid_reg else primary_data_reg;
end architecture;
library ieee;
  use ieee.std_logic_1164.all;
entity axi_delay_mixed is
  generic(
    delay_vector_g : std_logic_vector; -- Any '1' bit gives the pipelined version, otherwise non-pipelined.
    data_width_g   : positive
  );
  port(
    clk         : in  std_logic;
    s_axi_data  : in  std_logic_vector(data_width_g-1 downto 0);
    s_axi_valid : in  std_logic;
    s_axi_ready : out std_logic;
    m_axi_data  : out std_logic_vector(data_width_g-1 downto 0);
    m_axi_valid : out std_logic;
    m_axi_ready : in  std_logic
  );
end entity;
architecture structural of axi_delay_mixed is
  constant delay_c : positive := delay_vector_g'length;
  -- Make sure we know the range of this generic for indexing. Aliases cannot be used.
  constant delay_vector_c : std_logic_vector(delay_c-1 downto 0) := delay_vector_g;
  type delay_reg_t is array(delay_c downto 0) of std_logic_vector(data_width_g-1 downto 0);
  signal data_stage  : delay_reg_t := (others => (others => '0'));
  signal valid_stage : std_logic_vector(delay_c downto 0) := (others => '0');
  signal ready_stage : std_logic_vector(delay_c downto 0) := (others => '0');
begin
  m_axi_valid          <= valid_stage(0);
  m_axi_data           <= data_stage(0);
  ready_stage(0)       <= m_axi_ready;
  valid_stage(delay_c) <= s_axi_valid;
  data_stage(delay_c)  <= s_axi_data;
  s_axi_ready          <= ready_stage(delay_c);
  delay_g : for i in delay_vector_c'range generate
    pipe_g : if delay_vector_c(i) = '1' generate
      axi_delay_stage_i : entity work.axi_delay_stage(rtl_pipe)
        generic map (
          data_width_g => data_width_g
        )
        port map (
          clk      => clk,
          us_valid => valid_stage(i+1),
          us_data  => data_stage(i+1),
          us_ready => ready_stage(i+1),
          ds_valid => valid_stage(i),
          ds_data  => data_stage(i),
          ds_ready => ready_stage(i)
        );
    else generate
      axi_delay_stage_i : entity work.axi_delay_stage(rtl_basic)
        generic map (
          data_width_g => data_width_g
        )
        port map (
          clk      => clk,
          us_valid => valid_stage(i+1),
          us_data  => data_stage(i+1),
          us_ready => ready_stage(i+1),
          ds_valid => valid_stage(i),
          ds_data  => data_stage(i),
          ds_ready => ready_stage(i)
        );
    end generate;
  end generate;
end architecture;
It is worth noting that in simulation the addition of a pipeline stage into the shift register can cause the delay line to absorb one more data value than requested in the delay_vector_g generic; the extra value is held in the expansion register. Adding a high density of pipeline stages can absorb additional values, but spreading them out brings the excess back to one at most.
From a series of synthesis results using an xc7k70tfbv676-1 part and varying the pipeline spacing, I was able to build the following chart. As Vivado is goal driven, a target clock speed of 400 MHz was used in the out-of-context constraints. The measured slack was then converted back to an "Fmax".

For a 16-bit pipeline spacing, the logic depth is 4, resulting in a clock speed of just over 400 MHz. The results are only indicative of what can be achieved, but clearly a pipeline spacing of 16 bits gives good results without pairing every register with an expansion register at each delay stage.
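Writing out a long delay vector by hand for each pipeline spacing is error prone, so the pattern can be generated instead. The package below is a minimal sketch of that idea; the package name axi_delay_pkg and the function pipeline_vector are hypothetical names of my own, and the sketch assumes the leftmost (upstream) stage should always be a pipelined one so that the ready output is the (almost) registered one.

```vhdl
library ieee;
  use ieee.std_logic_1164.all;

package axi_delay_pkg is
  -- Hypothetical helper: a vector of 'stages' bits with a '1' (pipelined stage)
  -- at the leftmost bit and then at every 'spacing' bits after it.
  function pipeline_vector(stages : positive; spacing : positive) return std_logic_vector;
end package;

package body axi_delay_pkg is
  function pipeline_vector(stages : positive; spacing : positive) return std_logic_vector is
    variable v : std_logic_vector(stages-1 downto 0) := (others => '0');
  begin
    for i in v'range loop
      -- Distance from the leftmost (upstream) bit decides whether to pipeline
      if (stages-1-i) mod spacing = 0 then
        v(i) := '1';
      end if;
    end loop;
    return v;
  end function;
end package body;
```

With this, pipeline_vector(6, 3) returns "100100", and a wrapper could instantiate axi_delay_mixed with delay_vector_g => pipeline_vector(32, 16) for a 32-bit delay with a pipeline spacing of 16.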
Notes on ModelSim Generics
You might try to specify the value of the std_logic_vector generic as a VHDL vector, e.g. "10..00". This will cause Vivado to steadily use up all your PC's memory until the application crashes. It turns out from a support topic, "vivado pass a GENERIC std_logic_vector from tcl script", that you must use Verilog vector syntax, which is a bit awkward for those unfamiliar with Verilog like me. Here's an example in hexadecimal, but you could use "6'b100100" to specify the value in binary. There is clearly no syntax checking on the values supplied to generics in Vivado, which I find unhelpful when you consider the tool can parse both VHDL and Verilog languages.
set_property generic {delay_vector_g=32'h80008000 data_width_g=1} [current_fileset]
Pausing an AXI Data Stream

The key point is to gate the valid and ready lines in such a way as to not violate the behaviour the AXI standard allows on the valid line: once valid goes high it must not go low until ready has been high and the data value has been read. I've called this an "AXI Pause" stage, but called the control input "enable" simply because of the simplicity of adding an AND gate into the logic.
library ieee;
  use ieee.std_logic_1164.all;
entity axi_pause is
  generic(
    data_width_g : positive
  );
  port(
    clk         : in  std_logic;
    s_axi_data  : in  std_logic_vector(data_width_g-1 downto 0);
    s_axi_valid : in  std_logic;
    s_axi_ready : out std_logic := '0';
    enable      : in  std_logic;
    m_axi_data  : out std_logic_vector(data_width_g-1 downto 0) := (others => '0');
    m_axi_valid : out std_logic := '0';
    m_axi_ready : in  std_logic
  );
end entity;
architecture rtl of axi_pause is
begin
  s_axi_ready <= (m_axi_ready or not m_axi_valid) and enable;
  process(clk)
  begin
    if rising_edge(clk) then
      if m_axi_ready = '1' or m_axi_valid = '0' then
        m_axi_data  <= s_axi_data;
        m_axi_valid <= s_axi_valid and enable;
      end if;
    end if;
  end process;
end architecture;
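The rule on the valid line lends itself to an automatic check in simulation. Below is a minimal, simulation-only sketch of such a monitor; the entity name axi_valid_checker and its port names are hypothetical. As well as the rule stated above, it checks that the data is held steady while a word is waiting, which is my addition based on the AXI4-Stream handshake rules rather than something this article covers.

```vhdl
library ieee;
  use ieee.std_logic_1164.all;

entity axi_valid_checker is
  generic(
    data_width_g : positive
  );
  port(
    clk       : in std_logic;
    axi_data  : in std_logic_vector(data_width_g-1 downto 0);
    axi_valid : in std_logic;
    axi_ready : in std_logic
  );
end entity;

architecture sim of axi_valid_checker is
begin
  process(clk)
    -- True when the previous clock edge saw valid high without ready
    variable stalled  : boolean := false;
    variable data_was : std_logic_vector(data_width_g-1 downto 0);
  begin
    if rising_edge(clk) then
      if stalled then
        assert axi_valid = '1'
          report "valid dropped before ready accepted the data" severity error;
        assert axi_data = data_was
          report "data changed while waiting for ready" severity error;
      end if;
      stalled  := (axi_valid = '1') and (axi_ready = '0');
      data_was := axi_data;
    end if;
  end process;
end architecture;
```

In a testbench, one instance would be connected to m_axi_data, m_axi_valid and m_axi_ready of the axi_pause stage while the enable input is toggled.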

FSM Application
One of the places I have struggled to work with AXI is when a finite state machine needs to be used to control data in the stream. Typically the AXI handshakes and timing of data fail under intense stressing, e.g. with OSVVM. Presently I'm using my own style of stimulus, which remains sufficient if not as efficient or as well supplied with metrics as OSVVM. The enable signal of the AXI Pause stage then provides a focus around which to arrange the timing of data modifications. I also use the Mealy form of state machine because it reacts faster to the inputs: the outputs or actions are assigned on the "arcs" or "edges" between states and are in effect by the time you enter the next state, rather than a clock cycle after entering it (Moore), and hence are less tardy.
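To illustrate assigning actions on the arcs, here is a minimal sketch of a controller for the enable input of axi_pause. It is not the state machine from any of my designs; the entity name axi_burst_gate and the generics burst_g and gap_g are hypothetical. It lets burst_g words through and then holds the stream paused for gap_g clock cycles, the sort of gap in which a header or trailer could be inserted downstream.

```vhdl
library ieee;
  use ieee.std_logic_1164.all;

entity axi_burst_gate is
  generic(
    burst_g : positive; -- words to pass before pausing
    gap_g   : positive  -- clock cycles to keep the stream paused
  );
  port(
    clk         : in  std_logic;
    s_axi_valid : in  std_logic;
    s_axi_ready : in  std_logic; -- as driven by axi_pause
    enable      : out std_logic := '1'
  );
end entity;

architecture rtl of axi_burst_gate is
  type state_t is (pass, hold);
  signal state : state_t := pass;
begin
  process(clk)
    variable count : natural := 0;
  begin
    if rising_edge(clk) then
      case state is
        when pass =>
          -- Count words actually transferred into the pause stage
          if s_axi_valid = '1' and s_axi_ready = '1' then
            if count = burst_g-1 then
              -- Action on the arc: enable is already low on entry to 'hold'
              count  := 0;
              state  <= hold;
              enable <= '0';
            else
              count := count + 1;
            end if;
          end if;
        when hold =>
          if count = gap_g-1 then
            -- Action on the arc: enable is already high on entry to 'pass'
            count  := 0;
            state  <= pass;
            enable <= '1';
          else
            count := count + 1;
          end if;
      end case;
    end if;
  end process;
end architecture;
```

Because enable is lowered on the same clock edge that accepts the last word of the burst, s_axi_ready is already low in the following cycle and no extra word slips through. A Moore version that derived enable from the state register would lower it one cycle later.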
Conclusions
These are just some basic techniques I have been using to make my professional life working with AXI handshakes easier. I'm just keen to reduce the amount of time I spend battling with VHDL code and simulations when performing non-trivial manipulations of data streams, such as adding headers and trailers to datagrams of content. With thanks to colleagues for volunteering their solutions.
References
2 comments
Comment from: Erik Visitor
Hi Philip,
Let me thank you for such a wonderful post. I’m starting to battle with managing upstream and downstream axi connections and I felt like missing something, and this post has just nailed it.
Just a quick question: In the pipelined solution, you say you’re registering the ready signal. Does that occur in line 82 where us_ready <= not expansion_valid_reg;, because expansion_valid_reg is assigned synchronously?
Comment from: philip Member
Hi Erik,
The line of code you point to is a pure signal (wire) assignment outside a clocked process and hence there’s no register on that line. The ready signal takes the output from the register on line 68. It then inverts it before it leaves the component. Strictly speaking that’s not purely registered, but for most cases it is good enough. If you want a fully registered solution take a look at the AXI Register Slice from Xilinx.
I’ve referred to the ITDev blog post many times when working with AXI-S. I use their non-registered solution all the time as it is so simple. I’m glad to have been of use, but I can’t take credit for the original article.
Philip