When working with AXI data streams, operations often need to be applied in parallel paths, or two parallel paths need to join at a single operation, e.g. a comparator. The code presented here is very simple, but it's useful to have as a reference for copy & paste to avoid having to verify the solution again.
Splitter
library ieee;
  use ieee.std_logic_1164.all;

entity axi_split is
  generic(
    data_width_g : positive
  );
  port(
    s_axi_data   : in  std_logic_vector(data_width_g-1 downto 0);
    s_axi_valid  : in  std_logic;
    s_axi_ready  : out std_logic := '0';
    m1_axi_data  : out std_logic_vector(data_width_g-1 downto 0) := (others => '0');
    m1_axi_valid : out std_logic := '0';
    m1_axi_ready : in  std_logic;
    m2_axi_data  : out std_logic_vector(data_width_g-1 downto 0) := (others => '0');
    m2_axi_valid : out std_logic := '0';
    m2_axi_ready : in  std_logic
  );
end entity;

architecture rtl of axi_split is
  signal backpressure : std_logic;
begin
  m1_axi_data  <= s_axi_data;
  m1_axi_valid <= s_axi_valid and m2_axi_ready;
  m2_axi_data  <= s_axi_data;
  m2_axi_valid <= s_axi_valid and m1_axi_ready;
  s_axi_ready  <= m1_axi_ready and m2_axi_ready;
  -- NB. Invert this logic when using in an assert statement.
  backpressure <= s_axi_valid and not s_axi_ready;
end architecture;
Both outputs must be ready to receive before the input can be acknowledged and consumed. This means that both ready inputs from the two downstream components must be high before the ready output to the upstream component can be raised. To prevent one downstream component from consuming the data before the other is ready, each valid output is gated low by the other output's ready input.
Some applications are sensitive to back pressure, either for timely data delivery, or because of known bugs coded in the AXI-Stream paths. An internal signal is derived to detect back pressure based on the condition that no valid data should be kept waiting by a ready signal. This is for demonstration only. N.B. In order to convert this condition into an assert statement, the condition needs to be inverted.
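As a minimal sketch of that inversion, a simulation-only checker might look like the following. It assumes a clock is available for sampling (the splitter itself has none). The entity name axi_backpressure_check and its port names are hypothetical, chosen for illustration only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical simulation-only checker, not part of the original design.
entity axi_backpressure_check is
  port(
    clk       : in std_logic;
    axi_valid : in std_logic;
    axi_ready : in std_logic
  );
end entity;

architecture sim of axi_backpressure_check is
begin
  -- synthesis translate_off
  process(clk)
  begin
    if rising_edge(clk) then
      -- Back pressure is "valid and not ready". An assert fires when its
      -- condition is FALSE, so the back pressure condition is inverted here.
      assert not (axi_valid = '1' and axi_ready = '0')
        report "AXI-S back pressure: valid data kept waiting by ready"
        severity warning;
    end if;
  end process;
  -- synthesis translate_on
end architecture;
```

Sampling on the clock edge avoids reporting glitches while the combinatorial valid and ready signals settle. Use severity error instead of warning if back pressure should stop the simulation.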
Joiner
library ieee;
  use ieee.std_logic_1164.all;

entity axi_join is
  generic(
    data_width_g : positive
  );
  port(
    clk          : in  std_logic;
    s1_axi_data  : in  std_logic_vector(data_width_g-1 downto 0);
    s1_axi_valid : in  std_logic;
    s_axi_ready  : out std_logic := '0'; -- For both ports 1 & 2
    s2_axi_data  : in  std_logic_vector(data_width_g-1 downto 0);
    s2_axi_valid : in  std_logic;
    m_axi_data   : out std_logic_vector(2*data_width_g-1 downto 0) := (others => '0');
    m_axi_valid  : out std_logic := '0';
    m_axi_ready  : in  std_logic
  );
end entity;

architecture rtl_reg of axi_join is
  signal backpressure1 : std_logic;
  signal backpressure2 : std_logic;
begin

  process(clk)
  begin
    if rising_edge(clk) then
      if s_axi_ready = '1' then
        m_axi_data  <= s1_axi_data & s2_axi_data;
        m_axi_valid <= s1_axi_valid and s2_axi_valid;
      elsif m_axi_ready = '1' and m_axi_valid = '1' then
        m_axi_valid <= '0';
      end if;
    end if;
  end process;

  s_axi_ready <= (m_axi_ready or not m_axi_valid) and s1_axi_valid and s2_axi_valid;
  -- The following works in simulation too, but without the 'or not m_axi_valid' clause it takes
  -- a few extra clock cycles, i.e. lower throughput:
  -- s_axi_ready <= m_axi_ready and s1_axi_valid and s2_axi_valid;

  -- synthesis translate_off
  -- NB. Invert this logic when using in an assert statement.
  backpressure1 <= s1_axi_valid and not s_axi_ready after 1 ps;
  backpressure2 <= s2_axi_valid and not s_axi_ready after 1 ps;
  -- synthesis translate_on
end architecture;
architecture rtl_comb of axi_join is
  signal backpressure1 : std_logic;
  signal backpressure2 : std_logic;
begin
  m_axi_valid <= s1_axi_valid and s2_axi_valid;
  s_axi_ready <= s1_axi_valid and s2_axi_valid and m_axi_ready;
  m_axi_data  <= s1_axi_data & s2_axi_data;

  -- synthesis translate_off
  -- NB. Invert this logic when using in an assert statement.
  backpressure1 <= s1_axi_valid and not s_axi_ready after 1 ps;
  backpressure2 <= s2_axi_valid and not s_axi_ready after 1 ps;
  -- synthesis translate_on
end architecture;
Both inputs must present valid data before the ready line can acknowledge consumption of the data on the inputs. This joiner simply concatenates a word from each input into a single wider output word, with input 1 in the upper half and input 2 in the lower half. The first solution is based on a single AXI-S delay, modified to cope with two separate incoming data valid signals. The second is purely combinatorial logic. Both solutions have an unregistered ready signal.
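Since the entity has two architectures, one way to choose between them is direct entity instantiation with the architecture name in parentheses. The wrapper below is only a sketch: the entity name axi_join_select, the generic registered_g and the default width of 8 are hypothetical, and it assumes axi_join has been compiled into library work. Note that both architectures read their own output ports, which needs VHDL-2008.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper for illustration; not part of the original design.
entity axi_join_select is
  generic(
    data_width_g : positive := 8;    -- assumed default
    registered_g : boolean  := true  -- true: rtl_reg, false: rtl_comb
  );
  port(
    clk          : in  std_logic;
    s1_axi_data  : in  std_logic_vector(data_width_g-1 downto 0);
    s1_axi_valid : in  std_logic;
    s_axi_ready  : out std_logic;
    s2_axi_data  : in  std_logic_vector(data_width_g-1 downto 0);
    s2_axi_valid : in  std_logic;
    m_axi_data   : out std_logic_vector(2*data_width_g-1 downto 0);
    m_axi_valid  : out std_logic;
    m_axi_ready  : in  std_logic
  );
end entity;

architecture rtl of axi_join_select is
begin

  -- Registered output: one clock cycle of latency.
  reg_g : if registered_g generate
    join_i : entity work.axi_join(rtl_reg)
      generic map(data_width_g => data_width_g)
      port map(
        clk          => clk,
        s1_axi_data  => s1_axi_data,
        s1_axi_valid => s1_axi_valid,
        s_axi_ready  => s_axi_ready,
        s2_axi_data  => s2_axi_data,
        s2_axi_valid => s2_axi_valid,
        m_axi_data   => m_axi_data,
        m_axi_valid  => m_axi_valid,
        m_axi_ready  => m_axi_ready
      );
  end generate;

  -- Purely combinatorial: no latency, clk is unused inside.
  comb_g : if not registered_g generate
    join_i : entity work.axi_join(rtl_comb)
      generic map(data_width_g => data_width_g)
      port map(
        clk          => clk,
        s1_axi_data  => s1_axi_data,
        s1_axi_valid => s1_axi_valid,
        s_axi_ready  => s_axi_ready,
        s2_axi_data  => s2_axi_data,
        s2_axi_valid => s2_axi_valid,
        m_axi_data   => m_axi_data,
        m_axi_valid  => m_axi_valid,
        m_axi_ready  => m_axi_ready
      );
  end generate;

end architecture;
```

Only one of the two generate blocks is elaborated, so the outputs have a single driver either way.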
Ensemble

The plan here is to demonstrate both in action by splitting a basic stream of numbers to feed two different loads, and joining them again so that the combined data should contain two copies of the same number and successive data values produce a numerically increasing sequence. For the loads I have taken the AXI-S delay or shift register component which comes in two flavours, simple and itdev, each with slightly different behaviour. The simple architecture is assigned to load 1 and the itdev architecture is assigned to load 2. The simple architecture has a slightly lower throughput than the itdev one, so the itdev path will be slowed down to match the simple one. The back pressure at the sink should generally match the back pressure at the source, since the simple path is the rate determining step and the joiner synchronises the two streams. The exceptions to this are when the loads introduce dead cycles.
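The expected property at the sink, two identical halves and an increasing sequence, can also be checked automatically rather than by eye. The following is a sketch only: the entity name axi_join_sink_check and the default width are hypothetical, and it assumes the source counter does not wrap around during the simulation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical simulation-only checker, not part of the original test bench.
entity axi_join_sink_check is
  generic(
    data_width_g : positive := 8  -- assumed default
  );
  port(
    clk         : in std_logic;
    m_axi_data  : in std_logic_vector(2*data_width_g-1 downto 0);
    m_axi_valid : in std_logic;
    m_axi_ready : in std_logic
  );
end entity;

architecture sim of axi_join_sink_check is
begin
  process(clk)
    variable first_v : boolean := true;
    variable last_v  : unsigned(data_width_g-1 downto 0) := (others => '0');
    variable hi_v    : unsigned(data_width_g-1 downto 0);
    variable lo_v    : unsigned(data_width_g-1 downto 0);
  begin
    if rising_edge(clk) then
      -- Only check words that are actually transferred to the sink.
      if m_axi_valid = '1' and m_axi_ready = '1' then
        hi_v := unsigned(m_axi_data(2*data_width_g-1 downto data_width_g)); -- from load 1
        lo_v := unsigned(m_axi_data(data_width_g-1 downto 0));              -- from load 2
        assert hi_v = lo_v
          report "Joined halves differ: the two paths are out of step"
          severity error;
        -- Assumes the source counter never wraps within the run.
        assert first_v or (hi_v > last_v)
          report "Joined data is not an increasing sequence"
          severity error;
        last_v  := hi_v;
        first_v := false;
      end if;
    end if;
  end process;
end architecture;
```

Connect it to the joiner's m_axi_* signals in the test bench. A dropped or duplicated word on either path shows up as one of the two assertion failures.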

The simulation results above show this to be true, with three exceptions explained by the sink not being ready to receive data. On two of those occasions, the slower load took the opportunity to catch up with the faster load.
Conclusions
These are not tricky problems to solve, but AXI is never trivial. Those who underestimate what it takes to get even the simplest AXI variant, AXI-Streaming, correct will make mistakes. Hence pre-tested solutions like these are always useful to avoid repeating one's mistakes anew. I would not personally bother to retain the componentised format of these solutions but instead copy & paste them where I need them, e.g. as concurrent signal assignments or a process plus assignment.