Investigating Xilinx AXI IP and Registered Outputs
My own RTL versions of AXI split and join components did not register the tready outputs. How does Xilinx's own IP manage the registering? According to the AMBA AXI Protocol Specification:

"On Manager and Subordinate interfaces, there must be no combinatorial paths between input and output signals."
How is this achieved, given that pipelining the tready output so often leaves the response one clock cycle late?
Splitter
The first thing I notice is the confusion caused by Xilinx's IP configuration tool.

The graphic shows separate ports with excessive widths that do not correspond to the values entered into the configurator. Nor does the graphic match the generated VHDL shown below, although the VHDL does conform to what was requested. This known issue was initially a distraction from getting started.
COMPONENT axis_broadcaster
  PORT (
    aclk          : IN  STD_LOGIC;
    aresetn       : IN  STD_LOGIC;
    s_axis_tvalid : IN  STD_LOGIC;
    s_axis_tready : OUT STD_LOGIC;
    s_axis_tdata  : IN  STD_LOGIC_VECTOR(15 DOWNTO 0);
    m_axis_tvalid : OUT STD_LOGIC_VECTOR(1 DOWNTO 0);
    m_axis_tready : IN  STD_LOGIC_VECTOR(1 DOWNTO 0);
    m_axis_tdata  : OUT STD_LOGIC_VECTOR(31 DOWNTO 0)
  );
END COMPONENT;
The AXI Broadcast IP provides the equivalent function of my AXI Splitter. The Xilinx IP does not register the outputs.


Joiner
The AXI Combiner IP provides the equivalent function of my AXI Joiner. The Xilinx IP does not register the outputs.


ITDev
There is an excellent article, "Register ready signals in low latency, zero bubble pipeline" on ITDev, explaining how to achieve a pipelined tready output. I note however that their solution does not have registered outputs since there is an inverter after the tready register before the output port, and multiplexers after the tdata and tvalid registers. But still, the tready path has been broken into separate timing sections.
This raises a subtle point about the AXI specification. It does not state that outputs should be 'registered', meaning the output is sourced directly from a register's Q output. It states a looser condition, that "there must be no combinatorial paths between input and output signals". This means there need only be a register somewhere in the path from input to output. The ITDev solution easily meets this condition, and it does ease timing closure even if the output is not 'registered'.
Easier Solutions

Xilinx provides a useful piece of AXI IP called a "register slice". The design above has been coded in axi_split_join_ip.vhdl, but its structural nature means it does not lend itself to being presented inline here. AXI register slices have been inserted either side of the two AXI delay loads, one of which is the non-pipelined simple version of the AXI delay from the ITDev blog post. I've pulled out the full path for the tready signal. Remember that this signal travels from right to left in our diagram to provide back pressure on the input, hence the order of the components is reversed.

The pink-coloured primitives are the registers, all the others being combinatorial (LUTs). Borrowed from Working With AXI Streaming Data, "AXI Delay 1" is the lower throughput but simplest delay implementation, hence only a single LUT in the tready path. "AXI Delay 2" is the higher throughput but more involved delay implementation, with multiple LUTs in the tready path. The AXI Register Slices have fully registered outputs on all paths and make for a neat solution to pipelining AXI stream paths in general.
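To make "fully registered outputs on all paths" concrete, here is a minimal hand-written sketch of a skid-buffer style register slice. It is not the Xilinx implementation and not the ITDev design. The entity name, the DATA_WIDTH generic and the internal signal names are my own assumptions for illustration, and the port names simply follow the axis_broadcaster component shown earlier. The idea is that tready, tvalid and tdata are each driven directly from a register, with a second "skid" register catching the one beat that arrives while the registered tready is still high.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: a sketch only, not the Xilinx AXI Register Slice.
entity axis_reg_slice_sketch is
  generic (
    DATA_WIDTH : positive := 16  -- assumed width, matching the 16-bit input above
  );
  port (
    aclk          : in  std_logic;
    aresetn       : in  std_logic;  -- active low, treated as synchronous here
    s_axis_tvalid : in  std_logic;
    s_axis_tready : out std_logic;
    s_axis_tdata  : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
    m_axis_tvalid : out std_logic;
    m_axis_tready : in  std_logic;
    m_axis_tdata  : out std_logic_vector(DATA_WIDTH - 1 downto 0)
  );
end entity axis_reg_slice_sketch;

architecture rtl of axis_reg_slice_sketch is
  signal s_ready_r  : std_logic := '0';  -- registered upstream ready
  signal m_valid_r  : std_logic := '0';  -- registered downstream valid
  signal m_data_r   : std_logic_vector(DATA_WIDTH - 1 downto 0) := (others => '0');
  signal skid_valid : std_logic := '0';  -- '1' when a beat is parked in skid_data
  signal skid_data  : std_logic_vector(DATA_WIDTH - 1 downto 0) := (others => '0');
begin

  -- Every output comes straight from a register, with no LUT after it.
  s_axis_tready <= s_ready_r;
  m_axis_tvalid <= m_valid_r;
  m_axis_tdata  <= m_data_r;

  process (aclk)
  begin
    if rising_edge(aclk) then
      if aresetn = '0' then
        s_ready_r  <= '0';
        m_valid_r  <= '0';
        skid_valid <= '0';
      elsif m_valid_r = '0' or m_axis_tready = '1' then
        -- Output register is empty or is being read this cycle, so it can
        -- be reloaded and the skid register is guaranteed empty afterwards.
        s_ready_r <= '1';
        if skid_valid = '1' then
          -- Drain the parked beat first to preserve ordering.
          m_data_r   <= skid_data;
          m_valid_r  <= '1';
          skid_valid <= '0';
        elsif s_axis_tvalid = '1' and s_ready_r = '1' then
          -- Normal flow: input beat goes straight to the output register.
          m_data_r  <= s_axis_tdata;
          m_valid_r <= '1';
        else
          m_valid_r <= '0';
        end if;
      elsif s_axis_tvalid = '1' and s_ready_r = '1' then
        -- Output is stalled but tready was already high: park the beat
        -- and drop tready so nothing else is accepted.
        skid_data  <= s_axis_tdata;
        skid_valid <= '1';
        s_ready_r  <= '0';
      end if;
    end if;
  end process;

end architecture rtl;
```

The price of registering tready directly is the second data register: because the upstream side only learns about a stall one cycle later, one extra beat has to be stored somewhere. In this sketch s_ready_r is only ever high when skid_valid is low, which is what keeps the parked beat from being overwritten.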
Conclusions
- The tready signal appears to cause the most grief when it comes to pipelining AXI stream paths. Pipelining causes responses to be one cycle too late.
- The AXI specification does not insist all outputs are registered, just that a register appears somewhere in the path from input to output.
- Xilinx IP does not always obey the AXI specification.
- The AXI Register Slice is a neat general solution to pipelining.
I think those statements need unpacking a bit more. I do not consider it necessary for the AXI Broadcast, AXI Combiner, and other simple AXI IP Cores to be registered, because:
- Unregistered LUTs can be subsumed into other incomplete LUTs across hierarchy boundaries by optimisation, which saves logic depth and so reduces the need for pipelining in the first place.
- An AXI Register Slice is trivial to include in the path if required for timing closure, so registering should be optional.