This function is of interest because it allows two different implementations to be compared: one coded iteratively and the other tail recursively.
Recursion makes a program more readable, but it can give poor performance. Iterative procedures give good performance but are not as readable and may require a local variable to store an intermediate value (mutability). Using tail recursion you get the best of both worlds and no "sum" variable is needed (immutability). This is very useful when calculating sums or factorials of large numbers, because you will never get a stack overflow exception as you simply forward the result to the next recursive function call.
In a mathematical sense, they're interchangeable: all iterative solutions can be written with recursion, and all recursive solutions can be expressed iteratively.
Practically, recursive solutions tend to be cleaner but are sometimes less readily understood, and unless your language supports tail recursion optimisation, recursion will be slow or wasteful, as you generate a huge number of function calls and consume more stack space than you would with an iterative solution.
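To make the tail-recursive form concrete, here is a minimal sketch written in VHDL to match the rest of the post. The package and function names are illustrative, and whether a given tool actually eliminates the tail call is tool-dependent; the point is that the partial result is carried in the acc parameter and handed straight to the next call, so no local "sum" variable is needed.

package tail_rec_pkg is
  function sum_to(n : natural; acc : natural := 0) return natural;
end package;

package body tail_rec_pkg is
  -- Sum the integers 1 to n, accumulating the partial result in 'acc'.
  function sum_to(n : natural; acc : natural := 0) return natural is
  begin
    if n = 0 then
      return acc;                  -- The accumulator already holds the answer.
    end if;
    return sum_to(n - 1, acc + n); -- Forward the partial sum; nothing remains to do after the call.
  end function;
end package body;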
In hardware, recursion does make sense when implementing n-ary trees of finite depth known at compile time. Not all tools support this, though. A notable tool that fails to support general recursion is Xilinx's Vivado, but even that supports "tail recursion", and it is the tool used here to experiment with and present the results.
Synthesised Structures
The code is too involved to be usefully provided here; instead, see the GitHub source tree. The following two elaborated designs demonstrate that both the iterative and the recursive implementation produce the same result. The main difference between the two is that the recursive version includes additional levels of hierarchy.


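To give a flavour of the recursive style without reproducing the full source, the pattern boils down to something like the following cut-down sketch. The entity, generic and signal names are illustrative rather than those in the repository, the real code carries further generics (such as the number of pipeline clock cycles), and the data width is assumed to be at least 2**shift_bits_g. Each level consumes the top bit of the shift amount, registers the result, and instantiates a smaller copy of itself as the last item in its architecture, which is the "tail recursion" form that Vivado can elaborate.

library ieee;
use ieee.std_logic_1164.all;

entity rotl_recursive is
  generic (
    data_width_g : positive := 16; -- Assumed >= 2**shift_bits_g for the slices below.
    shift_bits_g : positive := 4
  );
  port (
    clk     : in  std_logic;
    shift_i : in  std_logic_vector(shift_bits_g-1 downto 0);
    data_i  : in  std_logic_vector(data_width_g-1 downto 0);
    data_o  : out std_logic_vector(data_width_g-1 downto 0)
  );
end entity;

architecture rtl of rotl_recursive is
  constant rot_c : natural := 2**(shift_bits_g-1);
  signal   stage : std_logic_vector(data_width_g-1 downto 0);
begin

  -- Rotate left by a constant 2**(shift_bits_g-1) places when the top shift bit is set.
  process(clk)
  begin
    if rising_edge(clk) then
      if shift_i(shift_i'high) = '1' then
        stage <= data_i(data_width_g-1-rot_c downto 0) &
                 data_i(data_width_g-1 downto data_width_g-rot_c);
      else
        stage <= data_i;
      end if;
    end if;
  end process;

  -- Base case: every bit of the shift amount has been consumed.
  base_g : if shift_bits_g = 1 generate
    data_o <= stage;
  end generate;

  -- Tail-recursive case: a smaller copy of this entity handles the remaining bits.
  recurse_g : if shift_bits_g > 1 generate
    next_level : entity work.rotl_recursive
      generic map (
        data_width_g => data_width_g,
        shift_bits_g => shift_bits_g - 1
      )
      port map (
        clk     => clk,
        shift_i => shift_i(shift_bits_g-2 downto 0),
        data_i  => stage,
        data_o  => data_o
      );
  end generate;

end architecture;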
One aspect worth a brief thought is the case where more clock cycles are requested for the pipelining than are really necessary, perhaps in order to match the latency of a parallel path. In the limit, n bits of shift require at most n clock cycles over which to perform the work, so when more than n clock cycles are requested, pure delay is added as padding. With this function it matters little at which end the padding is provided. One small factor is the pipelining of the shift signal, which saves registers when the shift bits are used up before the padding stages.

The example above shows the padding when a 4-bit data bus is rotated by 2 bits of shift over 3 clock cycles. As only 2 clock cycles are strictly required, the third is just padding to meet the requested delay.
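As a sketch of how such padding might be expressed (the entity, generic and signal names here are hypothetical, not taken from the repository), the surplus stages are nothing more than a chain of registers:

library ieee;
use ieee.std_logic_1164.all;

entity delay_pad is
  generic (
    data_width_g : positive := 4;
    num_pads_g   : natural  := 1  -- num_clks_g - shift_bits_g in the full design.
  );
  port (
    clk    : in  std_logic;
    data_i : in  std_logic_vector(data_width_g-1 downto 0);
    data_o : out std_logic_vector(data_width_g-1 downto 0)
  );
end entity;

architecture rtl of delay_pad is
  type data_array_t is array (0 to num_pads_g) of std_logic_vector(data_width_g-1 downto 0);
  signal delay_r : data_array_t;
begin

  delay_r(0) <= data_i;

  -- Each surplus clock cycle is a plain register with no logic in between.
  pad_g : for i in 1 to num_pads_g generate
    process(clk)
    begin
      if rising_edge(clk) then
        delay_r(i) <= delay_r(i-1);
      end if;
    end process;
  end generate;

  data_o <= delay_r(num_pads_g);

end architecture;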
Results
| Synthesis Tool | Vivado v2023.2 (64-bit)  |
| Project Part   | Kintex 7 xc7k70tfbv676-1 |
A Tcl script was used to automate the generation of results to a CSV file, from which they can be plotted in your favourite spreadsheet software.
set path "...path_to/Barrel_Shift"

# What is Pulse Width Slack? How to calculate? How to rectify if negative slack occurs?
# https://adaptivesupport.amd.com/s/question/0D54U00007DvAqySAF/what-is-pulse-width-slack-how-to-calculate-how-to-rectify-if-negative-slack-occurs?language=en_US
#
# Return the open design's maximum clock frequency in MHz.
proc fmax {} {
  set tp [get_timing_paths -max_paths 1 -nworst 1 -setup]
  set setup [get_property SLACK $tp]
  set clk_period [expr {[get_property REQUIREMENT $tp] - $setup}]
  # The period is in ns, so the frequency in MHz is 1e3 / period.
  return [expr {1e3 / $clk_period}]
}

# Synthesise one configuration of the generics and append a line of results to
# the open CSV file 'resfile' in the caller's scope.
proc result {shift_bits num_clks recurse} {
  upvar resfile resfile
  set_property generic "shift_bits_g=$shift_bits shift_left_g=true num_clks_g=$num_clks recursive_g=$recurse" [current_fileset]
  reset_run synth_1
  launch_runs synth_1 -jobs 8
  wait_on_run synth_1
  open_run synth_1
  puts $resfile "$shift_bits,$num_clks,[fmax],[llength [get_cells -hier -filter {PRIMITIVE_GROUP == LUT}]],[llength [get_cells -hier -filter {PRIMITIVE_GROUP == FLOP_LATCH}]]"
}

foreach recurse {false true} {
  foreach shift_bits {9 10} {
    if {$recurse} {
      set resfilename [format "%s/results_%sbits_recursive.txt" $path $shift_bits]
    } else {
      set resfilename [format "%s/results_%sbits_iterative.txt" $path $shift_bits]
    }
    set resfile [open $resfilename w]
    puts $resfile "Shift Bits,Number of Clocks,Maximum Frequency (MHz),LUTs,Registers"
    flush $resfile
    for {set i 1} {$i <= [expr {$shift_bits + 1}]} {incr i} {
      result $shift_bits $i $recurse
      flush $resfile
    }
    close $resfile
  }
}


Note that the practical upper limit on clock frequency imposed by the MMCM primitives is ignored by this fmax procedure. The changes in clock frequency correspond to the changes in LUT depth between registers. The unevenness of the line for the LUT count corresponds to how efficiently the LUTs are packed: inefficient packing consumes more LUTs, i.e. the number of inputs covered by the LUTs falls short of the 6, 36 and 216 achievable for LUT depths of 1, 2 and 3 respectively. The number of registers remains entirely proportional to the number of clock periods over which the function is pipelined.
Alternative Implementation
The initial code used a for loop with a variable to unroll multiple rotations per clock cycle, depending on how many shift bits needed to be consumed. A simpler implementation could avoid the loop and use the VHDL operators ror and rol to perform a variable shift in one assignment per pipeline stage. Both the iterative and recursive VHDL files have two architectures demonstrating the difference between the manually coded shift and the rotation operators. The following chart shows clock speed results for the new architectures.

These performance results are less favourable. Looking at the elaboration of the logic, shown below for one pipeline stage, more expensive operators are inferred and these do not seem to get simplified down to the manually coded version's simple arrangement of LUTs. Looking at the chart, the performance difference converges at the point where only 2 bits of shift are consumed per stage, in the range of 5 to 7 pipeline stages. This implies that logic optimisation has not achieved the minimum number of LUTs for 3 or more bits per stage, which is why the clock speed suffers. Simply don't use these operators and take the slightly more involved initial architecture instead.

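For reference, a single stage of this operator-based style boils down to something like the sketch below (VHDL-2008, since std_logic_1164 only defines rol for std_logic_vector from that revision). The entity, generic and signal names are illustrative and the way the shift amount is sliced per stage is an assumption, not the exact architecture in the repository; the point is simply that the rotation amount is a variable, so synthesis must infer selection logic rather than a fixed wiring pattern.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity rol_stage is
  generic (
    data_width_g : positive := 16;
    slice_bits_g : positive := 2  -- Bits of the shift amount consumed by this stage.
  );
  port (
    clk     : in  std_logic;
    shift_i : in  std_logic_vector(slice_bits_g-1 downto 0);
    data_i  : in  std_logic_vector(data_width_g-1 downto 0);
    data_o  : out std_logic_vector(data_width_g-1 downto 0)
  );
end entity;

architecture rtl of rol_stage is
begin

  -- A variable rotation in a single assignment; in the full design each
  -- pipeline stage would apply its own weighted slice of the overall shift.
  process(clk)
  begin
    if rising_edge(clk) then
      data_o <= data_i rol to_integer(unsigned(shift_i));
    end if;
  end process;

end architecture;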
Conclusions
Assuming both VHDL implementations are near optimal, the iterative version is shorter, probably because it avoids the two multi-line instantiations. They both nest generate statements quite heavily, which makes the code harder to read. The iterative version cannot handle the decaying size of the pipelined shift signal: the shift_i signal remains a two-dimensional array, with bits left unassigned once they have been consumed, and the synthesis tool removes the unused bits. The recursive version handles this detail perfectly. If I were going to pick one for a real design, it would probably be the iterative version, as it is less scary to those who have not encountered recursion before.
So there's the code for instantiating barrel shift functions for bus sizes larger than you can imagine. The remaining question is, where would one actually employ this scalable pipelined function anyway?
References
- GitHub Source Code
- Other blogs on recursive components