I've seen several VHDL implementations that seem to obfuscate the beauty and simplicity of the polynomial division modulo 2 calculation with unnecessary code. I present here the essential single line of VHDL that is required to perform the division inside a clocked VHDL process for any polynomial. I then extend this solution to multiple bits of work per clock cycle with a simple extension of the single line such that between the VHDL and the synthesis tool, all the necessary logic for e.g. byte wide data is derived for you without resorting to any tables in standards documentation. This work actually predates 1999 and therefore the age of blogging, but no one else has written up an equally succinct version of this since then hence it is appearing belatedly.
- Serial Calculation
- Parallel Calculation
- Measuring Parallel Performance
- Alternative Form
- Measuring Parallel Performance
- Conclusions
- References
A quick reminder of the calculation we are trying to perform in Galois Field GF(2). These examples are fully expanded to show the intermediate calculation states on each clock cycle.
Note the order of the powers, highest power on the left is the MSB, lowest power (\(x^0 = 1\)) is the LSB below.
quotient 10100001 x^7+x^5+1 11010 | 111001011010 dividend x^11+x^10+x^9+x^6+x^4+x^3+x divisor | message x^4+x^3+x reg from reset 00000 reg'high = 0 00001 reg 00000 reg'high = 0 00011 reg 00000 reg'high = 0 00111 reg 00000 reg'high = 0 01110 reg 00000 reg'high = 0 11100 reg 11010 reg'high = 1 01101 reg 00000 reg'high = 0 11010 reg 11010 reg'high = 1 00001 reg 00000 reg'high = 0 00011 reg 00000 reg'high = 0 00110 reg 00000 reg'high = 0 01101 reg 00000 reg'high = 0 11010 reg 11010 reg'high = 1 0000 remainder = 0
quotient 10101100 x^7+x^5+x^3+x^2 11011 | 111001010000 dividend x^11+x^10+x^9+x^6+x^4 divisor | message x^4+x^3+x+1 reg from reset 00000 reg'high = 0 00001 00000 reg'high = 0 00011 00000 reg'high = 0 00111 00000 reg'high = 0 01110 00000 reg'high = 0 11100 11011 reg'high = 1 01111 00000 reg'high = 0 11110 11011 reg'high = 1 01011 00000 reg'high = 0 10110 11011 reg'high = 1 11010 11011 reg'high = 1 00010 00000 reg'high = 0 00100 00000 reg'high = 0 0100 remainder = x^2
These examples can be verified using on on-line tool, Galois Field GF(2) Calculator, written by Dr. Richard Tervo. This tool makes for an excellent source of test data with independent verification! The image below shows the generalised logic schematic for performing the division modulo 2 from any polynomial of length 5 and remainder of length 4 bits. Its trivial to see how this generalises for any length polynomial. We discard the quotient as uninteresting, we're only interested in the remainder, for example for the purposes of Cyclic Redundancy Check calculations.
Another neat trick is using Matlab or Octave to generate test data as follows:
pkg load communications
a = gf([1 1 0 0 0 1 0 1]);
b = gf([1 0 1 0]);
% deconvolution is equivalent to division
[q, r] = deconv(a,b);
>> q q = GF(2) array. Array elements = 1 1 1 1 1 >> r r = GF(2) array. Array elements = 0 0 0 0 0 0 1 1
Serial Calculation

Drawn in https://www.circuit-diagram.org/.
Therefore the VHDL for a generic component can now easily be envisaged, for any polynomial coefficients of any length. The code below assumes that the polynomial is not fixed, but it usually is.
library ieee;
use ieee.std_logic_1164.all;
entity polybitdiv is
generic(
len_g : positive := 5
);
port(
clk : in std_ulogic;
reset : in std_ulogic;
-- Polynomial bit order specification is: x^n + .. + x^4 + x^3 + x^2 + x^1 + 1
-- 543210
-- E.g. 100101 represents x^5 + x^2 + x^0
poly : in std_ulogic_vector(len_g-1 downto 0);
data_in : in std_ulogic;
data_valid_in : in std_ulogic;
data_out : out std_ulogic_vector(len_g-2 downto 0)
);
end entity;
architecture rtl of polybitdiv is
subtype reg_t is std_ulogic_vector(len_g-1 downto 0); -- poly'length
signal reg : reg_t;
begin
process(clk)
begin
if rising_edge(clk) then
if reset = '1' then
reg <= (others => '0');
elsif data_valid_in = '1' then
-- NB. at this point reg(reg'high) = '0' always
reg <= (reg(reg'high-1 downto 0) & data_in) xor
(poly and reg_t'(others => reg(reg'high-1)));
end if;
end if;
end process;
-- Discard reg(reg'high) = '0'
data_out <= reg(data_out'range);
end architecture;
The essential work is done by a single assignment on line 32. If the highest indexed bit of the remainder in progress is high, add in the polynomial in order to zero the bit about to be shifted out and chalk up a '1' bit in the corresponding it of the quotient. Realistically data is not going to be shifted in serially, but for the purposes of introduction it is a useful step in the illustration of how to create a more parallel implementation.
Parallel Calculation
Real world examples include CRC32 and CRC-based framing using the Header Error Control (HEC) in ATM. The latter cannot use serial data to calculate the HEC as data is framed into at least bytes. Therefore the HEC must be calculated 8-bits wide every clock cycle. Assuming the bytes are correctly framed and not offset, the HEC calculation will indicate the start of an ATM packet, and hence it is possible to separate ATM cells in the data stream. This requires a more parallel computation of the remainder after polynomial division. Essentially performing multiple bits of work per clock cycle, sufficient to cover the width of the input data.
library ieee;
use ieee.std_logic_1164.all;
entity polydiv is
generic(
-- Polynomial bit order specification is: x^n + .. + x^4 + x^3 + x^2 + x^1 + 1
-- 543210
-- E.g. 100101 represents x^5 + x^2 + x^0
poly_g : std_ulogic_vector -- direction 'to' or 'downto' does not matter.
);
port(
clk : in std_ulogic;
reset : in std_ulogic;
data_in : in std_ulogic_vector; -- bus width independent of 'poly_g', direction 'to' or 'downto' does not matter.
data_valid_in : in std_ulogic;
data_out : out std_ulogic_vector(poly_g'length-2 downto 0)
);
end entity;
architecture rtl of polydiv is
subtype reg_t is std_ulogic_vector(poly_g'length-1 downto 0);
signal reg : reg_t;
begin
process(clk)
variable xored_i : reg_t;
begin
if rising_edge(clk) then
if reset = '1' then
reg <= (others => '1');
elsif data_valid_in = '1' then
xored_i := reg;
-- Do multiple bits of divisor per cycle.
for i in data_in'range loop
-- NB. at this point xored_i(xored_i'high) = '0' always
-- Written this way, poly_g can be either "0 to n", or "n downto 0"
xored_i := (xored_i(xored_i'high-1 downto 0) & data_in(i)) xor
(poly_g and reg_t'(others => xored_i(xored_i'high-1)));
end loop;
reg <= xored_i;
end if;
end if;
end process;
data_out <= reg(data_out'range);
end architecture;
The above code has moved to a constant polynomial, which can either be defined as a constant in the architecture or a generic in the entity. The essential work is now wrapped in a for loop, and unwrapped and minimised by synthesis. The use of a constant polynomial means the AND gates disappear and also the EOR gates where there's a '0' in the polynomial. The logic depth is then increased by the additional work per clock cycle. This logic depth is illustrated in the following three thumbnails. They show the work of a single clock cycle as drawn by Vivado where the work of each bit is superimposed (because its a regular array). As the number of bits of parallel work is increased to 2 then 3, the potential logic depth increases linearly and regularly. With a constant polynomial, all the AND gates disappear with any EOR gates no longer required due to the '0's in the polynomial. Since polynomials need to contain a few '0' coefficients, the clock timing never really gets out of hand. A sparse polynomial operates at reasonable speeds for wide data.
Measuring Parallel Performance
Let's put that clock timing claim to the test. We'll calculate a maximum clock speed from the worst-case slack as follows in TCL in Vivado.
Part | xc7z007sclg225-2 |
---|---|
Vivado Version | 2019.1.1 |
set maxSetup [get_property SLACK \
[get_timing_paths -max_paths 1 -nworst 1 -setup]]
set maxClkPeriod [expr [get_property REQUIREMENT \
[get_timing_paths -max_paths 1 -nworst 1 -setup]] - $maxSetup]
# MHz
set maxClkFreq [expr 1/($maxClkPeriod * 1e-9) / 1e6]
Bus Width | Minimum clock period (ns) | Maximum clock frequency (MHz) |
---|---|---|
1 | 1.924 | 519.8 |
2 | 1.924 | 519.8 |
4 | 1.924 | 519.8 |
8 | 1.924 | 519.8 |
16 | 1.943 | 514.7 |
32 | 2.306 | 433.7 |
Update 25/10/2024
Alternative Form
There is also a popular alternative composition of logic for CRCs. It gets a different answer to the implementation above, but achieves the same function of error checking. The two are related, and an explanation of the different form of the implementations is provided in Cyclic Redundancy Check, Mathematical and Hardware Overview.

Again, this form can be created using a generic polynomial as follows:
architecture rtl2 of polydiv is
-- NB. poly_g(0) is assumed to be '1' and ignored. Alias required in order to slice.
alias poly_a : std_ulogic_vector(poly_g'length-1 downto 0) is poly_g;
subtype reg_t is std_ulogic_vector(poly_g'length-2 downto 0);
begin
process(clk)
variable xored_i : reg_t;
variable fb : std_ulogic;
begin
if rising_edge(clk) then
if reset = '1' then
data_out <= (others => '1');
elsif data_valid_in = '1' then
xored_i := data_out;
-- Do multiple bits of divisor per cycle.
for i in data_in'range loop
fb := data_in(i) xor xored_i(xored_i'high);
xored_i := (xored_i(xored_i'high-1 downto 0) & fb) xor
(poly_a(poly_a'high-1 downto 1) & '0' and reg_t'(others => fb));
end loop;
data_out <= xored_i;
end if;
end if;
end process;
end architecture;
In order to verify the above implementation, I compared it with (many of) those that could be generated with online tools. A great looking tool that can generate RTL code specific to a chosen polynomial and provides a detailed explanation is provided at OutputLogic.com. The difference in this blog is that no code generation step is required as the polynomial is a generic and can be changed on-the-fly. With code generation, you need to manually specify the polynomial in a web page, and copy and paste the resulting code back to your local area with each change of polynomial. The good news is, once the polynomial is specified, we both agree on the quotient result, but we write the polynomial in a different order. I simply follow the tap order used at the test data website at UNB, although a new source of test data is required for the alternative format. Two such websites are:
It is worth noting that the result generators above include some parameters as follows, taken from Understanding and implementing CRC (Cyclic Redundancy Check) calculation by Bastian Molkenthin:
Parameter | Explanation |
---|---|
Initial Value | The value used to initialize the CRC value / register. |
Input reflected | If this value is TRUE, each input byte is reflected before being used in the calculation. Reflected means that the bits of the input byte are used in reverse order. So this also means that bit 0 is treated as the most significant bit and bit 7 as least significant. Example with byte 0x82 = b10000010: Reflected(0x82) = Reflected(b10000010) = b01000001 = 0x41. |
Result reflected | If this value is TRUE, the final CRC value is reflected before being returned. The reflection is done over the whole CRC value, so e.g. a CRC-32 value is reflected over all 32 bits. |
Final XOR value | The Final XOR value is xored to the final CRC value before being returned. This is done after the 'Result reflected' step. Obviously a Final XOR value of 0 has no impact. |
Check value [Optional] | This value is not required but often specified to help to validate the implementation. This is the CRC value of input string "123456789" or as an ASCII byte array: [0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39]. |
None of these additional parameters have been implemented here. They are trivial to add as required or as generics for a general solution. This blog concentrates on the generic polynomial implementation.
Measuring Parallel Performance
Given the list of parameters above, I should state them, I am using a CRC-32/MPEG-2, that means:
- Polynomial is \(x^{32}+x^{26}+x^{23}+x^{22}+x^{16}+x^{12}+x^{11}+x^{10}+x^8+x^7+x^5+x^4+x^2+x^1+1\), or poly = "100000100110000010001110110110111"
- no reflections in or out,
- no final XOR,
- calculation is initialised to ones.
As time has passed since the last results, the tools have been upgraded, so the Vivado version used for testing now has changed.
Part | xc7z007sclg225-2 |
---|---|
Vivado Version | 2023.2 |
Original | Alternative | |||
---|---|---|---|---|
Bus Width | Minimum clock period (ns) | Maximum clock frequency (MHz) | Minimum clock period (ns) | Maximum clock frequency (MHz) |
1 | 0.915 | 1092.9 | 1.314 | 761.0 |
2 | 1.037 | 964.3 | 1.372 | 728.9 |
4 | 1.316 | 759.9 | 1.554 | 643.5 |
8 | 1.546 | 646.8 | 2.393 | 417.9 |
16 | 1.943 | 514.7 | 2.186 | 457.5 |
32 | 2.306 | 433.7 | 2.845 | 351.5 |
The results for the original CRC on the higher bus widths continue to agree, butdoubt is cast over the original figures for the lower bus widths, which now allow for higher clock speeds. This may be down to a different usage of constraints since the Out of Context synthesis method has been amended.


The alternative CRC implementation uses one less register at the expense of a reduction in maximum clock speed. The clock speed is so high already and the available gates so plentiful that the difference in performance is irrelevant. More important is getting the desired answer.
Conclusions
In FPGAs, where logic is packed into LUTs, the parallel implementation is limited in clock speed by the delay across only one or two LUTs. It is entirely practical to calculate the CRC32 of a message with 32-bit wide data. It is possible to increase the data width beyond 32-bits too. The logic required for the parallel calculation is derived for you from the polynomial generic with no specialist implementation, making this implementation especially convenient and re-usable.