I've seen several VHDL implementations that obfuscate the beauty and simplicity of the polynomial division modulo 2 calculation with unnecessary code. I present here the essential single line of VHDL required to perform the division inside a clocked process for any polynomial. I then extend this single line to perform multiple bits of work per clock cycle, so that between the VHDL and the synthesis tool, all the logic necessary for e.g. byte-wide data is derived for you without resorting to any tables in standards documentation. This work actually predates 1999, and therefore the age of blogging, but since no one else has written up an equally succinct version in the meantime, it appears here belatedly.
A quick reminder of the calculation we are trying to perform in the Galois field GF(2). These examples are fully expanded to show the intermediate calculation states on each clock cycle.
Note the order of the powers: the highest power on the left is the MSB, and the lowest power (\(x^0 = 1\)) is the LSB, in the examples below.
These examples can be verified using an on-line tool, the Galois Field GF(2) Calculator, written by Dr. Richard Tervo. This tool makes for an excellent source of test data with independent verification! The image below shows the generalised logic schematic for performing the division modulo 2 by any polynomial of length 5, giving a remainder of 4 bits. It's trivial to see how this generalises to any length of polynomial. We discard the quotient as uninteresting; we're only interested in the remainder, for example for the purposes of Cyclic Redundancy Check (CRC) calculations.
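As a concrete illustration, the division can be expanded step by step in a few lines of Python (a sketch added here, not part of the original article; the message value is illustrative, not one of the article's worked examples):

```python
# A Python sketch (not from the article) of long division modulo 2,
# printing the intermediate remainder after each message bit is shifted in.
# Polynomial x^4 + x + 1 -> coefficient bits "10011"; remainder is 4 bits.
def poly_mod2_divide(message_bits, poly_bits):
    """Return the remainder of message / polynomial in GF(2), MSB first."""
    deg = len(poly_bits) - 1           # degree of the polynomial
    poly = int(poly_bits, 2)
    rem = 0
    for bit in message_bits:
        rem = (rem << 1) | int(bit)    # bring down the next message bit
        if rem & (1 << deg):           # top bit set: subtract (XOR) the divisor
            rem ^= poly
        print(f"in={bit} remainder={rem:0{deg}b}")
    return rem

# Illustrative message:
r = poly_mod2_divide("11010011101100", "10011")
print(f"final remainder = {r:04b}")    # -> 1000
```

Each iteration of the loop corresponds to one clock cycle of the serial hardware described below.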
Another neat trick is using Matlab or Octave to generate test data as follows:
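In the same spirit, test vectors can be generated in a few lines of Python (a hedged stand-in for the article's Octave approach; the function name, message width and choice of polynomial are all illustrative):

```python
# Generate (message, remainder) pairs for use as testbench stimulus.
# Illustrative sketch: polynomial x^4 + x + 1, 16-bit random messages.
import random

def gf2_remainder(msg, poly, deg):
    """Remainder of msg (list of bits, MSB first) divided by poly in GF(2)."""
    rem = 0
    for bit in msg:
        rem = (rem << 1) | bit
        if rem & (1 << deg):
            rem ^= poly                # subtract (XOR) the divisor
    return rem

random.seed(1)                         # reproducible test vectors
POLY, DEG = 0b10011, 4                 # x^4 + x + 1
for _ in range(4):
    msg = [random.randint(0, 1) for _ in range(16)]
    print("".join(map(str, msg)), f"{gf2_remainder(msg, POLY, DEG):04b}")
```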
Serial Calculation
Drawn in https://www.circuit-diagram.org/.
Therefore the VHDL for a generic component can now easily be envisaged, for any polynomial coefficients of any length. The code below assumes that the polynomial is not fixed, although in practice it usually is.
The essential work is done by a single assignment on line 32: if the highest-indexed bit of the remainder in progress is high, add in the polynomial in order to zero the bit about to be shifted out, and chalk up a '1' in the corresponding bit of the quotient. Realistically, data is not going to be shifted in serially, but for the purposes of introduction it is a useful step towards a more parallel implementation.
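The per-clock update can be sketched behaviourally in Python (not the article's VHDL; `WIDTH`, `POLY` and `clock_tick` are illustrative names, and the polynomial is again \(x^4 + x + 1\)):

```python
# One clock cycle of the serial divider: shift in one data bit, and if the
# bit falling off the top of the remainder is '1', XOR in the polynomial
# to cancel it -- the same work as the single VHDL assignment.
WIDTH = 4                  # remainder width = polynomial degree
POLY  = 0b0011             # low WIDTH bits of x^4 + x + 1 (the implicit
                           # top coefficient is the bit being shifted out)

def clock_tick(remainder, data_bit):
    """One clock cycle of the serial divider."""
    msb = (remainder >> (WIDTH - 1)) & 1
    remainder = ((remainder << 1) | data_bit) & ((1 << WIDTH) - 1)
    if msb:                            # the discarded bit was '1':
        remainder ^= POLY              # subtract (XOR) the polynomial
    return remainder

rem = 0
for b in [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]:   # illustrative message
    rem = clock_tick(rem, b)
print(f"{rem:04b}")                    # -> 1000
```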
Parallel Calculation
Real-world examples include CRC32 and CRC-based framing using the Header Error Control (HEC) byte in ATM. The latter cannot use serial data to calculate the HEC because the data arrives at least a byte at a time, so the HEC must be calculated 8 bits wide every clock cycle. Assuming the bytes are correctly framed and not offset, the HEC calculation indicates the start of an ATM cell, and hence makes it possible to delineate ATM cells in the data stream. This requires a more parallel computation of the remainder after polynomial division: essentially performing multiple bits of work per clock cycle, sufficient to cover the width of the input data.
The code has now moved to a constant polynomial, which can be defined either as a constant in the architecture or as a generic in the entity. The essential work is now wrapped in a for loop, which is unrolled and minimised by synthesis. With a constant polynomial the AND gates disappear, along with the EOR gates wherever there's a '0' in the polynomial. The logic depth is then increased by the additional work per clock cycle, as illustrated in the following three thumbnails. They show the work of a single clock cycle as drawn by Vivado, where the work of each bit is superimposed (because it's a regular array). As the number of bits of parallel work is increased to 2 and then 3, the potential logic depth increases linearly and regularly. Since useful polynomials tend to contain a number of '0' coefficients, the clock timing never really gets out of hand, and a sparse polynomial operates at reasonable speed even for wide data.
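A behavioural sketch of the parallel form in Python (hedged: the article's code is VHDL, and all names here are illustrative), using the ATM HEC generator polynomial \(x^8 + x^2 + x + 1\) and checking that one byte-wide step matches eight serial single-bit steps:

```python
# The single-bit update wrapped in a for loop, doing nbits of work per clock.
# Synthesis unrolls and minimises the loop; this model just executes it.
WIDTH = 8
POLY  = 0b00000111         # low 8 bits of x^8 + x^2 + x + 1 (ATM HEC)

def clock_tick_parallel(remainder, data, nbits):
    """One clock cycle consuming nbits of data, MSB first."""
    for i in range(nbits - 1, -1, -1):
        bit = (data >> i) & 1
        msb = (remainder >> (WIDTH - 1)) & 1
        remainder = ((remainder << 1) | bit) & ((1 << WIDTH) - 1)
        if msb:
            remainder ^= POLY          # subtract (XOR) the polynomial
    return remainder

# The byte-wide result matches eight serial single-bit steps:
serial = parallel = 0
for byte in [0x47, 0x11, 0xA5]:        # illustrative data bytes
    parallel = clock_tick_parallel(parallel, byte, 8)
    for i in range(7, -1, -1):
        serial = clock_tick_parallel(serial, (byte >> i) & 1, 1)
print(hex(parallel), parallel == serial)
```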
Measuring Parallel Performance
Let's put that clock timing claim to the test. We'll derive a maximum clock frequency from the worst-case slack reported by Vivado, queried via Tcl.
Part | xc7z007sclg225-2 |
---|---|
Vivado Version | 2019.1.1 |
Bus Width | Achieved clock period (ns) | Achieved clock frequency (MHz) |
---|---|---|
1 | 1.924 | 519.8 |
2 | 1.924 | 519.8 |
4 | 1.924 | 519.8 |
8 | 1.924 | 519.8 |
16 | 1.943 | 514.7 |
32 | 2.306 | 433.7 |
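As a sanity check on the arithmetic (a small Python check added here, not part of the original measurements), each frequency in the table is simply the reciprocal of the corresponding period:

```python
# Frequency (MHz) = 1000 / period (ns); reproduce the table's right column.
for width, period_ns in [(1, 1.924), (2, 1.924), (4, 1.924),
                         (8, 1.924), (16, 1.943), (32, 2.306)]:
    print(width, round(1000.0 / period_ns, 1), "MHz")
```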
Conclusions
In Xilinx FPGAs, where logic is packed into LUTs, the parallel implementation's clock speed is limited by the delay across only one or two LUTs. It is practical to calculate the CRC32 of a message with 32-bit-wide data. The logic required for the parallel calculation is derived for you from the polynomial generic, with no specialist implementation required.