128-bit scrambling and CRC computations

Ovidiu Lupas

Nov 29, 2001, 4:38:37 AM

Hi all,

In my current project I have to implement scrambling and CRCs over a
128-bit data bus at a clock rate of 100 MHz. My combinatorial areas
are huge and I am having problems meeting the speed requirements.

Could someone give me a hint on how to overcome this problem?
Any hints will be appreciated.

Thank you for your time.

Best regards,
Ovidiu Lupas.

rickman

Nov 29, 2001, 9:31:12 AM

You can separate out the inputs from the feedback signals and pipeline
the input signals. This will help a lot. However, roughly half the
signals will be feedback, which you cannot pipeline. With such a large
data bus, you may still have problems meeting timing.

I know someone who may be able to help you further. I will send this
message to him.

--

Rick "rickman" Collins

rick.c...@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

rickman

Nov 29, 2001, 10:49:12 AM

Just out of curiosity, is this a paying job or for the open cores thing?
I might be able to get you some professional help if you are working on
the open cores thing. That was meant in the EE sense, not in the psych
sense of professional help...

BTW, you might also look at the implemented logic to see if
optimizations are being done on the logic. There are a lot of duplicate
terms in the equations you end up with and you should see sharing of
logic between the bits. This can also slow you down a bit if done
poorly. Not a bad place for hand optimization of the source code.

Ovidiu Lupas

Nov 29, 2001, 2:36:06 PM

Thank you for your reply.

This is a paying job. For OpenCores, I am currently working on a
development board, at board level (schematic, PCB).

Unfortunately, in the past year, my time available for OC activities
almost disappeared ... Now I have restarted the activity. I hope there
will be some results to show soon.

Nicholas Weaver

Nov 29, 2001, 4:15:04 PM

In article <c2088d4a.01112...@posting.google.com>,

Ovidiu Lupas <olu...@opencores.org> wrote:
>Hi all,
>
>In my current project I have to implement scrambling and CRCs over a
>128-bit data bus at a clock rate of 100 MHz. My combinatorial areas
>are huge and I am having problems meeting the speed requirements.
>
>Could someone give me a hint on how to overcome this problem?
>Any hints will be appreciated.

What exactly are your scrambling requirements?

For making your CRC fast, reduce the math by unrolling the CRC (it's
pretty easy to do) to make wider XOR trees. 100 MHz should be no
problem on a modern part. I have a Rijndael core I'm building in a
Spartan II-5, and it runs at >110 MHz.
--
Nicholas C. Weaver nwe...@cs.berkeley.edu

glen herrmannsfeldt

Nov 29, 2001, 8:01:37 PM

olu...@opencores.org (Ovidiu Lupas) writes:

>Hi all,

>In my current project I have to implement scrambling and CRCs over a
>128-bit data bus at a clock rate of 100 MHz. My combinatorial areas
>are huge and I am having problems meeting the speed requirements.

>Could someone give me a hint on how to overcome this problem?
>Any hints will be appreciated.

Well, I think the hint is pipelining. I think you can pipeline
the CRC algorithm.

Do you mean 12,800 Mbit/second? That is what 128 bit at 100MHz should
mean. The traditional hardware CRC is bit serial, but the common
software implementation is byte serial with a 256 entry lookup table.
If you have 16 such tables and enough pipeline registers I think
it can be done.

-- glen

Allan Herriman

Nov 29, 2001, 9:46:21 PM

On Thu, 29 Nov 2001 21:15:04 +0000 (UTC), nwe...@CSUA.Berkeley.EDU
(Nicholas Weaver) wrote:

>In article <c2088d4a.01112...@posting.google.com>,
>Ovidiu Lupas <olu...@opencores.org> wrote:
>>Hi all,
>>
>>In my current project I have to implement scrambling and CRCs over a
>>128-bit data bus at a clock rate of 100 MHz. My combinatorial areas
>>are huge and I am having problems meeting the speed requirements.
>>
>>Could someone give me a hint on how to overcome this problem?
>>Any hints will be appreciated.
>
>What exactly are your scrambling requirements?

One would guess that as 128 bits x 100MHz = 12.8Gbps, this would be
either RFC 2615 POS over OC192, or 10G Ethernet.

From http://www.ietf.org/rfc/rfc2615.txt?number=2615

"4. X**43 + 1 Scrambler Description

The X**43 + 1 scrambler transmitter and receiver operation are as
follows:

Transmitter schematic:

                                        Unscrambled Data
                                                |
                                                v
   +-------------------------------------+    +---+
+->|    --> 43 bit shift register -->    |--->|xor|
|  +-------------------------------------+    +---+
|                                               |
+-----------------------------------------------+
                                                |
                                                v
                                         Scrambled Data
"

Please note that this is the 'serial prototype' and doesn't look
anything like the parallel version that the OP requires.
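
A minimal sketch of how the serial prototype above might be flattened into a word-wide version, assuming the first-transmitted bit sits at index 0 of each word and that the state holds the last 43 scrambled bits, oldest first. The function names `scramble_serial` and `scramble_word` are hypothetical. For a 128-bit word, each output bit in this model is the XOR of at most three data bits and one state bit.

```python
TAP = 43  # x^43 + 1: output bit = input bit XOR the output bit 43 bit times earlier


def scramble_serial(bits, state):
    """Bit-serial reference model. `state` is the last 43 scrambled bits, oldest first."""
    state = list(state)
    out = []
    for d in bits:
        s = d ^ state[0]            # state[0] is the output from 43 bit times ago
        out.append(s)
        state = state[1:] + [s]     # shift the new scrambled bit in
    return out, state


def scramble_word(word, state):
    """Scramble a whole word in one step using the flattened equations.

    out[i] = word[i] ^ word[i-43] ^ word[i-86] ^ ... ^ state[i % 43]
    """
    out = []
    for i in range(len(word)):
        s = state[i % TAP]
        for j in range(i, -1, -TAP):    # i, i-43, i-86, ... down to i % 43
            s ^= word[j]
        out.append(s)
    new_state = (list(state) + out)[-TAP:]  # last 43 scrambled bits
    return out, new_state


if __name__ == "__main__":
    word = [1, 0, 1, 1] * 32            # one arbitrary 128-bit word
    scrambled, next_state = scramble_word(word, [0] * TAP)
    print(scrambled[:8], len(next_state))
```

Because only the previous 43 outputs feed back, the feedback path in this sketch stays shallow no matter how wide the word is; the serial function is there only as a reference to compare against.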

>100 MHz should be no
>problem on a modern part. I have a Rijndael core I'm building in a
>Spartan II-5, and it runs at >110 MHz.

I suspect that the OP could actually drop the clock rate to ~85MHz and
still meet the line rate requirements.
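
As a rough check of the numbers in this exchange, the arithmetic can be sketched as below. The OC-192 line rate of 9,953.28 Mbit/s is the standard SONET figure; the function names are hypothetical, and any extra margin a real design needs is not modelled.

```python
OC192_LINE_RATE_BPS = 9_953_280_000   # SONET OC-192 line rate in bit/s
BUS_WIDTH_BITS = 128


def bus_rate_bps(clock_hz, bus_width_bits):
    """Raw throughput of a parallel bus: one word per clock."""
    return clock_hz * bus_width_bits


def min_clock_hz(line_rate_bps, bus_width_bits):
    """Lowest clock at which the bus still keeps up with the line rate."""
    return line_rate_bps / bus_width_bits


if __name__ == "__main__":
    print(bus_rate_bps(100_000_000, BUS_WIDTH_BITS) / 1e9, "Gbit/s at 100 MHz")
    print(min_clock_hz(OC192_LINE_RATE_BPS, BUS_WIDTH_BITS) / 1e6, "MHz minimum for OC-192")
```

The minimum comes out below the ~85 MHz mentioned above, which is consistent with that figure leaving some headroom.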

Regards,
Allan.

rickman

Nov 29, 2001, 9:57:48 PM

Nice try, but you can't pipeline the full CRC calculation. Since the
feedback term is needed from the full 128 bit register on every clock
cycle, you can't pipeline the calc. If you try you don't have the full
128 bit feedback inputs until all 16 bytes have been processed.

The CRC calculations always boil down to the XOR of a set of inputs and
feedback signals, or consider this a single bit sum. You can separately
sum the input signals and the feedback signals. Each output bit to be
calculated will use about half of the inputs and half of the feedback
bits. So you can use a pipeline to "precalculate" the input signals down
to one bit. But the feedback bits always have to be done in one clock
cycle. You might be able to optimize the speed by using the block ram as
a wide fan-in XOR gate. Depends on the part you are targeting. There may
also be a way to use the multipliers or adders in some FPGA families. A
32 bit adder can "modulo two count" the ones in a 32 bit word. Or two 16
bit adders halve the carry time and can be summed in one LUT. With a
little hand tuning, you might be able to use the MSB LUT for this. I bet
Ray A has a Viewlogic macro that already does this :)

I bet this is a CRC-32 (or CRC-16) being done on an OC-192 at 10 Gbps.
That is where I saw this done before. The initial signal is bit serial,
but the payload is being processed in an FPGA at about 80 or so MHz in a
128 bit wide word. Just a guess.

Ovidiu Lupas

Nov 30, 2001, 11:33:01 AM

> I bet this is a CRC-32 (or CRC-16) being done on an OC-192 at 10 Gbps.
> That is where I saw this done before. The initial signal is bit serial,
> but the payload is being processed in an FPGA at about 80 or so MHz in a
> 128 bit wide word. Just a guess.
>

Exactly, OC-192 data that has to be processed 128-bit wide at 90 MHz.
Actually, it is an ATM O.191 Test Cell processor, and I cannot afford
pipelining at this point.

Thanks,
Ovidiu

glen herrmannsfeldt

Nov 30, 2001, 8:16:04 PM

allan_herrim...@agilent.com (Allan Herriman) writes:
>(Nicholas Weaver) wrote:

>>Ovidiu Lupas <olu...@opencores.org> wrote:
>>>
>>>In my current project I have to implement scrambling and CRCs over a
>>>128-bit data bus at a clock rate of 100 MHz. My combinatorial areas
>>>are huge and I am having problems meeting the speed requirements.
>>>
>>>Could someone give me a hint on how to overcome this problem?
>>>Any hints will be appreciated.
>>
>>What exactly are your scrambling requirements?

>One would guess that as 128 bits x 100MHz = 12.8Gbps, this would be
>either RFC 2615 POS over OC192, or 10G Ethernet.

>From http://www.ietf.org/rfc/rfc2615.txt?number=2615

>"4. X**43 + 1 Scrambler Description

> The X**43 + 1 scrambler transmitter and receiver operation are as
> follows:

> Transmitter schematic:

>                                         Unscrambled Data
>                                                 |
>                                                 v
>    +-------------------------------------+    +---+
> +->|    --> 43 bit shift register -->    |--->|xor|
> |  +-------------------------------------+    +---+
> |                                               |
> +-----------------------------------------------+
>                                                 |
>                                                 v
>                                          Scrambled Data
>"

>Please note that this is the 'serial prototype' and doesn't look
>anything like the parallel version that the OP requires.

That is neat. x**43+1 is not in the table in 'Numerical Recipes',
but it is real convenient not to have many 1's in it.

I believe that makes the parallel implementation much easier.

I once did a software implementation of x**64+x**4+x**3+x+1,
using 32 bit math. It is convenient in not having any terms
from x**63 down to x**32, so it is easy to do in 32 bit int's.

It should then take one or two 256x44 lookup tables, a
small number of XOR gates, and enough latches to make the
pipeline work. Much easier than I was expecting for a CRC
with more terms in it. Doing it 8 bits at a time is convenient
for software implementations. In this case, the optimal number
may be different depending on memory cost and latch cost.

The 8 bit parallel C macro looks like:

#define UPDC32(x,y) (crc_32_tab[((x)^(y))&0xff]^(((y)>>8)&0xffffff))

Where x is the new byte, and y is the accumulated CRC value.
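
The post does not show `crc_32_tab`. The sketch below assumes it is the usual 256-entry table for the reflected CRC-32 polynomial (0xEDB88320) and that the CRC is initialised to and finally XORed with 0xFFFFFFFF; those conventions, and the helper names `make_crc32_table` and `crc32_bytes`, are assumptions rather than something stated in the thread.

```python
POLY_REFLECTED = 0xEDB88320  # assumed: CRC-32 polynomial in bit-reversed form


def make_crc32_table():
    """Build the 256-entry table: entry b is the CRC register after shifting in byte b."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table


crc_32_tab = make_crc32_table()


def updc32(x, y):
    """Same step as the UPDC32 macro: x is the new byte, y the accumulated CRC."""
    return crc_32_tab[(x ^ y) & 0xFF] ^ ((y >> 8) & 0xFFFFFF)


def crc32_bytes(data):
    """Byte-serial CRC-32 over a bytes object, one table lookup per byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = updc32(b, crc)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(crc32_bytes(b"123456789")))
```

Each step needs the full previous CRC value, which is why the byte-serial form by itself does not remove the feedback dependency discussed elsewhere in the thread.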

Any book on pipelined processor architecture should help you
understand how to arrange the latches to pipeline the computation.

-- glen

Allan Herriman

Dec 1, 2001, 5:34:42 AM

On 1 Dec 2001 01:16:04 GMT, g...@ugcs.caltech.edu (glen herrmannsfeldt)
wrote:

... or you could write it in behavioural VHDL or Verilog and let the
synthesiser do the work of unrolling the loops. I've tried this for
LFSRs in VHDL at these rates, and yes, it does work. Much easier to
read than C, too.

Perhaps instead of "it does work" I should say "it can be made to
work," for design isn't trivial at these rates ;-)

Regards,
Allan.

rickman

Dec 1, 2001, 1:10:53 PM

That is exactly where we were doing it. I believe this is CRC-16, right?
If you can't do pipelining at all because you need to match delays with
other cell processing, then you will have about 128 inputs to the XOR
tree. This will take four levels of logic (4 input LUTs) if you can
force the synthesizer to keep the tree balanced.
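
The level count quoted above can be sanity-checked with a small model of a balanced reduction tree. This is only a sketch: it assumes every LUT is used as a plain k-input XOR and ignores routing delay and any carry or cascade tricks; `lut_levels` is a hypothetical helper name.

```python
def lut_levels(num_inputs, lut_inputs=4):
    """Levels in a balanced tree of lut_inputs-input XOR LUTs reducing num_inputs to one signal."""
    levels = 0
    signals = num_inputs
    while signals > 1:
        signals = -(-signals // lut_inputs)   # ceiling division: LUTs needed at this level
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (16, 64, 128):
        print(n, "inputs ->", lut_levels(n), "levels of 4-input LUTs")
```

With 4-input LUTs, 64 inputs fit in three levels and anything from 65 to 256 inputs needs four, which matches the "four levels" figure for about 128 inputs.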

The Altera parts can use a fast cascade which is much faster than a LUT,
but is serial much like a carry bit. Four levels of cascade should work
pretty well to eliminate a LUT and keep the speed up. You will need to
play with the synthesis controls a bit to get this working or instantiate
it.

In the Xilinx parts you can't use the Block RAM unless you can pipeline,
because the RAM is synchronous and requires a clock. But it would take many
more RAMs than are on a chip, so this won't work regardless. With no
minimization, this is a BIG hunk of logic at about 5,500 LUTs!!!

Take a good look at your architecture. You should be able to pipeline
this. One way is to delay the data in parallel paths by one register. It
only uses 128 FFs and may save you a lot of grief when you make changes
to the design and the timing breaks. Also keep in mind that logic can be
shared between bits.

Good luck!

Nicholas Weaver

Dec 1, 2001, 2:10:12 PM

In article <c2088d4a.01113...@posting.google.com>,

Definitely look at unrolling and specializing the CRC calculation.
CRCs are normally serial, but each bit of the CRC after N serial steps is
always the XOR of a set of bits of the current state and of the input.

Often the easiest way to do this is to write a little C program which
spits out the Xor equations for each output CRC bit and just implement
that. You can even have your C-program spit out HDL and just cut &
paste it.

Each bit of the output will be dependent on approximately 1/2 the input
bits, so it will mostly be on the order of ~64 bit XORs (some slightly
more). As a balanced tree of 4-LUTs, this is 4 levels of logic. A
bit much to run at 80 MHz in most FPGAs, but may be possible if you
are careful on the packing. Add a pipeline stage (EASY, cheap,
etc) and it becomes nice and straightforward if you don't have
feedback between your 128b data blocks.
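
A sketch of the kind of equation generator described above, written in Python rather than C. It assumes an MSB-first CRC-16 with polynomial 0x1021 and data bit D(width-1) entering first; the thread does not say which polynomial or bit order the real design uses, and all function names here are hypothetical. Each CRC bit is tracked as a set of terms, and XOR is a symmetric difference, so duplicate terms cancel on their own.

```python
def parallel_crc_equations(poly, n, width):
    """For each CRC bit, return the set of terms XORed together to form its next value.

    Terms are ("C", i) for current CRC bit i and ("D", j) for data bit j.
    """
    state = [frozenset([("C", i)]) for i in range(n)]
    for j in range(width - 1, -1, -1):            # D(width-1) is shifted in first
        fb = state[n - 1] ^ frozenset([("D", j)])  # feedback = MSB xor data bit
        nxt = []
        for i in range(n):
            term = state[i - 1] if i > 0 else frozenset()
            if (poly >> i) & 1:
                term = term ^ fb                   # symmetric difference: A xor A = 0
            nxt.append(term)
        state = nxt
    return state


def crc_serial(poly, n, crc, data, width):
    """Bit-serial reference: shift `width` data bits, MSB first, into an n-bit CRC."""
    mask = (1 << n) - 1
    for j in range(width - 1, -1, -1):
        fb = ((crc >> (n - 1)) ^ (data >> j)) & 1
        crc = (crc << 1) & mask
        if fb:
            crc ^= poly
    return crc


def eval_equations(eqs, crc, data):
    """Evaluate the generated XOR equations on concrete CRC and data values."""
    out = 0
    for i, terms in enumerate(eqs):
        bit = 0
        for kind, idx in terms:
            bit ^= ((crc if kind == "C" else data) >> idx) & 1
        out |= bit << i
    return out


def format_equation(i, terms):
    """Render one equation in a VHDL-like style, data terms first."""
    ordered = sorted(terms, key=lambda t: (t[0] != "D", -t[1]))
    return f"NewCRC({i}) := " + " xor ".join(f"{k}({idx})" for k, idx in ordered) + ";"


if __name__ == "__main__":
    eqs = parallel_crc_equations(0x1021, 16, 128)
    print(format_equation(0, eqs[0]))
    print([len(t) for t in eqs])   # number of XOR inputs per output bit
```

The printed text is only illustrative and is not claimed to match any particular tool's output; checking the equations against the bit-serial model, as `eval_equations` allows, is one way to catch a wrong bit order or polynomial before it reaches the bench.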

Ray Andraka

Dec 2, 2001, 2:33:31 PM

The CRC is a bit of a pain for speed because the feedback has to happen in a
single sample time, and it usually winds up consisting of 3 or 4 levels of
logic. When the clock rate is high (such as your 100 MHz 128 bit parallel),
you run into the limitations of the FPGA architecture. You can sometimes gain
a little bit by fixing non-optimal trees, but from what I've seen the synthesis
already does a decent job with building the tree. Floorplanning will help
tremendously in Xilinx; the Xilinx placer does wonderfully putting the level of
logic closest to the flip-flop with the associated flip-flop, but the level
feeding that usually gets placed much farther away than is necessary or
sensible. The floorplanning of the combinatorial stuff can be a royal pain,
since it is dependent on the naming of the synthesized logic or requires LUT
instantiation of the whole tree.

I've found the easiest way to deal with it is to double your word width so that
you can halve the clock. Doubling the word width typically adds one level of
logic, but you get twice the time to traverse the tree. There is a pretty good
CRC VHDL code generator available on the web (I forget the address at the
moment, but there is a link to it from the links page on my website) which will
generate RTL VHDL code for arbitrary word widths and polynomials (combinatorial
description only). Doubling the word width requires a second rank of registers
at the input to convert from single to double word at half rate.

Nicholas Weaver wrote:

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email r...@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759

rickman

Dec 2, 2001, 3:25:30 PM

Ray Andraka wrote:
>
> The CRC is a bit of a pain for speed because the feedback has to happen in a
> single sample time, and it usually winds up consisting of 3 or 4 levels of
> logic. When the clock rate is high (such as your 100 MHz 128 bit parallel),
> you run into the limitations of the FPGA architecture. You can sometimes gain
> a little bit by fixing non-optimal trees, but from what I've seen the synthesis
> already does a decent job with building the tree. Floorplanning will help
> tremendously in Xilinx; the Xilinx placer does wonderfully putting the level of
> logic closest to the flip-flop with the associated flip-flop, but the level
> feeding that usually gets placed much farther away than is necessary or
> sensible. The floorplanning of the combinatorial stuff can be a royal pain,
> since it is dependent on the naming of the synthesized logic or requires LUT
> instantiation of the whole tree.

I deal with the floorplanning not by instantiating, but by constructing
my logic to fit 4 input LUTs and then putting a "keep" on the
interconnecting signals. It does not always force a known name, but
normally it does. This is much easier than instantiation and is also
portable between different targets.

> I've found the easiest way to deal with it is to double your word width so that
> you can halve the clock. Doubling the word width typically adds one level of
> logic, but you get twice the time to traverse the tree. There is a pretty good
> CRC VHDL code generator available on the web (I forget the address at the
> moment, but there is a link to it from the links page on my website) which will
> generate RTL VHDL code for arbitrary word widths and polynomials (combinatorial
> description only). Doubling the word width requires a second rank of registers
> at the input to convert from single to double word at half rate.

Doubling the word width is a good idea! But I am pretty sure that most
designs will work at about 100 MHz in today's parts.

Be careful with the CRC code generator. There are several ways to use
CRC and I think it only supports one of them. We had an engineer use it
to design his equations only to find out a week later, on the bench,
that he had built the wrong type of CRC! In simulation the two ends
matched, so no error. CRC sounds like a nice simple concept, but in
practice there are many pitfalls in many areas.

Chua Kah Hean

Dec 3, 2001, 5:37:44 AM

Hi all gurus out there,

I got very curious after reading all the posts in this thread.

I know that we can use a lookup table method to implement CRC in
parallel. E.g. we can use a 256-entry table to calculate the CRC 8 bits
per cycle.

It seems to me that to use the same trick for a 128-bit input would
require a 2^128 element table, which must be a no-go.

Many people have talked about things like unrolling and pipelining the
input. Can anybody point me to a source where such approaches are
explained in greater detail so that I can appreciate what you all have
been driving at?

Thanks in advance.

TA TA
kahhean

Allan Herriman

Dec 3, 2001, 7:30:18 AM

On 3 Dec 2001 02:37:44 -0800, kah...@hotmail.com (Chua Kah Hean)
wrote:

Instead of a monster lookup table mimicking a bunch of XOR gates, just
use the XOR gates directly. Many of the terms cancel out: A xor A =
0, 0 xor A = A, etc. so the number of xor gates usually isn't
excessive and you avoid the exponential growth in table size.
(It's actually the depth of the xor gates, not the number of them,
that matters, because the depth determines the delay and hence the
clock rate.)

Take a look at the logic generated by some of the free online parallel
CRC generators:
http://www.easics.be/webtools/crctool
http://www.geocities.com/steve0192/vhdl.htm

The first one (crctool) will generate a function that turns an input
word and a feedback word into a new CRC value, which is the feedback
word for the next clock.

Here's the logic generated by crctool for one bit of a 16 bit CRC with
128 bit input word:

D := Data; -- the input word
C := CRC; -- the feedback word

NewCRC(0) := D(127) xor D(125) xor D(124) xor D(123) xor D(122) xor
D(121) xor D(120) xor D(111) xor D(110) xor D(109) xor
D(108) xor D(107) xor D(106) xor D(105) xor D(103) xor
D(101) xor D(99) xor D(97) xor D(96) xor D(95) xor
D(94) xor D(93) xor D(92) xor D(91) xor D(90) xor D(87) xor
D(86) xor D(83) xor D(82) xor D(81) xor D(80) xor D(79) xor
D(78) xor D(77) xor D(76) xor D(75) xor D(73) xor D(72) xor
D(71) xor D(69) xor D(68) xor D(67) xor D(66) xor D(65) xor
D(64) xor D(63) xor D(62) xor D(61) xor D(60) xor D(55) xor
D(54) xor D(53) xor D(52) xor D(51) xor D(50) xor D(49) xor
D(48) xor D(47) xor D(46) xor D(45) xor D(43) xor D(41) xor
D(40) xor D(39) xor D(38) xor D(37) xor D(36) xor D(35) xor
D(34) xor D(33) xor D(32) xor D(31) xor D(30) xor D(27) xor
D(26) xor D(25) xor D(24) xor D(23) xor D(22) xor D(21) xor
D(20) xor D(19) xor D(18) xor D(17) xor D(16) xor D(15) xor
D(13) xor D(12) xor D(11) xor D(10) xor D(9) xor D(8) xor
D(7) xor D(6) xor D(5) xor D(4) xor D(3) xor D(2) xor
D(1) xor D(0) xor C(8) xor C(9) xor C(10) xor C(11) xor
C(12) xor C(13) xor C(15);

(Switch to a fixed-width font.)

Here's the logic you'll end up with:

clock-----------------------+
                            |
        +-------+      +----------+
        | huge  |      | register |
input-->|  xor  |----->|d        q|--+-> CRC out
 (128)  | tree  | (16) |          |  |   (16)
        +-------+      +----------+  |
            ^                        |
            |                        |
            +------------------------+
                  feedback (16)

The "speed" is determined by the minimum clock period, which in this
case is limited by the number of logic levels in the xor tree - i.e.
the maximum delay between any flip flop output and any flip flop
input.
You can't do anything with this directly, as the feedback must happen
in a single clock cycle.

If you look more closely at the logic expression, you'll see that it
can be decomposed into the form (input xor feedback) where input is
the xor of a bunch of input bits, and feedback is the xor of a bunch
of feedback bits.

This leads to the following design:

clock--------------------------------------+
                                           |
        +-------+      +-------+      +----------+
        | medium|      | small |      | register |
input-->|  xor  |----->|  xor  |----->|d        q|--+-> CRC
 (128)  | tree  | (16) | tree  | (16) |          |  |   out
        +-------+      +-------+      +----------+  |  (16)
                           ^                        |
                           |                        |
                           +------------------------+
                                 feedback (16)

This isn't any faster than the first attempt, but notice that the
"medium xor tree" is not in the feedback path. This means it can be
pipelined - we can put flip flops in the logic so that the calculation
is performed over several clock cycles. The logic depth between any
flip flop output and any flip flop input is reduced - we can have a
faster clock.

This is shown here:

clock-----------------------+-----------------------------
                            |
        +-------+      +----------+      +-------+      +-
        | medium|      | register |      | small |      |
input-->|  xor  |----->|d        q|----->|  xor  |----->|d
 (128)  | tree  | (16) |          | (16) | tree  | (16) |
        +-------+      +----------+      +-------+      +-
                                             ^
                                             |
                                             +------------
                                                   feedbac

(I pruned the right side to avoid line wrap, but you should get the
idea.)
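
The decomposition into an input tree and a feedback tree can be modelled in a few lines. This sketch assumes an MSB-first CRC-16 with polynomial 0x1021 and an initial value of 0xFFFF, neither of which comes from the thread, and the function names are hypothetical. It relies on the CRC step being linear over GF(2), so the next CRC equals (step of CRC with zero data) XOR (step of zero CRC with the data).

```python
POLY, N, W = 0x1021, 16, 128   # assumed CRC-16 polynomial, CRC width, bus width


def crc_step(crc, word):
    """Bit-serial reference for one W-bit word, MSB first."""
    for j in range(W - 1, -1, -1):
        fb = ((crc >> (N - 1)) ^ (word >> j)) & 1
        crc = (crc << 1) & ((1 << N) - 1)
        if fb:
            crc ^= POLY
    return crc


def input_tree(word):
    """The "medium xor tree": depends on the data word only, so it can be registered."""
    return crc_step(0, word)


def feedback_tree(crc, g):
    """The "small xor tree": 16 feedback bits combined with the 16 pre-reduced data bits."""
    return crc_step(crc, 0) ^ g


def run_flat(words, crc=0xFFFF):
    """One big XOR tree per clock, no pipelining."""
    for w in words:
        crc = crc_step(crc, w)
    return crc


def run_pipelined(words, crc=0xFFFF):
    """Same result with a pipeline register between the two trees (one clock of latency)."""
    g_reg = None                               # pipeline register, empty at reset
    for w in list(words) + [None]:             # one extra clock to flush the pipeline
        if g_reg is not None:
            crc = feedback_tree(crc, g_reg)    # uses the value registered last clock
        g_reg = input_tree(w) if w is not None else None
    return crc


if __name__ == "__main__":
    words = [0x0123456789ABCDEF0123456789ABCDEF, (1 << 128) - 1, 0x42]
    print(hex(run_flat(words)), hex(run_pipelined(words)))
```

In hardware the input tree could be split across more than one register stage in the same way; the model only shows that moving it behind a register changes latency, not the result.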

In theory the synthesis tools can do all this for you. E.g. you can
describe a serial CRC calculation, put it in a for loop to iterate
over the input word, tell it how many clock cycles to take, and the
synthesiser should spit out something equivalent to the above.
(I have used this approach with LFSRs with some success at these bit
rates.)

I could make a comment about the relative benefits of HDLs and
schematics for high speed design, but I don't want to ignite yet
another religious war.

Regards,
Allan.

rickman

Dec 3, 2001, 11:12:41 AM

I am not clear about how you generated this logic, but it does not match
the general problem. Even though there are only 16 bits in the CRC,
there should be 128 bits in the "feedback" register as well as in the
input. This means that there would be about the same number of feedback
signals to the "small" XOR tree as there are input signals to the medium
tree. So pipelining will improve your complexity roughly by a factor of
2, but not by as much as your analysis above indicates. This of course
does not halve the number of logic levels; it saves only about half a
LUT level when using 4 input LUTs.

Try this with a very simple one like X43. You start with 43 bits in the
register and have to add one bit for every extra bit in the input word.
If you have 16 bits in at one time, you need a 58 bit feedback word.

Hmmm... does that mean that there should be 128 + C - 1 bits in the
register, where C is the size of your CRC? I don't remember that being
the case.

Allan Herriman

Dec 3, 2001, 10:55:43 PM

On Mon, 03 Dec 2001 11:12:41 -0500, rickman <spamgo...@yahoo.com>
wrote:

Are you sure about this, Rick?

I checked with other engineers here, and I checked a design that's
happily generating CRCs in the field, and it all points to a CRC-n
needing exactly 'n' bits of feedback regardless of the input word
width.

Perhaps you are trying to solve a different problem?

Regards,
Allan.

rickman

Dec 4, 2001, 12:06:55 AM

Allan Herriman wrote:
> >I am not clear about how you generated this logic, but it does not match
> >the general problem. Even though there are only 16 bits in the CRC,
> >there should be 128 bits in the "feedback" register as well as in the
> >input.
>
> Are you sure about this, Rick?
>
> I checked with other engineers here, and I checked a design that's
> happily generating CRCs in the field, and it all points to a CRC-n
> needing exactly 'n' bits of feedback regardless of the input word
> width.
>
> Perhaps you are trying to solve a different problem?
>
> Regards,
> Allan.

Or perhaps I am not remembering it correctly. I worked on this a year
ago and helped another engineer with a 128 bit version while I worked on
a 32 bit version. I may remember the details wrong. If the feedback is
only n bits, then getting the speed should be a slam dunk as long as you
can use pipelining for the inputs.

Allan Herriman

Dec 4, 2001, 12:38:42 AM

On Tue, 04 Dec 2001 00:06:55 -0500, rickman <spamgo...@yahoo.com>
wrote:

>Allan Herriman wrote:
>> >I am not clear about how you generated this logic, but it does not match
>> >the general problem. Even though there are only 16 bits in the CRC,
>> >there should be 128 bits in the "feedback" register as well as in the
>> >input.
>>
>> Are you sure about this, Rick?
>>
>> I checked with other engineers here, and I checked a design that's
>> happily generating CRCs in the field, and it all points to a CRC-n
>> needing exactly 'n' bits of feedback regardless of the input word
>> width.
>>
>> Perhaps you are trying to solve a different problem?
>>
>> Regards,
>> Allan.
>
>Or perhaps I am not remembering it correctly. I worked on this a year
>ago and helped another engineer with a 128 bit version while I worked on
>a 32 bit version. I may remember the details wrong. If the feedback is
>only n bits, then getting the speed should be a slam dunk as long as you
>can use pipelining for the inputs.

I make it that the clock rate is more-or-less determined solely by the
feedback, i.e. by the CRC order ('n'). For an 'm' bit input word, the
throughput is proportional to 'm'. Just make the input bus wider, and
you get more bits per second, without limit!

You don't get something for nothing, though, as the number of
pipeline stages on the input bus is something like O(log m), and the
number of flip flops is something like O(m log m).

Bye,
Allan.

Chua Kah Hean

Dec 4, 2001, 1:52:39 AM

Hi all gurus out there,

First of all, a million thanks to Allan for the write-up. It all
makes sense now.

Anyway, I did some binary long division manually (word by word rather
than bit by bit) to have a feel of how the xor logic is generated. It
becomes clear to me that whatever the input word length, the number of
bits that have to be "carried over" to the next word is only C=number
of CRC bits. And these are the feedback bits to the small xor tree
in Allan's diagram.
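
The word-by-word long division can be reproduced with integer arithmetic, treating bit i of an integer as the coefficient of x^i. This is a sketch under assumptions: the generator x^16 + x^12 + x^5 + 1 is chosen only as an example, the remainder starts at zero, and `polymod` and `crc_by_words` are hypothetical names. The value carried from one word to the next is the only state, and it never needs more than 16 bits whatever the word width.

```python
G = 0x11021   # assumed generator: x^16 + x^12 + x^5 + 1
N = 16        # number of CRC bits


def polymod(a, g):
    """Remainder of a(x) divided by g(x) over GF(2)."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)   # cancel the current leading term
    return a


def crc_by_words(msg, total_bits, width, g=G, n=N):
    """Divide word by word, carrying only the n-bit remainder between words."""
    mask = (1 << width) - 1
    r = 0
    carried = []
    for k in range(total_bits // width):
        word = (msg >> (total_bits - width * (k + 1))) & mask   # most significant word first
        r = polymod((r << width) ^ (word << n), g)              # r*x^width + word*x^n, mod g
        carried.append(r)
    return r, carried


if __name__ == "__main__":
    msg = int.from_bytes(bytes(range(1, 33)), "big")   # an arbitrary 256-bit message
    whole = polymod(msg << N, G)                        # one long division over the whole message
    for width in (8, 16, 128):
        r, carried = crc_by_words(msg, 256, width)
        print(width, hex(r), r == whole, max(carried) < (1 << N))
```

The same remainder comes out for every word width, which is the observation made above: widening the input only grows the part of the logic that depends on the data.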

If we can pipeline the input word, increasing the input word length
will not have any impact on the circuit speed. This looks almost
magical to me. :-) But so far I am convinced it works.

By the way, is there any web site/book which has a list of such
interesting digital design problems? I love brain teasers like this
(although I am lousy at solving them).

Thanks.

TA TA
kahhean