I have written a utility that generates optimized ANSI C code for CRC
calculation (both bit-by-bit and table-driven) for CRC sizes 1-64 bits
and for arbitrary polynomials.
Here is a link in case someone finds a use for it:
http://mcgougan.se/universal_crc
Best regards,
Danjel McGougan
Danjel,
Great! This helps a lot when folk ask: how do I calculate the crc for
the MegaCyber 9900 tape drive which has this polynomial ...
The resulting code is clear and simple, though perhaps not fully
optimized. I just tried it, and found that the zlib crc is about
three times as fast as the code produced by your program. But the
zlib code is quite a bit longer and the tables larger.
Mark
You might want to compare your code's performance with an optimized
implementation from Intel, and maybe use it in your generator.
http://www.c10n.info/archives/383
Sachin Garg [India]
www.sachingarg.com | www.c10n.info
Thanks for the feedback!
Yeah, I probably shouldn't have called it "optimized" code (even
though a table-driven approach might be considered an optimized
approach by some). If I get the time I will try to expand the utility
to optionally produce faster (but more complex) code.
BR,
Danjel
Thanks! I will check it out.
BR,
Danjel
Hi again!
After feedback I've updated the utility to also include a more
advanced table-driven algorithm. The algorithm used is a
generalization of the crc32 code in zlib (I think the original zlib
patch was from Rodney Brown in 2002) and uses multiple tables to break
up some dependencies between lookups. This increases efficiency of the
code for superscalar cores (more instructions per cycle can be
executed).
The new algorithm comes in two flavors: byte processing and word
processing. When doing word processing (memory is read 32 bits at a
time) the XOR operation can be efficiently combined into one
instruction and further performance improvements can be seen. Note
that when using word-at-a-time mode a preprocessor flag needs to be
set correctly when compiling the generated code to define the
endianness of the platform compiled for.
The new algorithms are named tabi and tabiw respectively in the
utility.
Here are some performance numbers from the home page (http://
mcgougan.se/universal_crc). The numbers are for an AMD Athlon 64 using
gcc 3.4.4 with flags "-O3 -march=athlon64".
bits algo cycles/byte
---- ---- -----------
5 bit 34.00
8 bit 33.97
13 bit 57.00
16 bit 33.00
23 bit 58.00
32 bit 34.00
49 bit 85.00
64 bit 86.00
5 tab 6.56
8 tab 6.56
13 tab 8.00
16 tab 8.00
23 tab 7.00
32 tab 7.00
49 tab 11.00
64 tab 11.01
5 tabi 2.19
8 tabi 2.19
13 tabi 2.94
16 tabi 2.47
23 tabi 3.25
32 tabi 3.25
49 tabi 10.53
64 tabi 10.53
5 tabiw 2.01
8 tabiw 2.01
13 tabiw 2.31
16 tabiw 2.31
23 tabiw 2.10
32 tabiw 2.10
49 tabiw 6.21
64 tabiw 6.21
Note that 2.1 cycles per byte for 32 bit CRCs is pretty much on par
with the recently published code by Intel, so I don't think I'll add
the Intel algorithm unless someone can give me a good reason.
BR,
Danjel