Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Python 32-bit CRC module needed?

956 views
Skip to first unread message

James C. Ahlstrom

unread,
Feb 26, 1997, 3:00:00 AM2/26/97
to

The other day I ran across a 32-bit CRC function which seems to be
public domain and ports easily to Unix and Windows. I used it to
calculate CRC checksums of files to see if they needed to be
transfered via ftp to a foreign system.

I recall from previous c.l.p discussions that some people use hash()
to calculate checksums for strings. Perhaps this CRC function would
be a useful C-language module to add to Python. It could find CRC's
of both files and strings.

Does Python have such a module? Is this generally useful? If so, I
volunteer to port the CRC function to a Python C module, test to be
sure identical checksums are obtained on Unix and Windows, and to
write documentation. Attached is the head of the file. Has anyone
heard of this before?

Jim

--- snip ---

/**********************************************************************\
|* *|
|* Demonstration program to compute the 32-bit CRC used as the frame *|
|* check sequence in ADCCP (ANSI X3.66, also known as FIPS PUB 71 *|
|* and FED-STD-1003, the U.S. versions of CCITT's X.25 link-level *|
|* protocol). The 32-bit FCS was added via the Federal Register, *|
|* 1 June 1982, p.23798. I presume but don't know for certain that *|
|* this polynomial is or will be included in CCITT V.41, which *|
|* defines the 16-bit CRC (often called CRC-CCITT) polynomial. FIPS *|
|* PUB 78 says that the 32-bit FCS reduces otherwise undetected *|
|* errors by a factor of 10^-5 over 16-bit FCS. *|
|* *|
\**********************************************************************/

/* Need an unsigned type capable of holding 32 bits; */
typedef unsigned long int UNS_32_BITS;

/*
* Copyright (C) 1986 Gary S. Brown. You may use this program, or
* code or tables extracted from it, as desired without restriction.
*/

Guido van Rossum

unread,
Feb 26, 1997, 3:00:00 AM2/26/97
to

> The other day I ran across a 32-bit CRC function which seems to be
> public domain and ports easily to Unix and Windows. I used it to
> calculate CRC checksums of files to see if they needed to be
> transfered via ftp to a foreign system.
>
> I recall from previous c.l.p discussions that some people use hash()
> to calculate checksums for strings. Perhaps this CRC function would
> be a useful C-language module to add to Python. It could find CRC's
> of both files and strings.
>
> Does Python have such a module? Is this generally useful? If so, I
> volunteer to port the CRC function to a Python C module, test to be
> sure identical checksums are obtained on Unix and Windows, and to
> write documentation. Attached is the head of the file. Has anyone
> heard of this before?

I tend to use the md5 module for this purpose -- it's a much stronger
checksum than CRC-32. On the other hand, various Internet formats
seem to use a specific CRC-32 implementation. If the code you found
matches those requirements, it may be very useful.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Fredrik Lundh

unread,
Feb 27, 1997, 3:00:00 AM2/27/97
to

> Does Python have such a module?

There's one (undocumented) in PIL which calculates CRC-32 according to
ISO 3307. Haven't the slightest if this the same as ANSI X3.66...

Cheers /F

PS. Don't forget that the size of a long can be either 32 or 64-bits.
PIL's wrapper returns the CRC as a 2-tuple of 16-bit fields...

Tim Peters

unread,
Feb 27, 1997, 3:00:00 AM2/27/97
to

> [jim ahlstrom]
> The other day I ran across a 32-bit CRC function ...
> Is this generally useful?

For someone doing work with one of the stds that uses the function, sure! For
plain old string hashing, probably less useful than you'd hope -- CRC
polynomials were explicitly & excruciatingly crafted to catch the most
probable kinds of errors on physical communication channels, and some of those
(like bursts of inverted bits, or a block of spurious 0 or 1 bits at the
start) aren't of much interest to plain old hashing. Or, in other words,
polynomials most suitable for CRC apps aren't those most suitable for hashing,
although the former should do "OK" on the latter (e.g., I believe all the CRCs
in common use deliver different results for any two bitstrings that differ in
a single bit, and for "most" that differ in exactly two bit positions).

One thing to look out for is that "unsigned long" isn't 32 bits on all the
machines graced by Python's presence.

although-it-is-on-all-the-ones-that-matter<cackle>-ly y'rs - tim

Tim Peters tim...@msn.com, t...@dragonsys.com
not speaking for Dragon Systems Inc.

Andrew Kuchling

unread,
Feb 27, 1997, 3:00:00 AM2/27/97
to

The zlib module includes 3 CRCs: crc32, adler32, and pgp24.
According to the docs, "An Adler-32 checksum is almost as reliable as
a CRC32 but can be computed much more quickly." The pgp24 CRC is
identical to the one used in PGP's ASCII armoring module. The code's
at http://people.magnet.com/~amk/files/py/zlib105.tgz .


Andrew Kuchling
a...@magnet.com
http://people.magnet.com/%7Eamk/
Save the Gutenberg Project! http://www.promo.net/pg/nl/pgny_nov96.html

William Lewis

unread,
Mar 5, 1997, 3:00:00 AM3/5/97
to

In article <5f1t4h$9...@intrt8.interet.com>,

James C. Ahlstrom <j...@interet.com> wrote:
>Does Python have such a module? Is this generally useful? If so, I
>volunteer to port the CRC function to a Python C module, [...]

The PIL library has one CRC32 implementation. I've written a module
that does truly general CRCs (for any width, polynomial, and bit order
you care to name), and it'll use the fast table driven technique if
the width is a multiple of eight bits. However, it lacks polish and a
bit of documentation, so I haven't released it. My Real Job keeps
getting in the way. If you (or anyone else) would be willing to help
me knock a few more rough edges off and get it out the door, I'd be
delighted. I've noticed that the longer I sit on a piece of code the
more effort is duplicated elsewhere...

(This is all an offshoot of my earlier attempt to generalize binascii
so it would be useful for formats other than the handful it was designed
for. Someday I'll finish that one too. And then, the world! Muahahaha...)

--
Wim Lewis / wiml@{omnigroup.com|hhhh.org} / I do not speak for Omni
PGP 0x27F772C1: 0C 0D 10 D5 FC 73 D1 35 26 46 42 9E DC 6E 0A 88

James C. Ahlstrom

unread,
Mar 11, 1997, 3:00:00 AM3/11/97
to

In article <5fl02p$ku7$1...@reason.omnigroup.com>,

William Lewis <wi...@omnigroup.com> wrote:
>In article <5f1t4h$9...@intrt8.interet.com>,
>James C. Ahlstrom <j...@interet.com> wrote:
>>Does Python have such a module? Is this generally useful? If so, I
>>volunteer to port the CRC function to a Python C module, [...]
>
>The PIL library has one CRC32 implementation. I've written a module
>that does truly general CRCs (for any width, polynomial, and bit order

Actually, the md5 module seems to do the job better. I have
tentatively decided a CRC module isn't needed, and if improvements
are needed they should be made to md5.

Jim

Fredrik Lundh

unread,
Mar 11, 1997, 3:00:00 AM3/11/97
to

> Actually, the md5 module seems to do the job better. I have
> tentatively decided a CRC module isn't needed, and if improvements
> are needed they should be made to md5.

Well, I guess MD5 doesn't really help those of us that need to
generate e.g. ISO 3307 checksums, does it? But except for that,
MD5 is pretty ok ;-)

Cheers /F (http://hem1.passagen.se/eff)


PS. Is MD5 equally good for all kinds of data, or is it better tuned
for text? Just wondering.

PPS. Possible improvement to MD5: add a hexdigest method (almost
every place I've seen it used have immediately converted the result to
a hex string)...

James C. Ahlstrom

unread,
Mar 12, 1997, 3:00:00 AM3/12/97
to

In article <970311165...@arnold.image.ivab.se>,

Fredrik Lundh <fredri...@ivab.se> wrote:
>PS. Is MD5 equally good for all kinds of data, or is it better tuned
>for text? Just wondering.

I don't know. But md5 was designed for getting a data fingerprint. And
CRCs were designed for detecting transmission line errors (as someone
posted before).

Jim

Andrew Kuchling

unread,
Mar 12, 1997, 3:00:00 AM3/12/97
to

Fredrik Lundh asked:

> PS. Is MD5 equally good for all kinds of data, or is it better tuned
> for text? Just wondering.

It doesn't matter what the input data looks like. MD5 starts
with a fixed initial value and handles the input in 512-bit chunks,
combining the current digest state with the input in a complex
nonlinear function.

MD5 is a hash function and not just a checksum, so it's
supposed to be difficult to find two strings that produce the same
hash value; that's not the case for CRCs, and it's an important
property for digital signatures. However, note that MD5 is starting
to look kind of shaky; Hans Dobbertin has shown that it's possible to
choose an initial value such that it's easy to find two messages that
produce the same hash value(see
http://www-cse.ucsd.edu/users/bsy/dobbertin.ps).

While this attack doesn't yet apply to existing MD5 systems
where the IV is fixed (as in Python's md5 module), the attack will
probably be extended sometime in the next few years. So, if collision
resistance is important and you don't have to be compatible with an
installed base, avoid MD5 and use SHA instead. If you don't care
about collision resistance, use MD5 since it's faster than SHA.

Siggy Brentrup

unread,
Mar 12, 1997, 3:00:00 AM3/12/97
to

Fredrik Lundh <Fredri...@ivab.se> writes:

> PS. Is MD5 equally good for all kinds of data, or is it better tuned
> for text? Just wondering.

It's widely used for binary files, e.g. in the Debian/GNU Linux distribution.

> PPS. Possible improvement to MD5: add a hexdigest method (almost
> every place I've seen it used have immediately converted the result to
> a hex string)...

It's trivial to derive a class HexMD5(md5).

Once we're at MD5 sums, here's my problem:

I'm checking MD5 sums on the fly while downloading files with ftplib. With my
shaky connection I have to restart on partially downloaded files quite often,
requiring to recompute the MD5 sum for the partial file before downloading the
remaining part.

Can someone more knowledgable than me comment on saving the state of the
MD5sum FSM. I understand that it requires modifications to the C sources.
Until now I refrained from that since I don't fully understand security
implications.

Thanks
Siggy

--
Siggy Brentrup <b...@debian.org> aka: b...@uni-muenster.de
*** PGP public key available from keyservers ***
PGP fingerprint = C8 95 66 8C 75 7E 10 A2 05 61 C7 7F 05 B6 A4 DF

Reimer Behrends

unread,
Mar 16, 1997, 3:00:00 AM3/16/97
to

Another late reply ...

Tim Peters (tim...@msn.com) wrote:
[CRCs.]
: For someone doing work with one of the stds that uses the function, sure! For

: plain old string hashing, probably less useful than you'd hope -- CRC
: polynomials were explicitly & excruciatingly crafted to catch the most
: probable kinds of errors on physical communication channels, and some of those
: (like bursts of inverted bits, or a block of spurious 0 or 1 bits at the
: start) aren't of much interest to plain old hashing. Or, in other words,
: polynomials most suitable for CRC apps aren't those most suitable for hashing,
: although the former should do "OK" on the latter (e.g., I believe all the CRCs
: in common use deliver different results for any two bitstrings that differ in
: a single bit, and for "most" that differ in exactly two bit positions).

Well. This is not exactly correct. While CRCs were indeed created to
catch transmission errors (although their popularity is probably due to
the simplicity of hardware implementations of the algorithm), they are
in fact usually excellent hash functions, much better and usually more
efficient than all the handcrafted ones.

For the following, I'll assume knowledge of the article I posted several
weeks ago on how to find polynomials to generate crc values.

First, concerning transmission errors. Let's assume that b_k, ..., b_0
is a sequence of bits, and b(x) = sum(i=0 to k: b_i*x^i) the associated
polynomial over the field {0, 1}. Let r(x) be the irreducible polynomial
used to create GF(2^n) and v(x) = sum(i=0 to n-1: v_i*x^i) the initial
value. Then for each bit v is recomputed as follows:

v(x) := (v(x)*x + b_i*x^n) mod r(x),

with i going from n to 0. Therefore, the resulting crc value can be
described by:

crc(x) = (v(x)*x^(k+1) + b(x)*x^n) mod r(x).

Assuming a single bit is flipped during transmission, say b_s, resulting
in a different value crc'(x), we get:

crc(x) - crc'(x) = (x^s * x^n) mod r(x)

However, as r(x) is irreducible, this value can never be zero, therefore
crc(x) <> crc'(x). So all one-bit errors are caught.

Assuming a two-bit error in bit positions b_s and b_t, we get (assuming
without loss of generality that t > s):

crc(x) - crc'(x) = (x^n*(x^s+x^t)) mod r(x)
= (x^n * x^s (1 + x^(t-s))) mod r(x)

Clearly, this value is zero iff r(x) divides one of the factors. r(x) is
irreduible, therefore it can't divide either x^n or x^s. Second, r(x) is
constructed to give: x^j mod r(x) = x implies j is a multiple of 2^n-1.
Therefore, the value is zero iff the bit errors occur in positions
exactly 2^n-1 apart. It follows that for blocks shorter than that, all
two-bit errors are detected, meaning also that one-bit errors can be
repaired (by simply trying repeatedly which bit to change back until the
crc value is correct).

So much for transmission errors. However, as I indicated above, crc
values usually make very good hash functions, too. Explanation:

Assume that r(x) = sum(i=0 to n: r_i*x^i) is the generating polynomial
for the crc function. Let C be the n-by-n matrix over the field {0, 1}
defined as follows:

C[i, i+1] = 1 for 0 <= i < n
C[i, 0] = r_(n-i-1)
C[i, j] = 0 for all other values of i, j.

Then, if we take v = (v_(n-1), ..., v_0) be the initial value and a word
b = (b_(n-1), ..., b_0) to be encoded, we can simply use (note that
addition in {0, 1} means exclusive-or):

v := C * (v + (b_i, 0, ...., 0) )

Or, defining M := C^n, we can xor the entire bit vector b onto v and
then perform n steps by a single matrix multiplication:

v := M * (v + b)

Assume now that we have a sequence w_k, ..., w_1 of n-bit vectors
(words) to encrypt. The resulting value is:

M^k*v + sum(i=1 to k: M^i*w_i).

Given how C is constructed, we have C*v = 0 iff v = 0. Hence C is
invertible, and therefore also M, M^2, ..., M^k. Therefore, no such
matrix has a row of zeroes, or two or more linearly dependent rows.
This implies (actually, I'm taking a bit of a shortcut here, the actual
considerations are a bit more complicated) that for most crc polynomials
basically each bit of the output is the result of xoring a fairly random
selection of about 50% of the input bits together (plus the constant bit
derived from M^k*v). In other words, changing a single bit in the input
will on average change half the bits in the output. As two randomly
selected n-bit values will on average have n/2 of their bits in common,
that's about the best we can achieve. Hence my claim that crc values are
among the best hash functions available.

Note that despite this, the hash function is cryptographically weak. It
is fairly simple, given the above formula, to reverse engineer lots of
patterns that result in the same crc value, as long as there are more
than n bits that are encoded.

: One thing to look out for is that "unsigned long" isn't 32 bits on all the

: machines graced by Python's presence.

It is fairly trivial to supply irreducible polynomials for any wordsize
desired (especially as I posted the necessary algorithms a while ago).
Besides, 32 bits are usually more than enough for almost any application
I can think of.

Reimer Behrends


Tim Peters

unread,
Mar 18, 1997, 3:00:00 AM3/18/97
to

>> [tim]
>> ... CRC polynomials were explicitly ... crafted to catch the most
>> probable kinds of errors on physical communication channels
>> ... polynomials most suitable for CRC apps aren't those most

>> suitable for hashing, although the former should do "OK" on the
>> latter ...

> [reimer behrends]
> ... [another excellent exposition, of the theory behind CRC, strangely
> edited by tim for a reason to be clear later]
> ... Let r(x) be the irreducible polynomial ...
> ... However, as r(x) is irreducible, ...
> ... r(x) is irreduible, therefore ...

I have to vanish for about a week & am too pressed for time now. May be able
to comment more later. First, thanks for the writeup! Nobody explains this
stuff better than you.

Now for the bad news <wink>: the polys used by most communications-inspired
CRCs are not irreducible -- so the parts of the theory above that rely on
irreducibility are moot <private wink>.

You can tell they're reducible by counting the bits. E.g., the Ethernet
32-bit CRC poly is 0x04C11DB7, with 14 bits set. Even parity implies it's
divisible by x+1, so it's reducible. Similarly most communications-inspired
CRCs are multiples of x+1.

They do this on purpose: if r(x) = (x+1)*r2(x), then all transmissions with
an odd number of bit errors are caught (not just all those with single-bit
errors), because no bitstring with odd parity (i.e., no error vector with an
odd number of 1 bits) is divisible by x+1. And not divisible by x+1 implies
not divisible by any multiple of x+1, so in particular not divisible by r(x).

Note that Knuth talks about at least one approach to crafting polys
specifically for hashing; the CRC algorithms don't care at all *which* poly
they use, so I stick to my claim that if you're specifically interested in
hashing, you're better off building a poly *for* hashing.

>> : One thing to look out for is that "unsigned long" isn't 32 bits on
>> all the machines graced by Python's presence.

> It is fairly trivial to supply irreducible polynomials for any wordsize
> desired (especially as I posted the necessary algorithms a while ago).
> Besides, 32 bits are usually more than enough for almost any application
> I can think of.

We're talking past each other there: the original issue was how to *code* the
algorithms in the Python implementation. Most CRC code out there written in C
implicitly assumes that "unsigned long" is a 32-bit thing, and may fail to
work correctly on machines where that's not true. Python's implementation
tries very hard to use standard C, because there's too many machines and
compilers out there to make piles of platform-specific #ifdef'ing tolerable.
I.e., wasn't commenting on the polys or the method at all there, just warning
Jim about an implementation subtlety.

divisible-by-everything-except-x+1-ly y'rs - tim

t...@dragonsys.com

unread,
Mar 19, 1997, 3:00:00 AM3/19/97
to

This remote access stuff is funky! Apologies if
this shows up 18 times -- or, like prior attempts,
none <wink>]

> [tim]
> ... E.g., the Ethernet 32-bit CRC poly is 0x04C11DB7,


> with 14 bits set. Even parity implies it's divisible
> by x+1, so it's reducible. Similarly most communications-
> inspired CRCs are multiples of x+1.

So how's *that* for the worst example ever posted?!
The Ethernet poly is actually an important exception.
In my haste to get out the door, I forgot that CRC polys
are conventionally reported as hex numbers with the
leading bit implicit. So if you count an even # of bits
in the hex #, the poly actually has an odd # of bits, &
vice versa. Don't blame me -- I just pretend to live
on this planet <wink>.

Anyway, most communications-inspired CRC polys (although
not Ethernet's!), and especially the older ones, are
divisible by x+1, & for the reason given. E.g., and
spelling out the polys in full to preclude similar confusion:

CRC-12 x^12 + x^11 + x^3 + x^2 + x + 1
CRC-16 x^16 + x^15 + x^2 + 1
CRC-CCITT x^16 + x^12 + x^5 + 1

blunderingly y'rs - tim

Tim Peters tim...@msn.com, t...@dragonsys.com
not speaking for Dragon Systems Inc.

-------------------==== Posted via Deja News ====-----------------------
http://www.dejanews.com/ Search, Read, Post to Usenet

michael shiplett

unread,
Mar 20, 1997, 3:00:00 AM3/20/97
to

Regarding the topic of CRCs, the current issue of ``Dr. Dobb's
Journel'' has an article on them.

Were I Tim, the next line would be...
running-light-without-overbyte-ly y'rs

michael


Reimer Behrends

unread,
Apr 4, 1997, 3:00:00 AM4/4/97
to

Tim Peters (tim...@msn.com) wrote:
[CRC polynomials.]
: I have to vanish for about a week & am too pressed for time now. May be able
: to comment more later. First, thanks for the writeup! Nobody explains this
: stuff better than you.

Well, what can I say? I seem to be even pressed more for time than you,
considering that it's been more than two weeks. :-)

: Now for the bad news <wink>: the polys used by most communications-inspired

: CRCs are not irreducible -- so the parts of the theory above that rely on
: irreducibility are moot <private wink>.

<grin> Oh, I don't claim to be an expert in communication theory. It's
just that (1) with irreducible polynomials the math is far easier (I'm
that lazy, yes), and (2) I'm not really that interested in CRCs for
anything but as a tool to implement certain library functions (esp.
hashing and cryptography).

: You can tell they're reducible by counting the bits. E.g., the Ethernet

: 32-bit CRC poly is 0x04C11DB7, with 14 bits set. Even parity implies it's
: divisible by x+1, so it's reducible. Similarly most communications-inspired

: CRCs are multiples of x+1.

As you already pointed out yourself, the polynomial is in fact
irreducible. However ...

: They do this on purpose: if r(x) = (x+1)*r2(x), then all transmissions with

: an odd number of bit errors are caught (not just all those with single-bit
: errors), because no bitstring with odd parity (i.e., no error vector with an
: odd number of 1 bits) is divisible by x+1. And not divisible by x+1 implies
: not divisible by any multiple of x+1, so in particular not divisible by r(x).

Yes, neat idea. Not that it would make much of a difference, as the
probability of an undetected error is only 1/2^n, anyway. At least for
hashing purposes.

: Note that Knuth talks about at least one approach to crafting polys

: specifically for hashing; the CRC algorithms don't care at all *which* poly
: they use, so I stick to my claim that if you're specifically interested in
: hashing, you're better off building a poly *for* hashing.

That's a given, anyway. You'd have to check whether it works, so you
don't get a degenerate case. However, let's study reducible polynomials
in more detail.

One shortcoming with a reducible polynomial r(x) = p(x)*q(x) is that the
value of x^k modulo r(x) doesn't cover all non-zero polynomials with
degree less than r(x). If it did, then 1, x, x^2, ... modulo r(x) would
form a cyclic group, and polynomials with addition and multiplication
would form a field, in contradiction to p(x)*q(x) modulo r(x) = 0.

However, that doesn't tell us anything about hashing. I took a few
shortcuts in my previous article out of sheer laziness, something
which I can remedy now. If you construct the matrices for computing
CRC values as I outlined before, they will still be invertible as
long as the coefficient of x^0 for the generating polynomial is 1
(trivial application of Gauss's algorithm). However, as you will find,
some polynomials like r(x) = x^n + 1 will give you _very_ regular
looking matrices, and are of little use for generating hash functions.
But this is not generally true for the reducible polynomials used
for communications (after all, even communications experts want some
security for the case where an even number of bits are wrong :-) ). I'd
have to do some more research to work out the details, though, and I
don't have the time right now (not to mention that the library is closed
at this time of day :-) ).

[Not every unsigned long is 32 bits in size.]
: We're talking past each other there: the original issue was how to *code* the

: algorithms in the Python implementation. Most CRC code out there written in C
: implicitly assumes that "unsigned long" is a 32-bit thing, and may fail to
: work correctly on machines where that's not true. Python's implementation
: tries very hard to use standard C, because there's too many machines and
: compilers out there to make piles of platform-specific #ifdef'ing tolerable.
: I.e., wasn't commenting on the polys or the method at all there, just warning
: Jim about an implementation subtlety.

Point taken and agreed. :-)

Btw, it would be nice to have a CRC function in any case, given that
it would be at least 1000 times as fast as MD5 (a deliberate design
decision for MD5 to make breaking it by exhaustive approaches harder).

[...]

Reimer Behrends


0 new messages