
How should/could I combine CRCs?


Daryle Walker
Sep 17, 2008, 8:55:31 PM

I'm working on a CRC programming object (again), and I'm wondering how
to combine two CRCs together, knowing their initial states. I've
glanced at some CRC code in the zlib project, and thought on how I
could adapt it, but I don't know how it really works, which means that
I'm stuck if problems occur. So I'm trying to find my own solution.
Can you guys check if I'm on the right track?

The CRC checksum method depends on treating a stream of message bits
as a polynomial with modulo-2 coefficients, with each subsequent bit
at the next lower order. So a bit sequence Sn with length Ln is
treated as a degree Ln - 1 modulo-2 polynomial. A CRC is formed from
the remainder of dividing the message polynomial by a key polynomial.
Such keys are ranked by their degree. A degree-32 modulo-2 polynomial
has 33 terms, which can be stored in a 32-bit register because the
highest-order bit must be set and so can be implicit and not stored.
After dividing by the key, the remainder has at most 32 terms (degree
at most 31) and can be stored in the same register type. The message
is treated as the dividend and is fed into a register at its
lower-order end, starting with the first encountered bit (which is
treated as the message's high-order bit). The dividend/remainder
register is initially zero, and so requires 32 steps before the first
message bit is processed through the division. Therefore, many
implementations start off with a non-zero initial remainder.
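
In rough C, the register just described looks something like this
(only a sketch; the key polynomial and the function names here are
example choices, not fixed parts of my design):

#include <stddef.h>
#include <stdint.h>

#define POLY 0x04C11DB7u   /* example 32-bit key, degree-32 term implicit */

/* One division step: the next message bit enters at the low-order end,
   and the key is subtracted (XORed) whenever a degree-32 term appears. */
uint32_t crc_feed_bit(uint32_t reg, int bit)
{
    int top = (reg >> 31) & 1;          /* coefficient about to reach degree 32 */
    reg = (reg << 1) | (uint32_t)bit;
    return top ? reg ^ POLY : reg;
}

uint32_t crc_of_message(const unsigned char *msg, size_t len)
{
    uint32_t reg = 0;                   /* zero initial remainder */
    for (size_t i = 0; i < len; i++)
        for (int b = 7; b >= 0; b--)    /* first encountered bit = high order */
            reg = crc_feed_bit(reg, (msg[i] >> b) & 1);
    return reg;   /* the x^d factor in the equations below would mean
                     pushing 32 more zero bits through before reading this */
}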

So we're finding:

Sn * x^d = Qn * G + Cn [1]

where Sn is the message bit-sequence polynomial, x is the polynomial
unknown, d is the degree of G, Qn is the quotient (which is ignored),
G is the key polynomial, and Cn is the remainder polynomial and the
CRC result. The "x^d" term makes sure all the bits of Sn move to the
top of the register and get analyzed.

This equation does not take an initial remainder into account. I'm
guessing that the equation would be modified to:

(In * x^Ln + Sn) * x^d = Qn * G + Cn [2]

where In is the initial remainder polynomial and Ln is the number of
bits in Sn. The following equations depend on my guess here on the
initial remainder's interpretation being correct, so please correct me
if I'm wrong here!

My problem is this: given the CRCs of two bit sequences S1 and S2,
find the CRC for the sequence Sc, which is S1 followed by S2. So:

(I1 * x^L1 + S1) * x^d = Q1 * G + C1 [3]
(I2 * x^L2 + S2) * x^d = Q2 * G + C2 [4]
(Ic * x^Lc + Sc) * x^d = Qc * G + Cc [5]

and we're looking for Cc in terms of as few of the other parameters of
the first two sequences as possible. Now:

Sc = S1 * x^L2 + S2 [6]
Lc = L1 + L2 [7]

giving:

(Ic * x^(L1 + L2) + (S1 * x^L2 + S2)) * x^d = Qc * G + Cc [8]

-> ((Ic * x^L1 + S1) * x^L2 + S2) * x^d = Qc * G + Cc [9]

-> (Ic * x^L1 + S1) * x^d * x^L2 + S2 * x^d = Qc * G + Cc [10]

Since I'm implementing this as S2's attributes being appended to the
object representing S1, I1 == Ic. Also, I'll replace "S2 * x^d"
with its equivalent from [4].

(Q1 * G + C1) * x^L2 + (Q2 * G + C2 - I2 * x^L2 * x^d) = Qc * G + Cc
[11]

Rearrange terms to:

(Q1 * x^L2 + Q2) * G + (C1 - I2 * x^d) * x^L2 + C2 = Qc * G + Cc
[12]

We're only concerned with the remainder after division by G, so all
the multiples of G can be ignored:

(C1 - I2 * x^d) * x^L2 + C2 == Cc (mod G) [13]

So Cc can be found by reducing the left side of [13] by dividing by
G. We only need to know the two sequences' running CRCs, the length
of the second sequence, the initial remainder of the second sequence,
and the degree of G. My CRC-object already performs long division by
G to create the initial CRCs, so I can just reuse that code!

We can restructure as:

-I2 * x^(d + L2) + C1 * x^L2 + C2 == Cc (mod G) [14]

Remember that I2, C1, and C2 are all bit polynomials of length d
(degree at most d - 1), since they're remainders. We are using
modulo-2 arithmetic, so addition _and_ subtraction can be implemented
with the exclusive-or (XOR) operation, and negation is the same as
identity. This helps arrange the bits for my existing feeding divider.
The bits of I2 lie above the C1 and C2 group, so they can be fed
first. The C1/C2 group is fed directly afterwards, but its composition
depends on the value of L2:

* If L2 is zero, then C1 and C2 bits coincide and their XOR'd
combination is fed to the divider.
* If L2 equals d, then the bits of C1 can be fed without change,
followed by the bits of C2 without change.
* If L2 exceeds d, then do the same as the previous case, but separate
the C1 and C2 feedings by L2 - d zero/false bits being fed in.
* If 0 < L2 < d, feed the unaffected high-order bits of C1, then the
XOR of the overlap between the low-order bits of C1 and the high-order
bits of C2, then the unaffected low-order bits of C2. The overlap is
d - L2 bits long.

Will this work? And what is Mr. Adler doing in "crc32.c" of the zlib
code? Are they equivalent? And how? (Also, note that you only have
to check that the two CRC objects use the same key polynomial; the
initial remainder polynomials can differ!)

Daryle Walker

Mark Adler
Sep 18, 2008, 1:51:01 AM

On Sep 17, 5:55 pm, Daryle Walker <dary...@gmail.com> wrote:
> I'm wondering how to combine two CRCs together, knowing their initial states.
...

> And what is Mr. Adler doing in "crc32.c" of the zlib code?

That's DR. Adler to you buddy! :-)

crc32_combine() in zlib is based on two key tricks. For what follows,
we set aside the fact that the standard 32-bit CRC is pre and post-
conditioned. We can deal with that later. Assume for now a CRC that
has no such conditioning, and so starts with the register filled with
zeros.

Trick #1: CRCs are linear. So if you have stream X and stream Y of
the same length and exclusive-or the two streams bit-by-bit to get Z,
i.e. Z = X ^ Y (using the C notation for exclusive-or), then CRC(Z) =
CRC(X) ^ CRC(Y). For the problem at hand we have two streams A and B
of differing length that we want to concatenate into stream Z. What
we have available are CRC(A) and CRC(B). What we want is a quick way
to compute CRC(Z). The trick is to construct X = A concatenated with
length(B) zero bits, and Y = length(A) zero bits concatenated with B.
So if we represent concatenation simply by juxtaposition of the
symbols, X = A0, Y = 0B, then X^Y = Z = AB. Then we have CRC(Z) =
CRC(A0) ^ CRC(0B).

Now we need to know CRC(A0) and CRC(0B). CRC(0B) is easy. If we feed
a bunch of zeros to the CRC machine starting with zero, the register
is still filled with zeros. So it's as if we did nothing at all.
Therefore CRC(0B) = CRC(B).

CRC(A0) requires more work however. Taking a non-zero CRC and feeding
zeros to the CRC machine doesn't leave it alone. Every zero changes
the register contents. So to get CRC(A0), we need to set the register
to CRC(A), and then run length(B) zeros through it. Then we can
exclusive-or the result of that with CRC(B) = CRC(0B), and we get what
we want, which is CRC(Z) = CRC(AB). Voila!
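
In code, that straightforward version is just this (a sketch with an
unconditioned, reflected CRC register and names of my own choosing,
not the zlib routine):

#include <stddef.h>
#include <stdint.h>

/* One zero bit through an unconditioned, reflected CRC-32 register
   (example polynomial 0xEDB88320). */
static uint32_t zero_bit(uint32_t reg)
{
    return (reg >> 1) ^ ((reg & 1) ? 0xEDB88320u : 0);
}

/* CRC(AB) from CRC(A), CRC(B), and length(B), the slow way: run
   length(B) zero bytes through a register holding CRC(A) to get
   CRC(A0), then exclusive-or with CRC(B) = CRC(0B).  O(length(B)). */
uint32_t crc_combine_naive(uint32_t crc_a, uint32_t crc_b, size_t len_b)
{
    while (len_b--)
        for (int i = 0; i < 8; i++)
            crc_a = zero_bit(crc_a);
    return crc_a ^ crc_b;
}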

Well, actually the voila is premature. I wasn't at all satisfied with
that answer. I didn't want a calculation that took a time
proportional to the length of B. That wouldn't save any time compared
to simply setting the register to CRC(A) and running the B stream
through. I figured there must be a faster way to compute the effect
of feeding n zeros into the CRC machine (where n = length(B)). So
that leads us to:

Trick #2: The CRC machine is a linear state machine. If we know the
linear transformation that occurs when we feed a zero to the machine,
then we can do operations on that transformation to more efficiently
find the transformation that results from feeding n zeros into the
machine.

The transformation of feeding a single zero bit into the CRC machine
is completely represented by a 32x32 binary matrix. To apply the
transformation we multiply the matrix by the register, taking the
register as a 32 bit column vector. For the matrix multiplication in
binary (i.e. over the Galois Field of 2), the role of multiplication
is played by and'ing, and the role of addition is played by exclusive-
or'ing.

There are a few different ways to construct the magic matrix that
represents the transformation caused by feeding the CRC machine a
single zero bit. One way is to observe that each column of the matrix
is what you get when your register starts off with a single one in
it. So the first column is what you get when the register is 100...
and then feed a zero, the second column comes from starting with
0100..., etc. (Those are referred to as basis vectors.) You can see
this simply by doing the matrix multiplication with those vectors.
The matrix multiplication selects the column of the matrix
corresponding to the location of the single one.

Now for the trick. Once we have the magic matrix, we can set aside
the initial register contents for a while, and instead use the
transformation for one zero to compute the transformation for n
zeros. We could just multiply n copies of the matrix together to get
the matrix for n zeros. But that's even worse than just running the n
zeros through the machine. However there's an easy way to avoid most
of those matrix multiplications to get the same answer. Suppose we
want to know the transformation for running eight zero bits, or one
byte through. Let's call the magic matrix that represents running one
zero through: M. We could do seven matrix multiplications to get R =
MxMxMxMxMxMxMxM. Instead, let's start with MxM and call that P. Then
PxP is MxMxMxM. Let's call that Q. Then QxQ is R. So now we've
reduced the seven multiplications to three. P = MxM, Q = PxP, and R =
QxQ.

Now I'm sure you get the idea for an arbitrary n number of zeros. We
can very rapidly generate transformation matrices Mk, where Mk is the
transformation for running 2 to the power k zeros through. (In the
paragraph above M3 is R.) We can make M1 through Mk with only k
matrix multiplications, starting with M0 = M. k only has to be as
large as the number of bits in the binary representation of n. We can
then pick those matrices where there are ones in the binary
representation of n and multiply them together to get the
transformation of running n zeros through the CRC machine. So if n =
13, compute M0 x M2 x M3.

If j is the number of one's in the binary representation of n, then we
just have j - 1 more matrix multiplications. So we have a total of k
+ j - 1 matrix multiplications, where j <= k = floor(log-base-2(n)).

Now we take our rapidly constructed matrix for n zeros, and multiply
that by CRC(A) to get CRC(A0). We can compute CRC(A0) in O(log(n))
time, instead of O(n) time. We exclusive or that with CRC(B) and
Voila! (really this time), we have CRC(Z).
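
Put together, a sketch of the whole scheme in C (again an
unconditioned, reflected CRC, with the register taken as a 32-bit
column vector, least significant bit first; the names are mine and
this is the idea only, not a copy of crc32.c):

#include <stddef.h>
#include <stdint.h>

typedef uint32_t gf2_matrix[32];    /* mat[j] = column for register bit j */

/* Matrix times vector over GF(2): multiplication is and, addition is
   exclusive-or, so the product is the xor of the selected columns. */
static uint32_t gf2_times(const gf2_matrix mat, uint32_t vec)
{
    uint32_t sum = 0;
    for (int j = 0; j < 32; j++)
        if (vec & (1u << j))
            sum ^= mat[j];
    return sum;
}

/* square = mat * mat: column j of the square is mat applied to column j. */
static void gf2_square(gf2_matrix square, const gf2_matrix mat)
{
    for (int j = 0; j < 32; j++)
        square[j] = gf2_times(mat, mat[j]);
}

/* The magic matrix M for one zero bit: column j is what one division
   step does to a register holding only bit j (basis vector j). */
static void zero_bit_matrix(gf2_matrix mat)
{
    for (int j = 0; j < 32; j++) {
        uint32_t reg = 1u << j;
        mat[j] = (reg >> 1) ^ ((reg & 1) ? 0xEDB88320u : 0);
    }
}

/* CRC(AB) from CRC(A), CRC(B) and length(B) in bytes, for a CRC with
   no pre/post conditioning: apply M^(8*len_b) to CRC(A), then xor in
   CRC(B).  O(log n) matrix work instead of O(n) register steps. */
uint32_t crc_combine(uint32_t crc_a, uint32_t crc_b, size_t len_b)
{
    gf2_matrix op, sq;
    size_t n = len_b * 8;               /* number of zero bits to apply */

    zero_bit_matrix(op);                /* op starts as M = M^(2^0) */
    while (n) {
        if (n & 1)                      /* this power is in n's binary
                                           expansion, so apply it */
            crc_a = gf2_times(op, crc_a);
        n >>= 1;
        if (n) {                        /* square to get the next power */
            gf2_square(sq, op);
            for (int j = 0; j < 32; j++)
                op[j] = sq[j];
        }
    }
    return crc_a ^ crc_b;               /* Trick #1: CRC(A0) ^ CRC(0B) */
}

(One difference in bookkeeping from the description above: instead of
multiplying the selected Mk matrices together and applying the product
once, the sketch applies each selected power directly to the register
as it goes; the amount of work is the same order either way.)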

That's what zlib's crc32_combine() does.

I will leave it as an exercise for the reader as to how to deal with
the pre and post conditioning of the CRC register. You just need to
apply the linearity observations above. Hint: You don't need to know
length(A). In fact crc32_combine() only takes three arguments:
CRC(A), CRC(B), and length(B) (in bytes).

Mark

Christian
Sep 19, 2008, 5:38:54 AM

Hi Mark,

first, I thought that this crc32_combine() routine would be of limited
use. Then I came up with an excellent use case: a parallel CRC32!

The parallelization of CRC32 becomes almost trivial with this
function:
1) spawn some threads
2) divide your data into equally-sized blocks
3) let each thread calculate the CRC32 of its block using crc32()
4) combine all local CRCs using crc32_combine()
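
A bare-bones sketch of those four steps (not my actual code: fixed
thread count, no error handling, and block sizes are assumed to fit
zlib's length types):

#include <stddef.h>
#include <pthread.h>
#include <zlib.h>

#define NTHREADS 4

struct job {
    const unsigned char *buf;
    size_t len;
    unsigned long crc;
};

static void *worker(void *arg)            /* step 3: CRC32 of one block */
{
    struct job *j = arg;
    j->crc = crc32(crc32(0L, Z_NULL, 0), j->buf, j->len);
    return NULL;
}

unsigned long parallel_crc32(const unsigned char *buf, size_t len)
{
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    size_t chunk = len / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {  /* steps 1 and 2 */
        jobs[i].buf = buf + (size_t)i * chunk;
        jobs[i].len = (i == NTHREADS - 1) ? len - (size_t)i * chunk : chunk;
        pthread_create(&tid[i], NULL, worker, &jobs[i]);
    }

    unsigned long crc = crc32(0L, Z_NULL, 0);
    for (int i = 0; i < NTHREADS; i++) {  /* step 4: stitch them together */
        pthread_join(tid[i], NULL);
        crc = crc32_combine(crc, jobs[i].crc, jobs[i].len);
    }
    return crc;
}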

It took me only a couple of minutes to implement this using pthreads
and zlib. Tests on a quad-core AMD Opteron system showed the following
throughputs:
original crc32(): 700 MiB/sec
parallel crc32 with 1 thread: 700 MiB/sec (x1.00 => no noticeable overhead for large data)
parallel crc32 with 2 threads: 1380 MiB/sec (x1.97)
parallel crc32 with 4 threads: 2400 MiB/sec (x3.43)
(memcpy() achieves 2200 MiB/sec)

I think this is quite an impressive speedup! I expected the memory bus
speed to become a bottleneck much earlier.

Mark, are you considering adding a parallel crc32() in one of the next
releases of zlib? All existing applications using zlib's crc32() could
directly benefit from the multicore era.

Best regards,
Christian

Mark Adler
Sep 19, 2008, 9:38:28 AM

On Sep 19, 2:38 am, Christian <iB...@gmx.de> wrote:
> first, I thought that this crc32_combine() routine would be of limited
> use. Then I came up with an excellent use case: a parallel CRC32!

Yes, that is one of the intended applications. pigz (parallel gzip)
uses crc32_combine().

> It took me only a couple of minutes to implement this using pthreads
> and zlib.

...


> Mark, are you considering adding a parallel crc32() in one of the next
> releases of zlib?

Yes, I am considering that, as well as parallel compression. However
I am unsure as to the portability of the POSIX pthread library. zlib
needs to work under a very wide range of operating systems, which it
does currently. For example, is pthread available under Windows?

Mark

Wojciech Muła
Sep 19, 2008, 10:33:16 AM

Mark Adler <mad...@alumni.caltech.edu> wrote:

> Yes, I am considering that, as well as parallel compression. However
> I am unsure as to the portability of the POSIX pthread library. zlib
> needs to work under a very wide range of operating systems, which it
> does currently. For example, is pthread available under Windows?

Yes: http://sourceware.org/pthreads-win32/, but the license is LGPL.

w.

whygee
Sep 21, 2008, 7:07:53 AM

Hello,

Mark Adler wrote:
> On Sep 17, 5:55 pm, Daryle Walker <dary...@gmail.com> wrote:
>> I'm wondering how to combine two CRCs together, knowing their initial states.
> ...
>> And what is Mr. Adler doing in "crc32.c" of the zlib code?
>
> That's DR. Adler to you buddy! :-)
> crc32_combine() in zlib is based on two key tricks. For what follows,
> we set aside the fact that the standard 32-bit CRC is pre and post-
> conditioned. We can deal with that later. Assume for now a CRC that
> has no such conditioning, and so starts with the register filled with
> zeros.

<snip>


> That's what zlib's crc32_combine() does.
>
> I will leave it as an exercise for the reader as to how to deal with
> the pre and post conditioning of the CRC register. You just need to
> apply the linearity observations above. Hint: You don't need to know
> length(A). In fact crc32_combine() only takes three arguments:
> CRC(A), CRC(B), and length(B) (in bytes).

I admire the quality and importance of your work, and this explanation
is excellent (meaning: I've been able to understand the underlying
principles, which is not that obvious :-))

However, I would like to discuss the combination in the context of "parallel"
compression: combining CRCs is possible but still complex.
OTOH, if compressed blocks are separately processed, why not keep their
individual CRCs? Today, 4 bytes added to a 64KB or 1MB block is ... negligible,
and allows better location of errors.

--- long story ---

I have also looked at the combination, 2 years ago, because I have found
that some recent processors (P3+ and other wide-instruction-window CPUs
like Alpha and PPC) can compute 2 (or even 3?) concurrent CRCs
in the same core: the slowest part is always the computation of the LUT address,
so interleaving 2 CRC computations allows better use of the core.
With the proper C code, GCC3 -Os gives very decent asm code that
reuses the proper registers.

Look at void CRC16_block_multi() in http://whygee.club.fr/sources/lm78/mds_crc16.c
It was published in the French Linux Magazine in 2005 and was tested and
benchmarked on many computers. It's targeted at CRC16 but is trivial to
adapt to CRC32. Use, abuse, hack and attribute as you wish.
Given fixed-size blocks, maybe one of the CRCs can already be initialised
with a "pre-combined value"...
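
The shape of the interleaving idea, roughly (this is not the published
code, just a CRC32 rendition of it):

#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[256];     /* usual reflected CRC-32 table */

static void make_crc_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[n] = c;
    }
}

/* Two independent blocks, two registers, one loop: the two LUT address
   computations and loads can overlap in the pipeline instead of each
   iteration waiting on a single dependent chain. */
void crc32_two_blocks(const unsigned char *a, uint32_t *crc_a,
                      const unsigned char *b, uint32_t *crc_b, size_t len)
{
    uint32_t ca = *crc_a, cb = *crc_b;
    for (size_t i = 0; i < len; i++) {
        ca = crc_table[(ca ^ a[i]) & 0xff] ^ (ca >> 8);
        cb = crc_table[(cb ^ b[i]) & 0xff] ^ (cb >> 8);
    }
    *crc_a = ca;
    *crc_b = cb;
}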

BUT CRCs keep the disadvantage of the LUT (and the annoying access,
requiring several instructions on RISC CPUs and complex addressing modes
on x86, which are decoded to blocking, slow u-ops). So after further research,
I finally designed what I called "tGFSR32x4", which computes 4 "pseudo-CRCs"
at a time with < 7 32-bit registers and no LUT. Speed is 2x faster than
a normal optimised CRC32. It does not need out-of-order execution and
renamed registers to be fast. It was published, documented, benchmarked etc.
in the same French magazine in October 2007. Source is available at
http://whygee.club.fr/sources/lm98/tGFSR32x4.c ; it is portable,
endian-agnostic and tolerates misaligned input. It was designed to replace
CRC16 on modern CPUs but can also be used as a CRC32 look-alike.

There is still a problem with tGFSR32x4, however: it gives 128 bits
that must be "compacted" to 16 (or 32 in more general cases).
I have not yet found a suitable, minimal, fast hash function for this
(I have some hints but no time to go further).
I wish I were smarter...

--- /long story ---

Yet, in the end, why is it necessary to keep a single CRC32 for a whole file,
when several CRCs can be computed by separate cores or threads, or
even 2 CRCs by a single function as in CRC16_block_multi()?
The cost of combining CRCs is reduced, but still there, and
this does not add much safety (IMHO). AFAIK, CRC32's best error
immunity is for blocks smaller than 8KB. Signing a 1MB block
is less "immune". CRC32 wraps around after 4Gbits or 512MB,
but relying on a single checksum for such a large block is not
a good bet.

Will your next efforts use individual CRCs for individual blocks?
If 4 bytes are added every 64KB or 1MB, it's a really insignificant
overhead IMHO.

Respectful regards,
> Mark
YG

Mark Adler
Sep 21, 2008, 12:53:47 PM

On Sep 21, 4:07 am, whygee <why...@yg.yg> wrote:
> OTOH, if compressed blocks are separately processed, why not keep their
> individual CRCs? Today, 4 bytes added to a 64KB or 1MB block is ... negligible,
> and allows better location of errors.

If I wanted to define a new format, I could put in more error
detection. The PNG format does that. The context of the discussion
however was for using crc32_combine() for the parallel construction of
already existing and widespread formats, such as gzip, which typically
have a single CRC or other error check at the end. But to answer your
question:

Yes, a CRC per some reasonable sized block is exactly the correct
thing to do when you have a real probability of bit errors that is
proportional to the length of the stream. Error-detection (and often
correction) is regularly applied on a block basis for all data
transmission and data storage.

However all that transmission and storage error detection has already
been done before we got to the compression and decompression. So what
is the CRC in gzip and other compression formats protecting against?

The three possibilities that come to mind are random CPU, memory, or
bus transmission errors in the process, rare previously undetected
systematic errors in the compression or decompression algorithms, or a
systematic error that corrupted the data somewhere between the
compressor and decompressor. Far and away the most common trigger for
a decompression CRC error is the last of these. It is almost always
due to a systematic data corruption, e.g. an unintended end-of-line
conversion, a null-byte deletion or conversion, or a packet extraction
and reassembly error.

So in the end, one of the most important purposes of the CRC is for
bug isolation and finger-pointing! It's very easy, and altogether too
common, for a software developer to email the author of the
compression/decompression code about "the bug in zlib" or whatever. A
quick look at the stream being fed to the decompressor, with the aid
of the CRC and the subsequent comparison with what came out of the
compressor, will prove to the developer that they need to go back and
look more carefully at their own code before they start trying to cast
blame elsewhere.

As for "better location of errors", knowing where the errors are
doesn't really help much, or at all, with recovering the data. If you
have a true source of random errors between your compressor and
decompressor that you need to address, you should be wrapping the data
in a Reed-Solomon code to permit the correction of those errors.
Actually what you really should be doing is buying new hardware to go
between those two points, since what you are using is obviously
defective.

In summary, a single error check handles the most common case (bugs
outside of the compression system) just as well as multiple error
checks would. Multiple error checks would indeed be a benefit if you
are seeing random errors in the processing, transmission, or storage
of the data. But in that case you've got a much bigger problem.

> The cost of combining CRCs is reduced, but still there, and
> this does not add much safety (IMHO).

If I did develop a format that used many CRCs on a per-block basis, I
would also include the overall CRC to detect another error that the
block checks wouldn't. You get all the blocks perfectly and all their
CRCs check out, but then you reassemble the blocks in the wrong
order. That error (which again might be a bug somewhere) would not be
detected. A single combined CRC would detect that.

Mark

whygee
Sep 22, 2008, 1:22:29 AM

Thank you for this answer, you make excellent points :-)

It is also true that my domain and application are a bit different
(I don't deal with "files" but very, very long streams, it's more
a network-like context) so I can't consider an overall CRC.
I can't also exclude other (more probable) kinds of errors,
like a spurious loss of communication (somebody accidentally walking
on a cable or some stupid thing like that), or a bad block somewhere,
and this should not break the whole thing completely
(even if the lost data can't be recovered).

What I do is similar to network packets: each one is protected with a CRC,
and I add a header with a constant signature, a sequence number and a size.
Considering that each packet could be computed by a different CPU
(it is already parallel), I focus on individual signature/CRC speed now.
Given that tGFSR32x4 is not finished, it is a good backup plan
to use the well-known CRC32. [It is easily processed in an FPGA,
and some architectures provide instructions or CRC-enabled
DMA channels on-chip.]

So I'm curious: how was SW CRC speed doubled recently?
I already process 4 bytes per loop and could unroll
if the prologue and epilogue size did not explode.
Unrolling the CRC lookup also gives only a marginal speedup,
because the time is wasted computing addresses.
Have I missed a recent breakthrough?

Regards,

> Mark
YG

Daryle Walker
Sep 22, 2008, 6:30:05 AM

On Sep 18, 1:51 am, Mark Adler <mad...@alumni.caltech.edu> wrote:
> On Sep 17, 5:55 pm, Daryle Walker <dary...@gmail.com> wrote:
>
> > I'm wondering how to combine two CRCs together, knowing their initial states.
> ...
> > And what is Mr. Adler doing in "crc32.c" of the zlib code?
>
> That's DR. Adler to you buddy! :-)
>
> crc32_combine() in zlib is based on two key tricks. For what follows,
> we set aside the fact that the standard 32-bit CRC is pre and post-
> conditioned. We can deal with that later. Assume for now a CRC that
> has no such conditioning, and so starts with the register filled with
> zeros.

Actually, the conditioning is what my post was mainly about. I had the
feeling that I was right, but I didn't know for sure. I'm a lot more
certain now, because I wrote the code and it worked.

> Trick #1: CRCs are linear. So if you have stream X and stream Y of
> the same length and exclusive-or the two streams bit-by-bit to get Z,
> i.e. Z = X ^ Y (using the C notation for exclusive-or), then CRC(Z) =
> CRC(X) ^ CRC(Y). For the problem at hand we have two streams A and B
> of differing length that we want to concatenate into stream Z. What
> we have available are CRC(A) and CRC(B). What we want is a quick way
> to compute CRC(Z). The trick is to construct X = A concatenated with
> length(B) zero bits, and Y = length(A) zero bits concatenated with B.
> So if we represent concatenation simply by juxtaposition of the
> symbols, X = A0, Y = 0B, then X^Y = Z = AB. Then we have CRC(Z) =
> CRC(A0) ^ CRC(0B).
>
> Now we need to know CRC(A0) and CRC(0B). CRC(0B) is easy. If we feed
> a bunch of zeros to the CRC machine starting with zero, the register
> is still filled with zeros. So it's as if we did nothing at all.
> Therefore CRC(0B) = CRC(B).
>
> CRC(A0) requires more work however. Taking a non-zero CRC and feeding
> zeros to the CRC machine doesn't leave it alone. Every zero changes
> the register contents. So to get CRC(A0), we need to set the register
> to CRC(A), and then run length(B) zeros through it. Then we can
> exclusive-or the result of that with CRC(B) = CRC(0B), and we get what
> we want, which is CRC(Z) = CRC(AB). Voila!

It took me a while to understand this at first, and I thought I should
stick with what I had. I still am, but I think our ideas are pretty
much the same thing.

I have to modify my reasoning from last time. Those equations are
based on combining final answers from CRC computations. But I'm
working on computation objects that update a CRC _one_ submitted bit
at a time. (I'll build up to bytes later.) In this case, I have to
exclude the "x^d" factor from my equations, giving:

I1 * x^L1 + S1 = Q1 * G + C1 [3a]
I2 * x^L2 + S2 = Q2 * G + C2 [4a]
Ic * x^Lc + Sc = Qc * G + Cc [5a]

with:

Sc = S1 * x^L2 + S2 [6]
Lc = L1 + L2 [7]

Ic = I1 [15]

assuming we're extending the computation for S1 with the bytes from
S2. The updated result is:

(C1 - I2) * x^L2 + C2 == Cc (mod G) [14a]

This can be constructed by:

1. Initialize a remainder register with C1
2. "Subtract" I2 from it, i.e. C1 -> C1 xor I2
3. If L2 is zero, then "add" C2 into the register, i.e. C1 -> C1 xor
C2, and return the updated remainder
(Note that C2 will equal I2 here, so C1 will end up with its
original value, which is expected with no new bits!)
4. Else if 0 < L2 < d, "add" the upper bits of C2 onto the lower ones
of the register, i.e. C1 -> C1 xor (C2 / 2**L2)
5. Now, use the register as the focal point to feed in new bits using
the remainder-updating routine you're already using for regular CRC
work
6. If L2 > d, feed in L2 - d zero/false bit values
7. Feed in the bits of C2; all of them if L2 >= d, the lower L2 ones
if 0 < L2 < d (i.e. you did step 4)
8. The register's state replaces C1 in C1's computing object

(d is the degree of the divisor polynomial, and one greater than the
remainder's maximum degree.)
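
In rough C, those steps come out something like this (only a sketch,
with an example 32-bit key and names of my own; the bit-feeding routine
is the same kind used for regular CRC updates):

#include <stdint.h>

#define D    32            /* degree of G; remainders are D bits wide */
#define POLY 0x04C11DB7u   /* example key, G_d implicit */

/* Regular remainder update for one new message bit (no x^d factor). */
static uint32_t feed_bit(uint32_t reg, int bit)
{
    int top = (reg >> (D - 1)) & 1;
    reg = (reg << 1) | (uint32_t)bit;
    return top ? reg ^ POLY : reg;
}

/* Steps 1-8: extend the running CRC c1 by a second sequence whose
   running CRC is c2, initial remainder i2, and bit length l2. */
uint32_t crc_extend(uint32_t c1, uint32_t c2, uint32_t i2, uint64_t l2)
{
    uint32_t reg = c1 ^ i2;                     /* steps 1 and 2 */
    if (l2 == 0)
        return reg ^ c2;                        /* step 3 (here c2 == i2) */
    if (l2 < D)
        reg ^= c2 >> l2;                        /* step 4: overlapping bits */
    for (uint64_t k = D; k < l2; k++)           /* step 6: l2 - d zero bits */
        reg = feed_bit(reg, 0);                 /*   (the loop a matrix trick
                                                     could later replace)   */
    uint64_t nbits = (l2 < D) ? l2 : D;         /* step 7: bits of C2 */
    for (uint64_t k = 0; k < nbits; k++)
        reg = feed_bit(reg, (c2 >> (nbits - 1 - k)) & 1);
    return reg;                                 /* step 8: the new C1 */
}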

> Well, actually the voila is premature. I wasn't at all satisfied with
> that answer. I didn't want a calculation that took a time
> proportional to the length of B. That wouldn't save any time compared
> to simply setting the register to CRC(A) and running the B stream
> through. I figured there must be a faster way to compute the effect
> of feeding n zeros into the CRC machine (where n = length(B)). So
> that leads us to:
>
> Trick #2: The CRC machine is a linear state machine. If we know the
> linear transformation that occurs when we feed a zero to the machine,
> then we can do operations on that transformation to more efficiently
> find the transformation that results from feeding n zeros into the
> machine.
>
> The transformation of feeding a single zero bit into the CRC machine
> is completely represented by a 32x32 binary matrix. To apply the
> transformation we multiply the matrix by the register, taking the
> register as a 32 bit column vector. For the matrix multiplication in
> binary (i.e. over the Galois Field of 2), the role of multiplication
> is played by and'ing, and the role of addition is played by exclusive-
> or'ing.

I did decide to use a version of this trick in my code. (It would be
within the implementation of step 6 above.)

> There are a few different ways to construct the magic matrix that
> represents the transformation caused by feeding the CRC machine a
> single zero bit. One way is to observe that each column of the matrix
> is what you get when your register starts off with a single one in
> it. So the first column is what you get when the register is 100...
> and then feed a zero, the second column comes from starting with
> 0100..., etc. (Those are referred to as basis vectors.) You can see
> this simply by doing the matrix multiplication with those vectors.
> The matrix multiplication selects the column of the matrix
> corresponding to the location of the single one.

I couldn't understand this for a few hours, then I got inspired and
tried it out on paper. If a remainder GF(2) polynomial can be
represented by the vector:

r = [r0 r1 ... r_d-2 r_d-1]

then a new bit N being entered gives:

r -> r_new
[r0 ... r_d-1] -> [N r0 ... r_d-2] - r_d-1 * [G0 ... G_d-1]

where [G0 ... G_d-1] are the components of the divisor polynomial G,
excluding the always-set G_d. When N is zero, the transformation
matrix can be expressed as the left-factor:

[0 0 0 ... 0 -G0]
[1 0 0 ... 0 -G1]
[0 1 0 ... 0 -G2]
[0 0 1 ... 0 -G3]
...
[0 0 0 ... 1 -G_d-1]

with the remainder as a right-factor column vector. Note that
negation doesn't actually change the value in GF(2), just like
addition and subtraction being identical. (I think using a constant
one or an arbitrary value for N would need augmented matrices and
vectors; zeros are easy to generate without them.)
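
Stored by columns in C, that matrix is tiny to build (a sketch; the key
here is just an example, and since negation is identity in GF(2) the
minus signs drop out):

#include <stdint.h>

#define D     32
#define G_LOW 0x04C11DB7u   /* G0..G_(d-1), with G_i in bit i; example key */

/* The zero-feed matrix above, one column per register component:
   bit i of col[j] is the entry in row i, column j. */
void zero_feed_matrix(uint32_t col[D])
{
    for (int j = 0; j < D - 1; j++)
        col[j] = 1u << (j + 1);    /* the shifted-identity columns */
    col[D - 1] = G_LOW;            /* last column: [G0 G1 ... G_(d-1)] */
}

/* Applying it to a remainder r = [r0 r1 ... r_(d-1)]: the product is
   the XOR of the columns selected by the set components of r. */
uint32_t apply_zero_feed(const uint32_t col[D], uint32_t r)
{
    uint32_t out = 0;
    for (int j = 0; j < D; j++)
        if ((r >> j) & 1)
            out ^= col[j];
    return out;
}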

This is just exponentiation by squaring, using the binary expansion of
the exponent n, right? (For some n, there are better ways to do the
multiplication order and grouping besides binary expansion, but
they're really better only when n is fixed because finding the order
at run-time is not trivial.)

> Now we take our rapidly constructed matrix for n zeros, and multiply
> that by CRC(A) to get CRC(A0). We can compute CRC(A0) in O(log(n))
> time, instead of O(n) time. We exclusive or that with CRC(B) and
> Voila! (really this time), we have CRC(Z).

I tried to use bit-level matrices and vectors, but I got tired of it
and just used my own GF(2) type along with my framework's existing
vector and matrix types. I'm taking a size (and probably speed) hit,
but I can always optimize later. At least I know that the algorithm
works, instead of worrying that I messed up my matrix class code.

> That's what zlib's crc32_combine() does.
>
> I will leave it as an exercise for the reader as to how to deal with
> the pre and post conditioning of the CRC register. You just need to
> apply the linearity observations above. Hint: You don't need to know
> length(A). In fact crc32_combine() only takes three arguments:
> CRC(A), CRC(B), and length(B) (in bytes).

As I said, you're skipping the part I primarily need! My code just
needs those three states and the second sequence's initial remainder.
(It's a general CRC class, so I can't freeze it like you could.)

Daryle Walker

Thomas Pornin
Sep 22, 2008, 8:41:22 AM

According to Mark Adler <mad...@alumni.caltech.edu>:

> If I did develop a format that used many CRCs on a per-block basis, I
> would also include the overall CRC to detect another error that the
> block checks wouldn't. You get all the blocks perfectly and all their
> CRCs check out, but then you reassemble the blocks in the wrong order.
> That error (which again might be a bug somewhere) would not be detected.
> A single combined CRC would detect that.

An alternative to CRC may be the use of cryptographic hash functions,
possibly in a tree-like structure. For instance, for the hash function
h(), and the data to process consisting of n blocks B_1, B_2,... B_n,
define the overall checksum to be: h(h(B_1) || h(B_2) || ... || h(B_n))
(where '||' is concatenation). The tree structure allows for
parallelization.
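
A sketch in C, with OpenSSL's one-shot MD5() standing in for h()
(MD4 would be called the same way; error handling omitted and len
assumed non-zero):

#include <stdlib.h>
#include <openssl/md5.h>

/* Overall checksum = h(h(B_1) || h(B_2) || ... || h(B_n)).  The inner
   hashes are independent of each other, so they can be computed by
   separate threads; only the short final hash is sequential. */
void tree_checksum(const unsigned char *data, size_t len, size_t blk,
                   unsigned char out[MD5_DIGEST_LENGTH])
{
    size_t n = (len + blk - 1) / blk;            /* number of blocks */
    unsigned char *digests = malloc(n * MD5_DIGEST_LENGTH);

    for (size_t i = 0; i < n; i++) {             /* h(B_i) */
        size_t off = i * blk;
        size_t cur = (len - off < blk) ? len - off : blk;
        MD5(data + off, cur, digests + i * MD5_DIGEST_LENGTH);
    }
    MD5(digests, n * MD5_DIGEST_LENGTH, out);    /* h over the digests */
    free(digests);
}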

For a cryptographically secure hash function, you cannot get a
collision, i.e. errors are always detected ("cannot" and "always" are in
the cryptographic sense: collisions do mathematically exist, but there
is no known method to generate one _on purpose_ with current technology;
hence, random errors may produce a collision only with very very low
probability). Actually, a cryptographically broken hash function should
also fare at least as well as a CRC, and probably better, since a
"broken" hash function is a function which has been proposed as a hash
function by a supposedly smart researcher. E.g., the MD4 hash function[*]
was defined in 1990, and actual collisions were found only in 1996,
which means that the function resisted trained researchers for 6 years.
Since CRCs are linear, finding collisions for them is a simple exercise
for undergrad students.

The point of hash functions is that some are very fast indeed. It seems
that on some architectures, the MD4 function is _faster_ than a 32-bit
CRC (I read such reports for ARM platforms). On my 2 GHz Athlon 64
machine (single core), I can hash data with MD4 at a rate of about 550
megabytes/s, which is less than 4 clock cycles per input byte.


(Of course there are some disadvantages as well; for instance, my
MD4 implementation adds about 3.5 kB to the code size, out of which
about 1.5 kB go directly into the main processing path, thus
competing for the scarce L1 cache[**]. Besides, MD4 processes data by
64-byte chunks, and is thus quite slower than a CRC for very short
messages. Yet hash functions should be at least considered for such
error detection jobs.)


--Thomas Pornin

[*] See http://www.ietf.org/rfc/rfc1320.txt

[**] Note that MD4 uses almost no _data_ L1 cache, and this can be a
noticeable improvement over table-based CRC implementations. Depending
on usage.

whygee
Sep 22, 2008, 11:13:56 AM

Hello,

Thomas Pornin wrote:
> An alternative to CRC may be the use of cryptographic hash functions,

<snip>


> For a cryptographically secure hash function, you cannot get a
> collision, i.e. errors are always detected ("cannot" and "always" are in
> the cryptographic sense: collisions do mathematically exist, but there
> is no known method to generate one _on purpose_ with current technology;
> hence, random errors may produce a collision only with very very low
> probability). Actually, a cryptographically broken hash function should
> also fare at least as well as a CRC, and probably better, since a
> "broken" hash function is a function which has been proposed as a hash
> function by a supposedly smart researcher. E.g., the MD4 hash function[*]
> was defined in 1990, and actual collisions were found only in 1996,
> which means that the function resisted trained researchers for 6 years.
> Since CRCs are linear, finding collisions for them is a simple exercise
> for undergrad students.

Well, crypto hashes are designed to be "strong" :
the resulting signature requires >= 128 bits.
CRC32 gives 32 bits only, which is OK for the purpose
we are discussing now.

Non-crypto hash functions are more interesting because they
are even faster and one can use only some of the bits without
breaking the usefulness of the method.

Have a look at Bob Jenkin's work for example :
http://burtleburtle.net/bob/hash/evahash.html
http://burtleburtle.net/bob/hash/index.html

Another excellent analysis of hash functions :
http://bretm.home.comcast.net/~bretm/hash/5.html
(very good)

I am looking closely at http://burtleburtle.net/bob/c/lookup8.c
but don't know how to design a mix() function that
handles 4 words instead of 3 : I need this to hash
a 128-bit value down to 16 or 32 (for tGFSR32x4).

> The point of hash functions is that some are very fast indeed. It seems
> that on some architectures, the MD4 function is _faster_ than a 32-bit
> CRC (I read such reports for ARM platforms). On my 2 GHz Athlon 64
> machine (single core), I can hash data with MD4 at a rate of about 550
> megabytes/s, which is less than 4 clock cycles per input byte.
>
> (Of course there are some disadvantages as well; for instance, my
> MD4 implementation adds about 3.5 kB to the code size, out of which
> about 1.5 kB go directly into the main processing path, thus
> competing for the scarce L1 cache[**]. Besides, MD4 processes data by
> 64-byte chunks, and is thus quite slower than a CRC for very short
> messages.

That's an issue I have with tGFSR32x4: it's only suitable for large blocks (>1KB).

See a discussion of cryptographic hash (SHA-1) there :
http://bretm.home.comcast.net/~bretm/hash/9.html

"
Although, since it provides uniformly distributed output bits,
it could be used for hash table use, it is much slower than
any of the other hash functions evaluated here and for that
reason it is not recommended.

It's slow speed is due to the complexity of the algorithm.
The increased complexity is required to ensure resistance
to cryptanalytic or brute force attacks.
"

The analysis could be done for MD4, MD5 or others,
but I presume that the result would be roughly the same.


> Yet hash functions should be at least considered for such
> error detection jobs.)

well, "excellent" hash functions can be considered, but that's a tough subject.
"good" hash functions are not good enough.

> --Thomas Pornin
>
> [*] See http://www.ietf.org/rfc/rfc1320.txt
>
> [**] Note that MD4 uses almost no _data_ L1 cache, and this can be a
> noticeable improvement over table-based CRC implementations. Depending
> on usage.

lookup8 by Bob Jenkins is even better, requiring only a few registers
and a few hundred lines of C code. I wish I had designed something that smart :-)

regards,
yg

Thomas Pornin
Sep 22, 2008, 6:08:19 PM

According to whygee <why...@yg.yg>:

> Well, crypto hashes are designed to be "strong" :
> the resulting signature requires >= 128 bits.

(Nitpick: the word "signature" is inappropriate here.)

A hash function does not "require" its full output. It needs it to
provide its cryptographic strength, i.e. resistance to collisions.
If we only want to detect random alterations (transmission errors,
as opposed to malicious targetted modifications), then the output
can be safely truncated. That's what we are talking about here:
an output of a given hash function, truncated to 32 bits, as
a replacement to a CRC32.

Moreover, we are talking about it when it comes to checksumming
compressed data (and not about other usages such as indexing short
strings); the ability to efficiently process very short chunks of data
is not a real concern.


> Non-crypto hash functions are more interesting because they
> are even faster

That is at best arguable. You should try. You should not rely on
approximate measures performed on machines 15 years ago. In particular,
keep in mind that not all hash functions provide similar performance.

Let's take, for example, the function whose code you are "looking
closely" at:


> I am looking closely at http://burtleburtle.net/bob/c/lookup8.c

This code includes three implementations of the same function, called
hash(), hash2(), hash3(). For the sake of the discussion, we will ignore
the various compatibility issues which make the code incorrect on any
64-bit architecture; hence, let us try it in plain i386 32-bit mode.
The machine is a 2 GHz "Athlon 64", running Linux, with gcc-4.2.3. Let's
bench those three functions, compiled with "-O2 -fomit-frame-pointer".
We get this:

hash 230
hash2 34
hash3 383

These are expressed in megabytes per second. Hash data is an 8-kilobyte
buffer which is repeatedly processed, and the measure code performs some
"warm-up" so that everything is in L1 cache. We want to be fair.

Now, same machine, same compiler, same compilation flags, same measures,
for a few hash functions:

MD4 584
MD5 340
SHA-1 174
Panama 415


So, which is faster now ?


--Thomas Pornin

Mark Adler
Sep 22, 2008, 6:23:31 PM

On Sep 21, 10:22 pm, whygee <why...@yg.yg> wrote:
> (I don't deal with "files" but very, very long streams, it's more
> a network-like context) so I can't consider an overall CRC.
> I can't also exclude other (more probable) kinds of errors, ...

You are describing the usual issues with transmission, which are best
handled at the transport layer as it sounds like you are doing. The
sorts of compression software and formats we're talking about
generally live well above the transport layer, and shouldn't have to
worry about any of that.

> So I'm curious: how was SW CRC speed doubled recently?

> I already process 4 bytes per loop ...

Exclusive-oring the four bytes into the register is one part of the
speedup. Another part of the speedup is from the pre-generation of
tables that provide the register update for each of the four bytes in
the word. The standard algorithm already has the first of those
tables, so this approach adds three more tables that each have 256
entries of 32 bits. What you end up with for every four bytes is: one
word fetch, four lookups, three shifts, and four exclusive-ors,
instead of: four byte fetches, four lookups, four shifts, and eight
exclusive-ors.

The four tables are simply the CRC of each possible byte (the table in
the standard algorithm), each possible byte followed by one zero byte,
each possible byte followed by two zeros, and each possible byte
followed by three zeros.

This adds a fair bit of complexity to the source code, since it needs
to deal with big-endian and little-endian architectures. There are
two different versions of the CRC routine to do this, and two
different versions of the four tables to account for the endian cases.
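
A sketch of the little-endian flavor (the idea only, not the actual
crc32.c: conditioning, alignment handling and the leftover-byte tail
are left out, and the names are just for illustration):

#include <stddef.h>
#include <stdint.h>

static uint32_t tab[4][256];   /* tab[k][b] = CRC of byte b followed by k zero bytes */

static void make_tables(void)
{
    for (uint32_t b = 0; b < 256; b++) {        /* the standard table */
        uint32_t c = b;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        tab[0][b] = c;
    }
    for (uint32_t b = 0; b < 256; b++)          /* byte plus 1, 2, 3 zero bytes */
        for (int k = 1; k < 4; k++)
            tab[k][b] = tab[0][tab[k - 1][b] & 0xff] ^ (tab[k - 1][b] >> 8);
}

/* Word-at-a-time update on a little-endian machine: one word fetch and
   four table lookups per four bytes, instead of four byte-sized rounds.
   buf is assumed word-aligned and len a multiple of 4. */
uint32_t crc32_by4_little(uint32_t crc, const unsigned char *buf, size_t len)
{
    const uint32_t *p = (const uint32_t *)buf;
    while (len >= 4) {
        uint32_t w = crc ^ *p++;                /* fold the register into the word */
        crc = tab[3][w & 0xff]         ^ tab[2][(w >> 8) & 0xff]
            ^ tab[1][(w >> 16) & 0xff] ^ tab[0][(w >> 24) & 0xff];
        len -= 4;
    }
    return crc;
}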

Mark

Mark Adler
Sep 22, 2008, 6:26:29 PM

On Sep 22, 3:30 am, Daryle Walker <dary...@gmail.com> wrote:
> > I will leave it as an exercise for the reader as to how to deal with
> > the pre and post conditioning of the CRC register.
...

> As I said, you're skipping the part I primarily need!

Surely you don't want me to deprive you of the joy of discovery? Hint
2: you can consider the pre-conditioning as another stream you are
exclusive-oring with the stream that starts with a zero register.

Mark

Mark Adler
Sep 22, 2008, 6:37:04 PM

On Sep 22, 5:41 am, Thomas Pornin <por...@bolet.org> wrote:
> An alternative to CRC may be the use of cryptographic hash functions,

There is some value in using a larger check value for a new
compression format, e.g. 64 or maybe 128 bits. Of course, there is no
reason for it to be cryptographic. Hash functions are good though,
since they have the same objective as a check value, which is to
spread the inputs evenly over the possible hash values.

The worry I have with cryptographic hash functions is that they would,
or at least should preclude the ability to write an efficient combine
function, a la crc32_combine(), for them. At least I think that's the
case. It seems that the ability to combine would compromise
cryptography applications.

Are there any good hash functions, cryptographic or not, that are
amenable to an efficient combine? Barring that, if I were developing
a new format I would probably just use a 64-bit CRC.

Mark

Mark Adler
Sep 23, 2008, 12:42:28 AM

On Sep 22, 3:30 am, Daryle Walker <dary...@gmail.com> wrote:
> When N is zero, the transformation
> matrix can be expressed as the left-factor:
>
>   [0 0 0 ... 0 -G0]
>   [1 0 0 ... 0 -G1]
>   [0 1 0 ... 0 -G2]
>   [0 0 1 ... 0 -G3]
>   ...
>   [0 0 0 ... 1 -G_d-1]

You've definitely got the right idea. I use a different convention
for where least significant bits go, so I'm not certain yours is
correct. All you need to do is try the matrix multiplication using
your conventions and compare the result to a known crc. E.g. start
with all ones in the register, multiply by that matrix eight times,
take the ones complement of the result, and compare that to the
standard crc of a zero byte. My matrix looks like: [g0 1 0 ...] [g1 0
1 0 ..]...[g30 0 0 ... 1] [g31 0 ... 0]. I put the least significant
bits first. My polynomial may also be reversed compared to yours.
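
For what it's worth, in my convention that check comes out as the
little sketch below (0xD202EF8D being the standard CRC-32 of a single
zero byte):

#include <assert.h>
#include <stdint.h>

/* One zero bit through the reflected register (polynomial 0xEDB88320,
   least significant bit first). */
static uint32_t zero_step(uint32_t reg)
{
    return (reg >> 1) ^ ((reg & 1) ? 0xEDB88320u : 0);
}

/* Matrix times register: xor of the columns selected by the set bits. */
static uint32_t apply_matrix(const uint32_t mat[32], uint32_t vec)
{
    uint32_t out = 0;
    for (int j = 0; j < 32; j++)
        if (vec & (1u << j))
            out ^= mat[j];
    return out;
}

int main(void)
{
    uint32_t mat[32], reg = 0xFFFFFFFFu;
    for (int j = 0; j < 32; j++)        /* column j = one step on basis vector j */
        mat[j] = zero_step(1u << j);
    for (int i = 0; i < 8; i++)         /* eight zero bits = one zero byte */
        reg = apply_matrix(mat, reg);
    assert((reg ^ 0xFFFFFFFFu) == 0xD202EF8Du);   /* CRC-32 of a zero byte */
    return 0;
}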

> This is just exponentiation by squaring, using the binary expansion of
> the exponent n, right?

Exactly.

Yes, there are ways to reduce the number of multiplications as
compared to this straightforward approach, but it requires more work
to find the scheme for a given n than you save on the multiplications.

Mark

whygee
Sep 23, 2008, 6:24:46 AM

Hello !

Thomas Pornin wrote:
> According to whygee <why...@yg.yg>:
>> Well, crypto hashes are designed to be "strong" :
>> the resulting signature requires >= 128 bits.
>
> (Nitpick: the word "signature" is inappropriate here.)

shame on me /o\

> A hash function does not "require" its full output.

If certain bits are not used, then there are useless
and avoidable computations.

> It needs it to
> provide its cryptographic strength, i.e. resistance to collisions.

well, cryptographic hashes have more constraints than algorithms
targeted at lookup tables.
One of the first things is the minimal length, because hash keys
can be very short. This makes them compatible with usual CRCs.

> If we only want to detect random alterations (transmission errors,
> as opposed to malicious targetted modifications), then the output
> can be safely truncated.

meaning that you throw away a lot of computations :-/

> That's what we are talking about here:
> an output of a given hash function, truncated to 32 bits, as
> a replacement to a CRC32.

yep, more or less, it's the idea.

> Moreover, we are talking about it when it comes to checksumming
> compressed data (and not about other usages such as indexing short
> strings); the ability to efficiently process very short chunks of data
> is not a real concern.

It could be, depending on the application.

>> Non-crypto hash functions are more interesting because they
>> are even faster
>
> That is at best arguable. You should try. You should not rely on
> approximate measures performed on machines 15 years ago. In particular,
> keep in mind that not all hash functions provide similar performance.

And what if I "like" 15-year-old machines? :-)
I have some machines that are > 20 years old...

Seriously, the real point is that you can't count on your
application to run on the latest Opteron. I also deal
a lot with embedded systems and DSPs, and their characteristics are
far from x86_64. If the algo flies on a 486, ARM or MIPS, I'm happy,
because the application will run smoothly on almost all other architectures.
Some more tweaking here and there, and OOO CPUs will be happy too.

In the end, ARM, MIPS, PPC and 486 are older than 15 years, but they
can be found anywhere, everywhere: in your mobile phone, your PDA, etc.

What would be ideal is a hash function that works very well with
an ARM-like CPU (10x 32-bit registers), and is "scalable" to exploit
word-level parallelism, like 128-bit registers, to speed up things
further. I was not able to achieve that with tGFSR32x4, but I don't
think it's impossible.

> Let's take, for example, the function whose code you are "looking
> closely" at:
>> I am looking closely at http://burtleburtle.net/bob/c/lookup8.c

<snip>


> We get this:
>
> hash 230
> hash2 34
> hash3 383

<snip>


> Now, same machine, same compiler, same compilation flags, same measures,
> for a few hash functions:
>
> MD4 584
> MD5 340
> SHA-1 174
> Panama 415
>
> So, which is faster now ?

I'm sorry that I pointed to the 1997 version of the code;
a better version was released in 2006.
Have a look at http://burtleburtle.net/bob/c/lookup3.c

as noted at http://burtleburtle.net/bob/hash/doobs.html :
"Update: I'm leaving the old hash in the text below, but I've got a new hash
at http://burtleburtle.net/bob/c/lookup3.c that is roughly twice as fast
and more thorough than the one below. It's designed along the same lines
as the hash below, 12 byte blocks, switch statements, etc. The biggest
theoretical distinction is it has different mixing functions for the
last block than for all but the last block, at 21 and 24 instructions,
instead of the 36 instruction mix below that serves for both. "

A 64-bit version would be almost 2x faster.

Note also from the author's site :
"
If your messages are aligned to word boundaries and are long (100 bytes or more),
look at Rogaway's bucket hash. That can be much faster than LOOKUP3, but it
only works for long aligned messages. It has lots of 4-bit funnels, but
no funnels smaller than 4-bits, so it's not too bad.
"
I have not looked at this, though.
There has been so much research in these fields that I don't
know where to start...

> --Thomas Pornin
yg

whygee
Sep 23, 2008, 6:49:09 AM

Hello,

Mark Adler wrote:
> On Sep 21, 10:22 pm, whygee <why...@yg.yg> wrote:
>> So I'm curious: how was SW CRC speed doubled recently?
>> I already process 4 bytes per loop ...

> Another part of the speedup is from the pre-generation of
> tables that provide the register update for each of the four bytes in
> the word. The standard algorithm already has the first of the those
> tables, so this approach adds three more tables that each have 256
> entries of 32 bits. What you end up with for every four bytes is: one
> word fetch, four lookups, three shifts, and four exclusive-ors,
> instead of: four byte fetches, four lookups, four shifts, and eight
> exclusive-ors.
>
> The four tables are simply the CRC of each possible byte (the table in
> the standard algorithm), each possible byte followed by one zero byte,
> each possible byte followed by two zeros, and each possible byte
> followed by three zeros.

Wow, I get it :-) Who found this trick?
That's smart and has the potential to exploit OOO cores.
However, on single-issue CPUs there will be no speedup,
so one would revert back to the previous method.
In the end, it's just a #define here or there...
I'll give it a try ASAP!

> This adds a fair bit of complexity to the source code, since it needs
> to deal with big-endian and little-endian architectures. There are
> two different versions of the CRC routine to do this, and two
> different versions of the four tables to account for the endian cases.

That's difficult the first few times, but after a while you get used to code like this :-)

Thank you again for your clear and useful explanations,

> Mark
YG

Thomas Pornin
Sep 23, 2008, 6:58:07 AM

According to Mark Adler <mad...@alumni.caltech.edu>:
> The worry I have with cryptographic hash functions is that they would,
> or at least should preclude the ability to write an efficient combine
> function, a la crc32_combine(), for them.

That's where a tree structure is handy. If the checksum of a sequence
of blocks is the hash of the concatenation of hashes of each block,
then hashes over the individual blocks can be computed in parallel.
This is not as good as crc32_combine(), but this is good enough to
allow for some of the usages of crc32_combine().


--Thomas Pornin

whygee
Sep 23, 2008, 7:23:11 AM

Mark Adler wrote:
> On Sep 22, 5:41 am, Thomas Pornin <por...@bolet.org> wrote:
>> An alternative to CRC may be the use of cryptographic hash functions,
>
> There is some value in using a larger check value for a new
> compression format, e.g. 64 or maybe 128 bits.

One possible use is the recovery of one altered bit, right ?

> Of course, there is no
> reason for it to be cryptographic. Hash functions are good though,
> since they have the same objective as a check value, which is to
> spread the inputs evenly over the possible hash values.
>
> The worry I have with cryptographic hash functions is that they would,
> or at least should preclude the ability to write an efficient combine
> function, a la crc32_combine(), for them. At least I think that's the
> case. It seems that the ability to combine would compromise
> cryptography applications.

obviously.

> Are there any good hash functions, cryptographic or not, that are
> amenable to an efficient combine?

From the little I know, only CRCs are able to do this, because
they are linear.
Maybe Jenkins's hash could be a candidate, because
it is based on reversible operations (but are they linear?)

> Barring that, if I were developing
> a new format I would probably just use a 64-bit CRC.

FWIW, I have also played with "parallel CRC", or a derivation
of the Mersenne Twister, with several 64-bit registers:
a good implementation can make a really, really fast CRC.
For example, with MMX, a 7-word LFSR fits into the register
set. Computation becomes memory-bound... Because 7 bits wrap
after 127 steps, further mixing is provided with a "lateral"
CRC: a "twisted Generalized Feedback Shift Register"
can have very, very long periods, but it's difficult
to prove that one configuration has maximal length.

Such long words limit the use to only 64-bit enabled platforms,
so I scaled the thing down to 4 registers of 32 bits,
giving "tGFSR32x4", which returns 128 bits. Each sub-tGFSR cycles
after 268M iterations (when fed with 0s), but 4*tGFSR covers
1G 0-bytes before wrapping around. Feeding non-0 bytes covers
the 4G states. Oh, and because it's based on LFSRs, it's linear.
But I have not looked at how to "combine" tGFSR32x4 results,
because I intend to "hash" the 128 bits down to 16 or 32, which
is not going to be linear... (CRCing the 128 bits would take too long)

Based on these experiments, I believe that you can
design your own 64-bit hash or CRC that is very fast,
endian-agnostic, alignment-proof, solid etc.

Regards,
> Mark
yg

whygee
Sep 23, 2008, 7:37:24 AM

May I make a few remarks?
It all comes from the fact that a hash is a one-way function,
contrary to the CRC, which is linear and reversible...

In CRCs, the value of the partial result depends on the position of the scanned block.
In order to get the same quality of result, you should hash each block
along with a word that stores the position of the block.

Furthermore, because a hash is not linear (usually), you can't
"remove" one hash from the final hash. You have to recompute the whole hash.
With a CRC, it's just a XOR...

So you have to store all the partial hashes in the file.
Nothing wrong with that, though :-) But crc32_combine()
allows more flexibility (if I understand correctly).

regards,
> --Thomas Pornin
YG (hashes to hashes...)

Thomas Pornin
Sep 23, 2008, 7:59:14 AM

According to whygee <why...@yg.yg>:

> If certain bits are not used, then there are useless
> and avoidable computations.

Be my guest. You are just arguing that hash functions are even
faster than what I was claiming.

Although the gain -- the amount of computations done "for nothing" -- is
very small on a single block, and totally negligible on any message
longer than that. And I daresay it is obvious when simply looking at the
considered hash functions. Go read RFC 1320.


> well, cryptographic hashes have more constraints than algorithms
> targeted at lookup tables.

We are not, and never were, talking about lookup tables. We are talking
about checksums over compressed data, to quickly detect transmission
errors or library misuse.


> Seriously, the real point is the fact that you can't count on your
> application to run on the latest Opteron.

I should not be doing your homework. But I do have embedded systems
around. For testing purposes, I have set up a cheap Linksys router now
running under Linux (OpenWRT), and it features a MIPS core (a 200 MHz
Broadcom BCM3302). Compiler is gcc-4.2.4. Which yields:

hash 11.9
hash2 (segfault)
hash3 19.9

MD4 23.5
MD5 14.5
SHA-1 6.4
Panama 14.2

(Yeah, I too deal with a lot of embedded systems. I am immune to that
specific brand of discussion-winning argument.)

(The segfault is due to the sloppy programming of the lookup8.c code: it
performs direct accesses with little to no care about alignment or
endianness.)


What all this amounts to is the following:

There is a quite common idea, which somehow states that anything which
is related to cryptography is necessarily slow, and that homebrewed
functions are faster. This idea is often held by people who have not
bothered to even have a look at a single cryptographic hash function
(and I am sorry to say it, but your comment about how "many
computations are useless when the output is truncated" makes it very
obvious that you are one of those people). It so happens that this idea
is very wrong (and has been so for quite some time).

This statement equally applies to hash functions, pseudo-random number
generators, and symmetric encryption. At least.


This does NOT mean that homebrewed functions are _necessarily_ bad, or
that cryptographic hash functions _always_ beat them. It just happens
very often. That's what you get when people do not care to look at
existing research before doing their own.

It is _possible_ that by dismissing the need for cryptographic strength,
you may get better performance. Right now, you are talking about
dismissing cryptographic strength and obtaining worse performance, for
the sake of a "pearl of wisdom" which was already wrong 15 years ago.


> There has been so much research in these fields that I don't know
> where to start...

Well, just start by heeding what I write. It is true that there has been
a lot of research on hash functions. Cryptographers in particular have
strong needs and care a lot about performance. Many of them are actively
working on hash functions (namely because existing functions are quite
lacking in terms of security, and the NIST has launched an open
competition for a new hash function[*]). I so happen to be one of those
researchers.

Then if I tell you that there are cryptographic hash functions that are
worth investigating, even for non-cryptographic purposes and for the
sake of performance, then HAVE A LOOK AT THEM before dismissing them
with peremptory handwavings. Dammit.


--Thomas Pornin

[*] See http://csrc.nist.gov/groups/ST/hash/sha-3/index.html
Submissions are to be sent before the end of October. There will be
plenty of activity in the next few years.

whygee

unread,
Sep 23, 2008, 1:40:12 PM9/23/08
to
Hello,

I did not want to trigger some bad mood or something like that.
Also, I'm just here to discuss and share experiences,
in the hope of learning and making a better world for everybody.
But this discussion has gone way off-topic of comp.compression,
and it is not going to be very constructive, so here is my last
post in this branch of the thread.

Thomas Pornin wrote:
> According to whygee <why...@yg.yg>:
>> If certain bits are not used, then there are useless
>> and avoidable computations.
>
> Be my guest. You are just arguing that hash functions are even
> faster than what I was claiming.

Let's just say that my experience has taught me that
"classical CRCs" have some drawbacks and I believe that faster methods
exist that produce similar or better results.
"hash functions" are an interesting family of algos to explore.
In parallel, I'm going to experiment with the lookup table trick that
Dr Adler just explained to me.

> Although the gain -- the amount of computations done "for nothing" -- is
> very small on a single block, and totally negligible on any message
> longer than that. And I daresay it is obvious when simply looking at the
> considered hash functions. Go read RFC 1320.

done.

it looks like many other algos published in Schneier's book
(at least the lousy French 2nd edition I treated myself to a long time ago)

These algos have a large "context" or sliding window, which puts
a lot of pressure on memory accesses when the registers are scarce.
This is necessary for MDx/SHAx but less critical for non-crypto use
(rough guess, since Jenkins uses 3 registers instead of 16).

>> well, cryptographic hashes have more constraints than algorithms
>> targeted at lookup tables.
> We are not, and never were, talking about lookup tables. We are talking
> about checksums over compressed data, to quickly detect transmission
> errors or library misuse.

Hash functions can be used to make a fingerprint of incoming data
- for authentication purposes => cryptographic constraints apply
- for lookup in a hash table => less stringent constraints but still...
- for detection of alterations as a replacement for CRC
==> "combine()" requires the transform (CRC or hash) to be linear,
which makes it absolutely unsuitable for crypto use.

Whether the hash function should be crypto-class or not depends on
the application. My own use does not require this. I'm just a performance
freak; I care a lot (excessively ?) about computational efficiency.

BTW, if you know a minimal hash function that hashes 4 32-bit words down to 32 bits
or 16 bits, I'm interested. The best candidate is Jenkins's recent code,
but it's still a bit too long...

>> Seriously, the real point is the fact that you can't count on your
>> application to run on the latest Opteron.
>
> I should not be doing your homework.

I haven't been a student for so long that I've gladly forgotten the taste of homework ;-)

> But I do have embedded systems around.

cool ! \o/
what kinds ?

> For testing purposes, I have set up a cheap Linksys router now
> running under Linux (OpenWRT), and it features a MiPS core (a 200 MHz
> Broadcom BCM3302).

Some friends run tests for me on their own OpenWRT boxes sometimes.
I still haven't had time to set up my Microchip PIC32 (MIPS) eval board.
Besides, my workshop has many 16, 32 and 64 bit machines, LE and BE,
so I can test my code too. Plus countless boards in PC104, PICMG, ETX,
PCISA, ISA96, 5"1 form factor...

If you want to test code on even more machines, hurry to
http://www.testdrive.hp.com/ because it's going to shut down soon :-(((

> Compiler is gcc-4.2.4. Which yields:

with -Os or -O2 ?

> hash 11.9
> hash2 (segfault)
> hash3 19.9
>
> MD4 23.5
> MD5 14.5
> SHA-1 6.4
> Panama 14.2

hash3 doesn't look /that/ bad...

and MD4 runs very well on MIPS thanks to the large register set
and 3-address instruction format. There would be a hit on ia32
but I'm too lazy to test it now on the 5x86 on my desk.

Unfortunately, the predominance of x86 forces many to "scale down"
the algorithms, and x86-64 is not available on the majority of the
deployed systems :-/

> (Yeah, I too deal with a lot of embedded systems. I am immune to that
> specific brand of discussion-winning argument.)

My goal is to gain knowledge, not to "win arguments" because this never happens
(particularly on Usenet, right ?). I'm not here to prove you wrong or things like that.

> (The segfault is due to the sloppy programming of the lookup8.c code: it
> performs direct accesses with little to no care about alignment or
> endianness.)

At this point in development and optimisation, this is likely to happen.
One must be careful when using such functions. Usually, this is commented
or indicated in the source code. The author probably forgot to test on a BE machine ?

> What all this amounts to is the following:
>
> There is a quite common idea, which somehow states that anything which
> is related to cryptography is necessarily slow, and that homebrewed
> functions are faster.

I'm sorry but I have never thought or written such a thing,
because I know well that an incredible amount of effort is
put into the optimisation of these algorithms.

Yet, when security is not in a project's scope,
I guess that it is legitimate to look at other solutions,
which /could/ mean less of a burden. I guess that's
the heart of the debate here.

Some situations or contexts can also carry cultural biases,
and in those cases any handwaving is useless...
Which does not make anybody "right" or "wrong".

> This idea is often held by people who have not
> bothered to even have a look at a single cryptographic hash function
> (and I am sorry to say it, but your comment about how "many
> computations are useless when the output is truncated" makes it very
> obvious that you are one of those people).

I sometimes speak/write things "not the way they should be expressed"
but, even when misled or misinterpreted, I don't think this makes
me an idiot. Furthermore, not being an expert like you does not
mean that I know absolutely nothing about this field.

And being versed in the art of cryptography would even make
me an easy target in an endless and fruitless "discussion",
as you seem really eager to make a point.

> It so happens that this idea
> is very wrong (and has been so for quite some time).

alright.

> This statement equally applies to hash functions, pseudo-random number
> generators, and symmetric encryption. At least.

Probably, but we're slowly going off-topic here :
this group is meant to discuss compression...

Maybe, if I had time, I would shift this discussion to
a list or forum that was opened recently, about "non-crypto hashes"
(but I'm not subscribed yet, and I lost the address).

> This does NOT mean that homebrewed functions are _necessarily_ bad, or
> that cryptographic hash functions _always_ beat them. It just happens
> very often. That's what you get when people do not care to look at
> existing research before doing their own.

You can't keep people from learning by themselves, either.
It's often the only way for many "hobbyists" not only to learn
but also to understand. We all learn by making mistakes too, right ?

> It is _possible_ that by dismissing the need for cryptographic strength,
> you may get better performance.

Unfortunately, this "eventuality" or "potential angle of experiment"
does trigger alarm bells with you; it seems to be a very sensitive point.

> Right now, you are talking about
> dismissing cryptographic strength and obtaining worse performance, for
> the sake of a "pearl of wisdom" which was already wrong 15 years ago.

Can you elaborate ? I will not answer,
but you seem to make informed statements,
even if we don't know each other.

>> There has been so much research in these fields that I don't know
>> where to start...
>
> Well, just start by heeding what I write. It is true that there has been
> a lot of research on hash functions. Cryptographers in particular have
> strong needs and care a lot about performance.

I know that well.

> Many of them are actively
> working on hash functions (namely because existing functions are quite
> lacking in terms of security,

I may be misinformed, but the recent hash functions are quite suitable
for everyday use. Now, I don't work for any secret service, and I rarely
use encryption (except https). Things can always be better, but I don't
understand your vague statement that
"existing functions are quite lacking in terms of security".
Am I over-confident ?
And if they are "lacking", why do you promote their use as a CRC replacement ?

> and the NIST has launched an open
> competition for a new hash function[*]).
> I so happen to be one of those researchers.

Now I understand why you are so sensitive about this subject.
I am not involved in the SHA3 debate so ...
excuse me if I'm immune (?) to some of your arguments.
Which does NOT diminish the quality or importance of
anyone's work. I just have to stay focused on my duties :-/

> Then if I tell that there are cryptographic hash functions that are
> worth investigating, even for non-cryptographic purposes and for the
> sake of performance, then HAVE A LOOK AT THEM before dismissing them
> with peremptory handwavings. Dammit.

I never meant to anger anybody so please relax.

Besides, I looked at MD4 and it does not look interesting for the applications
that are on my drawing board. For the main project that keeps me busy now,
I just use CRC32 on 64KB blocks, and that's all. Everybody in the targeted
market knows what a CRC32 is, and would scratch their head if MD4 was used.
The customers are mechanical engineers, not cryptographers.
Even the simple techniques I use are quantum leaps compared to their
current "state of the art" so I try to avoid misunderstandings.

Should I be blamed for not using a "strong" hash when it's not
necessary or useful ? This group deals with compression, which
is equivalent to "economy", and the simpler, the better.
Even if it is "sub-optimal" (in some contexts).

Now, computing CRC32 with an FPGA is really, really simple
(it's going to eat several SRAM blocks but there are plenty available).
If I use MD4, latency and complexity explode (I do my best
to avoid state machines, multicycle computations and shift/rotations).
CRC and derivatives like tGFSR don't have these issues.

> --Thomas Pornin
>
> [*] See http://csrc.nist.gov/groups/ST/hash/sha-3/index.html
> Submissions are to be sent before the end of October. There will be
> plenty of activity in the next few years.

I don't consider myself good at cryptography (the more I know,
the more I know I don't know) and I already have to finish
some compression work. Have fun anyway, and I count on you
to explain all the subtleties of the winner's submission.

And now, I'm back to work,
yg

Mark Adler

unread,
Sep 23, 2008, 4:54:32 PM9/23/08
to
On Sep 23, 3:49 am, whygee <why...@yg.yg> wrote:
> Wow, I get it :-) Who found this trick ?

Rodney Brown first sent me the algorithm in April 2002. The first
publicly released version of zlib with it was November 2003. (The
first beta was March 2003.)

The exclusive-or in a word was shown by Richard Black in 1993 here:

http://www.cl.cam.ac.uk/research/srg/bluebook/21/crc/node6.html#SECTION00060000000000000000

Intel has a document from 2005 on another CRC algorithm, possibly
related (I haven't had a chance to look into it yet):

http://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf

http://www.intel.com/technology/comms/perfnet/download/CRC_generators.pdf

It may be the same thing, but a 64-bit fetch and eight tables instead
of a 32-bit fetch and four tables.
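
For reference, a minimal sketch of the 32-bit-fetch, four-table variant --
not zlib's actual code -- assuming a reflected CRC-32 (polynomial
0xEDB88320), a little-endian CPU and a 4-byte-aligned buffer (the same
kind of shortcut that trips up lookup8.c elsewhere in this thread):

#include <stddef.h>
#include <stdint.h>

static uint32_t tab[4][256];

/* Build the four tables: tab[k][n] is the CRC register after processing
   byte n followed by k zero bytes. crc32_make_tables() must be called
   once before crc32_slice4(). */
static void crc32_make_tables(void)
{
    int n, k;
    for (n = 0; n < 256; n++) {
        uint32_t c = (uint32_t)n;
        for (k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        tab[0][n] = c;
    }
    for (n = 0; n < 256; n++) {
        uint32_t c = tab[0][n];
        for (k = 1; k < 4; k++) {
            c = tab[0][c & 0xff] ^ (c >> 8);   /* advance by one zero byte */
            tab[k][n] = c;
        }
    }
}

/* One 32-bit fetch, one XOR into the register, four lookups per 4 bytes. */
static uint32_t crc32_slice4(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    while (len >= 4) {
        crc ^= *(const uint32_t *)buf;   /* assumes little-endian, aligned input */
        crc = tab[3][crc & 0xff]
            ^ tab[2][(crc >> 8) & 0xff]
            ^ tab[1][(crc >> 16) & 0xff]
            ^ tab[0][crc >> 24];
        buf += 4;
        len -= 4;
    }
    while (len--)                        /* byte-at-a-time tail */
        crc = tab[0][(crc ^ *buf++) & 0xff] ^ (crc >> 8);
    return ~crc;
}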

Mark

Mark Adler

unread,
Sep 23, 2008, 6:42:16 PM9/23/08
to
On Sep 23, 1:54 pm, Mark Adler <mad...@alumni.caltech.edu> wrote:
> It may be the same thing, but a 64-bit fetch and eight tables instead
> of a 32-bit fetch and four tables.

Yep, it is.

Mark

Thomas Pornin

unread,
Sep 23, 2008, 6:46:19 PM9/23/08
to
According to whygee <why...@yg.yg>:

> it looks like many other algos published in Schneier's book

Schneier's book is an introduction book but does not contain original
research. This means that the algorithms described in Schneier's book
were published beforehand, and thus what Schneier describes looks
like the previously published algorithms, not the other way round.

(Note that Schneier wrote several books. I assume you are talking
about "Applied Cryptography".)


> I'm just a performance freak, I care a lot (excessively ?) about
> computational efficiency.

That's the point. If you care about efficiency then you really should at
least try some hash functions, since they are fast.


> with -Os or -O2 ?

-O2. -Os makes code more compact but -O2 usually yields better
performance when the critical loop fits in the L1 cache.


> hash3 doesn't look /that/ bad...
>
> and MD4 runs very good on MIPS thanks to the large register set
> and 3-address instruction format. There would be a hit on ia32
> but i'm too lazy to test it now on the 5x86 on my desk.

Tests on ia32 were _precisely_ those I wrote down on my previous
message. The one where MD4 runs at 584 MB/s while hash3() tops out
below 400 MB/s. The CPU is an Athlon64 but I ran the tests in
32-bit mode (i.e. compatibility mode with the 80386) because the
lookup8.c code is not portable enough to properly compile or run
on a 64-bit architecture.

And the point is not to make hash3 look bad; the point is to show that
the assertion "cryptographic hash functions cannot provide such
performance" does not hold.


> > (The segfault is due to the sloppy programming of the lookup8.c code: it
> > performs direct accesses with little to no care about alignment or
> > endianness.)
>
> at this point in development and optimisation, this is likely to happen.

Likely but nonetheless sloppy. My hash function code handles similar
issues quite cleanly. Because portability is actually good for
optimization: portability issues indicate that the programmer is making
things "out of specification", which often implies that the C compiler
is unlikely to grasp what really happens in the code. This impairs the
work of the code generator. By applying proper workarounds, one usually
promotes portability _and_ performance simultaneously.

Or, in other words, first step to optimization is to make sure that
the compiler knows everything it should about the program.


> I guess that it is legitimate to look at other solutions,

Yet again, the whole point. I am arguing that it is legitimate to
consider hash functions where CRC are usually used. And ruling them
out based on an unsupported prejudice about performance is just bad
engineering.

I am NOT telling, and have never told, that CRC should not be used and
that MD4 should always be preferred. If you read that in my messages,
then you read wrong (theoretically, it is possible that I did not
express myself clearly, but my ego will not let me believe that).


> You can't keep people from learning by themselves, either.
> It's often the only way for many "hobbyists" not only to learn
> but also understand. We all learn by making mistakes too, right ?

Learning by making mistakes works only when one notices that he made
mistakes. Thanks to this discussion, you learnt that cryptographic hash
functions are not necessarily slower than CRC or other non-cryptographic
hash functions. Or at least I wrote it plainly, with explanations
and measurements. It is now up to you to learn.


> Can you elaborate ?

Well, _you_ cited an "analysis of SHA-1" which was about 15 years old.


> I may be misinformed, but the recent hash functions are quite suitable
> for everyday use.

Depending on what you mean by "recent hash functions". SHA-256 and
SHA-512 seem decent but have not been analyzed thoroughly, and their
performance is a bit low. HTTPS relies heavily on MD5 and SHA-1, both of
which have known weaknesses. The published weaknesses (namely, a method
to compute collisions) do not affect HTTPS proper, but are an issue for
X.509 certificates which HTTPS uses.


> but I don't understand your vague statement that
> "existing functions are quite lacking in terms of security".
> Am I over-confident ?

From the cryptographic point of view, existing, deployed hash functions
are not considered "secure enough". SHA-1 is in the "grey zone" where
no actual attack was demonstrated, but too many weaknesses were found
to rely on it in full confidence.

That lack of security explains the current effort to define new
functions, which in turn implies that there will soon be new functions
available for non-cryptographic uses as well.


> And if they are "lacking", why do you promote their use as a CRC
> replacement ?

A CRC has no security, in the cryptographic sense (generating
collisions, or even finding pre-images, is trivial). Hence there is no
problem in replacing a CRC with a hash function which, cryptographically
speaking, is broken. A prime example is MD4.


What I "promote" is the following: there are functions, originating from
the field of cryptography, which are quite good at detecting random
errors, at least as good as CRC. Moreover, these functions also provide
good performance (better than CRC or other similar designs on at least
some platforms). Hence they should be at least considered. There can be
a full host of good reasons why a given function should not be used in a
given situations (e.g. code size, difficulty to implement on a given
platform,...). However, the reason you first invoked ("it's
cryptography, hence it's slow") is just plain wrong, and that is what I
pointed out (at length, with figures). It is a common misconception, but
wrong nonetheless.

Now you may call _me_ sensitive if you want to keep on being wrong.
That's your call.


--Thomas Pornin

whygee

unread,
Sep 23, 2008, 7:31:01 PM9/23/08
to
Mark Adler wrote:
> On Sep 23, 3:49 am, whygee <why...@yg.yg> wrote:
>> Wow, I get it :-) Who found this trick ?
>
> Rodney Brown first sent me the algorithm in April 2002. The first
> publicly released version of zlib with it was November 2003. (The
> first beta was March 2003.)
<snip>
Thank you very much !

> Mark
YG

Mihai Cartoaje

unread,
Sep 24, 2008, 12:35:38 PM9/24/08
to
whygee wrote:

> However, I would like to discuss the combination in the context of "parallel"
> compression : combining CRCs is possible but still complex.
> OTOH, if compressed block are separately processed, why not keep their
> individual CRC ? Today, 4 bytes added to a 64KB or 1MB block is ... negligible,
> and allows better location of errors.

YG is right. Users who download the beginning of an archive should be
able to preview it.

whygee

unread,
Sep 27, 2008, 2:00:47 PM9/27/08
to
Hi,

Mark Adler wrote:
> On Sep 23, 3:49 am, whygee <why...@yg.yg> wrote:
>> Wow, I get it :-) Who found this trick ?
>
> Rodney Brown first sent me the algorithm in April 2002. The first
> publicly released version of zlib with it was November 2003. (The
> first beta was March 2003.)

It seems that, thanks to your description alone, I was able to
catch up on those years with a few hours of work. On a P3, the 2-table code
runs 50% faster and the 4-table code 100% faster (in cache).
I'm amazed because I didn't think that CRC32 could get any faster before...

Of course, YMMV depending on the CPU so my code includes 6 versions,
from the most basic to the heaviest. Also, I didn't think that
the 4-LUT code would be faster than 2-LUT on P3 but apparently
gcc does a good job with addressing and the registers.

This is preliminary and I didn't even look at the zlib source code.
However, I looked at a recent Linux kernel and the multi-LUT
trick does not seem to be used there.

Thank you again,
> Mark
YG

Qubeley

unread,
Sep 28, 2008, 12:34:54 AM9/28/08
to

Cool.

I implemented the 64-bit version in my zip library and it is 25%~30%
faster than the 32-bit version (slicing-by-four) used in zlib. Though both
versions use one table lookup and an XOR per byte, the number of
instructions the 64-bit version needs to process a fixed chunk (say 64 bits)
is smaller than for the 32-bit version, judging by the disassembly of the binary.
It would be good for zlib to investigate and adopt it, if the algorithm
described in the paper is in the public domain (is it?)

Qubeley

unread,
Sep 28, 2008, 12:41:19 AM9/28/08
to

Mark Adler

unread,
Sep 28, 2008, 1:29:52 AM9/28/08
to
On Sep 27, 9:34 pm, Qubeley <liangha...@gmail.com> wrote:
> It's good for ZLIB to investigate and adopt it, if the algorithm
> described in the paper is in public domain (is it?)

As far as I can tell, Intel makes no claims on the algorithm. Even if
they did, Rodney Brown's implementation pre-dates their papers, so I'm
not worried.

I may indeed incorporate it into a subsequent version of zlib.

Mark

Mark Adler

unread,
Sep 28, 2008, 2:54:51 AM9/28/08
to
On Sep 27, 9:41 pm, Qubeley <liangha...@gmail.com> wrote:
> Also, CRC will be part of SSE4 of Nehalem
>
> ftp://download.intel.com/technology/architecture/new-instructions-pap...

Dangnabit! I was excited for a minute there, but then discovered that
that instruction uses a fixed polynomial that *isn't* the same
polynomial that zip/gzip/zlib, etc. use.
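
For what it's worth, a minimal sketch of driving the new instruction
through the SSE4.2 intrinsics; it computes CRC-32C (the Castagnoli
polynomial 0x1EDC6F41), so the result will not match the zip/gzip/zlib
CRC-32:

/* Compile with e.g.: gcc -O2 -msse4.2 crc32c.c */
#include <stddef.h>
#include <nmmintrin.h>

/* CRC-32C via the SSE4.2 CRC32 instruction, byte at a time. */
static unsigned int crc32c_sse42(unsigned int crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;                            /* standard pre-conditioning            */
    while (len--)
        crc = _mm_crc32_u8(crc, *buf++);   /* _mm_crc32_u32/_u64 exist for wider fetches */
    return ~crc;                           /* standard post-conditioning           */
}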

I also found an interesting comment in the paper linked above:
"Improved performance for virus scan, text search, string processing
libraries like ZLIB, databases, compilers and state machine-oriented
applications."

Really? They never told me about it! There isn't anything planned
for zlib to take advantage of the new instructions.

Mark

Qubeley

unread,
Sep 28, 2008, 5:10:07 AM9/28/08
to

Exactly. I think it's using CRC-32C while ZLIB/GZIP/ZIP use CRC-32.
Maybe you can talk with Intel and tell them to add CRC-32 to SSE4 <grin>

So if CRC-32 doesn't appear in SSE4, then software-based CRC generators
will live on for a while. Future optimizations could leverage
multiple cores by processing slices in parallel, doing parallel table
lookups, etc., compared with the current sequential processing with many
general-purpose registers and one slice at a time. It'll be
interesting to see how this would perform compared with computing
multiple similar CRCs on different cores and then combining them
with crc32_combine()...
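
A minimal sketch of that last idea, assuming POSIX threads and zlib's
crc32()/crc32_combine(); the buffer, its contents and the split point are
made up for illustration (link with -lz -lpthread):

#include <pthread.h>
#include <string.h>
#include <zlib.h>

struct job { const unsigned char *buf; uInt len; uLong crc; };

static void *do_crc(void *arg)
{
    struct job *j = arg;
    j->crc = crc32(0L, j->buf, j->len);   /* checksum one slice */
    return NULL;
}

int main(void)
{
    static unsigned char data[1 << 20];   /* pretend this is the message */
    uInt half = sizeof data / 2;
    struct job a = { data, half, 0 }, b = { data + half, half, 0 };
    pthread_t ta, tb;
    uLong combined;

    memset(data, 0x5a, sizeof data);

    pthread_create(&ta, NULL, do_crc, &a);   /* one slice per core */
    pthread_create(&tb, NULL, do_crc, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    combined = crc32_combine(a.crc, b.crc, b.len);
    return combined == crc32(0L, data, sizeof data) ? 0 : 1;
}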


Daryle Walker

unread,
Oct 5, 2008, 2:39:30 AM10/5/08
to

I think that'll give the result I've already determined. I thought
about it in terms of what happens when the register is full of the original
message bits and what happens when new message bits are added. Then I
worked backwards to handle removing the first bits of the original message
until only the initial remainder remains, including the trivial
case of the I.R. being zero. (And I said that my idea already works
in code, but that isn't definitive, of course.)

Daryle Walker

jon...@gmail.com

unread,
Apr 22, 2014, 9:07:14 AM4/22/14
to
Dr. Adler's 2 hints weren't enough for me at first, but the solution came to me in the middle of the night. Here is a solution that accounts for a non-zero CRC seed (a pre-condition in the CRC register):

Consider two sequences whose CRCs we want to combine: A and B
Goal: calculate the CRC of AB.
Inputs to algorithm: CRC(A), CRC(B), length of B

Algorithm:
1. Generate sequence C = all zeros, same length as B
2. CRC(AB) = CRC(AC) ^ CRC(B) ^ CRC(C)

(^ means XOR)
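
A minimal sketch of this recipe using zlib's crc32(), generating the
all-zero sequence C on the fly; the helper name crc32_combine_simple() is
made up, and zlib's own crc32_combine() gives the same answer in
O(log len) rather than O(len):

#include <stdlib.h>
#include <zlib.h>

/* CRC(AB) from CRC(A), CRC(B) and the length of B, by feeding zero bytes. */
static uLong crc32_combine_simple(uLong crc_a, uLong crc_b, uInt len_b)
{
    unsigned char *zeros = calloc(len_b, 1);   /* sequence C: len_b zero bytes  */
    uLong crc_ac, crc_c;                       /* (allocation check omitted)    */

    crc_ac = crc32(crc_a, zeros, len_b);       /* CRC(AC): continue A over C    */
    crc_c  = crc32(0L, zeros, len_b);          /* CRC(C) on its own             */
    free(zeros);

    return crc_ac ^ crc_b ^ crc_c;             /* CRC(AB) = CRC(AC) ^ CRC(B) ^ CRC(C) */
}

int main(void)
{
    const unsigned char ab[] = "first partsecond part";   /* A = 10 bytes, B = 11 bytes */
    uLong ca = crc32(0L, ab, 10);
    uLong cb = crc32(0L, ab + 10, 11);

    return crc32_combine_simple(ca, cb, 11) == crc32(0L, ab, 21) ? 0 : 1;
}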