Interesting Deflate source

3 views
Skip to first unread message

encode

unread,
Apr 20, 2008, 3:31:00 PM4/20/08
to encode_ru_f...@googlegroups.com


Look at this source:
http://cm.bell-labs.com/sources/plan9/sys/src/libflate/deflate.c

Especially, I was interested in a hash function. This implementation uses multiplicative hashing:
 
#define hashit(c) (((ulong)(c)&0xffffff)*0x6b43a9b5)>>(32-HashLog))


The value 0x6b43a9b5 (1799596469) is not even a prime, but I tested such constant with BALZ for 3-byte hashing and it is one of the best so far.

Shelwien

unread,
Apr 20, 2008, 4:01:00 PM4/20/08
to encode_ru_f...@googlegroups.com


Why don't you tune your hash to the data, instead of using some random values, prime or not?

encode

unread,
Apr 20, 2008, 4:34:00 PM4/20/08
to encode_ru_f...@googlegroups.com


Quoting: Shelwien

Why don't you tune your hash to the data

I do. I carefully test the multiplier on the real files, collecting various performance data! ;)

Bulat Ziganshin

unread,
Apr 20, 2008, 4:53:00 PM4/20/08
to encode_ru_f...@googlegroups.com


Quoting: encode
The value 0x6b43a9b5 (1799596469) is not even a prime,

the value shouldn't be a prime. primeness is just some sign of rather good multipliers i think that this multiplier will be not much different to your one. actually, once i've tested your prime against ideal hashing and found a very little difference

multiplication scheme has only one serious cabeat - highest bits of hashed value affects only highest bits of resulting hash value. smth like this:

(x+(x>>16))*123456791

may improve hashing quality

Shelwien

unread,
Apr 20, 2008, 5:50:00 PM4/20/08
to encode_ru_f...@googlegroups.com


> highest bits of hashed value
> affects only highest bits of resulting hash value. smth like this:
> (x+(x>>16))*123456791
> may improve hashing quality

(x+(x>>16)) ~= (x+x/65536) = x*(65537/65536)
x*(65537/65536)*123456791 = x*123458674.801132
So I'm not so sure about that.


>> Why don't you tune your hash to the data
> I do. I carefully test the multiplier on the real files,
> collecting various performance data! ;)

Somehow I think that you just manually try out and compare some
selected hash functions/parameters.

And I was talking about
1. Log the memory access pattern for some dataset
with unhashed contexts for addresses
(reads/writes; also other parameters like value size
and offset in hash cell)
2. Measure the processing speed for all 2^32 multiplier values
(or including other hash functions too), keeping some number
of best results.
3. Test the real compressor performance with these stored
best hash configurations.

encode

unread,
Apr 20, 2008, 6:05:00 PM4/20/08
to encode_ru_f...@googlegroups.com


Quoting: Shelwien

Somehow I think that you just manually try out and compare some
selected hash functions/parameters.

Actually, in my case the hash function is not so extremely important. I hash 24-bit value to 1M hash HEAD. It's more than enough number of hash buckets. Since I use hash chains, hash function should do good enough separation and spread all 16M values to 1M hashes - to minimize hash chain branches. I kept multiplicative hashing. I tested the multiplier in next manner - I keep only HEAD for string searching (i.e. no further traverse on PREV) and measure the compression ratio on various files.
:)

Shelwien

unread,
Apr 20, 2008, 6:33:00 PM4/20/08
to encode_ru_f...@googlegroups.com


Well, I was talking about hash speed.
And that depends mostly not on the distribution of indexes,
but on the alignment and clustering of accesses.
Your explanation didn't contain anything about that.

Bulat Ziganshin

unread,
Apr 20, 2008, 6:57:00 PM4/20/08
to encode_ru_f...@googlegroups.com


Quoting: Shelwien
(x+(x>>16)) ~= (x+x/65536) = x*(65537/65536)
x*(65537/65536)*123456791 = x*123458674.801132
So I'm not so sure about that.


why? it solves problem of distributing influence of higher value bits among all hash bits

Shelwien

unread,
Apr 20, 2008, 7:03:00 PM4/20/08
to encode_ru_f...@googlegroups.com


I mean it doesn't look any different from simply changing the multiplier value.

toffer

unread,
Apr 21, 2008, 8:54:00 AM4/21/08
to encode_ru_f...@googlegroups.com


Here i really like Matt's solution, since it obviously does what you want.

Bulat Ziganshin

unread,
Apr 21, 2008, 9:30:00 AM4/21/08
to encode_ru_f...@googlegroups.com


you may cosider this as an idea of using non-integer multipliers :)

Reply all
Reply to author
Forward
0 new messages