A Measure of Hash Function Collisions

384 views
Skip to first unread message

Serhat Şevki Dinçer

unread,
Feb 4, 2019, 2:48:52 PM2/4/19
to golang-nuts
Hi,

What is the minimum sum of input lengths len(s1)+len(s2) such that:

- s1, s2 are distinct utf8 strings
- Txt2int(s1) = Txt2int(s2)

for the below simple hash function ?

func Txt2int(s string) uint64 {
x := uint64(len(s))
for i := len(s) - 1; i >= 0; i-- {
x = x<<15 ^ x>>49
x += uint64(s[i])
}
return x
}

In general, how good is "minimum sum of colliding input lengths" as a measure of collision-resistance for a (not necessarily cryptographic) hash function ?

Thanks..

Michael Jones

unread,
Feb 4, 2019, 4:29:32 PM2/4/19
to Serhat Şevki Dinçer, golang-nuts
There are many measures. One realm of them focus on the mixing properties of the design as would be a consideration in cypher systems. The other “experimentalist” realm considers how the hash performs compared to an ideal hash function. 

The latter approach is suitable for a broader range of developers. A google search for measuring hash quality will quickly lead to expectations about collisions from a random variable and also to equations for statistical measures. 

I recently did just this in an effort (successful!) to make a well-known simple hash function be its best with minor single CPU cycle changes. 

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Michael T. Jones
michae...@gmail.com

Serhat Şevki Dinçer

unread,
Feb 7, 2019, 2:03:32 PM2/7/19
to golang-nuts
On Tuesday, February 5, 2019 at 12:29:32 AM UTC+3, Michael Jones wrote:
I recently did just this in an effort (successful!) to make a well-known simple hash function be its best with minor single CPU cycle changes.

yes I am told 15 is not the best shift amount, how about this?
x = x<<27 ^ x>>37

the upper limit of length sum could be 9, because hashes of 9-byte valid utf8 strings are less than 255^9, on average ~247 per uin64 value, so one such uint64 value most likely will be zero which is the hash of empty string. so you get a length sum of 9. can you bring it down? or can we "design" a hash to guarantee 9?

Michael Jones

unread,
Feb 8, 2019, 3:57:28 AM2/8/19
to Serhat Şevki Dinçer, golang-nuts
one quick result:

celeste:spin10 mtj$ spin10 -a 0,63 -b 0,63 -bins 32 -rotate ab -set 0,0,0,0 -samples 100000 -trials 10
[000000000] best 5.683527373827505613e+01  8  0  0  0
[000000001] best 5.508690460484671547e+01  8  1  0  0
[000000002] best 5.434519430630660253e+01  8  2  0  0
[000000004] best 5.422007935279938806e+01  8  4  0  0
[000000006] best 5.280780313359068856e+01  8  6  0  0
[000000011] best 5.218225799693001932e+01  8 11  0  0
[000000015] best 5.112439186124714041e+01  8 15  0  0
[000000035] best 4.965170712324208324e+01  8 35  0  0
[000000040] best 4.907791674503666002e+01  8 40  0  0
[000000043] best 4.693620797201791817e+01  8 43  0  0
[000000324] best 4.608481026758963850e+01 13  4  0  0
[000000481] best 4.460801497488986911e+01 15 33  0  0
complete: 3648 trials, 75.095 seconds, 48.578 trials/sec

...says that in one particular test condition (8 character strings, 1M random strings, all possible shift values)
and under one particular metric of virtue...

x = x<<15 ^ x>>33

...gives the closest overall approximation to a random result. you might try that in your application to see how well it works for you. that figure of merit (44.608) is not particularly high but not terrible. A modification of FNV1a can get to 22 (almost: 2.200000000000000355e+01) here.


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Serhat Sevki Dincer

unread,
Feb 8, 2019, 4:57:06 AM2/8/19
to golang-nuts, michae...@gmail.com
8 Şub 2019 Cum 11:57 tarihinde Michael Jones <michae...@gmail.com> şunu yazdı:
...says that in one particular test condition (8 character strings, 1M random strings, all possible shift values)
and under one particular metric of virtue...

x = x<<15 ^ x>>33

...gives the closest overall approximation to a random result. you might try that in your application to see how well it works for you. that figure of merit (44.608) is not particularly high but not terrible. A modification of FNV1a can get to 22 (almost: 2.200000000000000355e+01) here.

Hm, so overlapping shifts help get better mixing. As a hash function gets better at mixing bits, it will almost surely have a collision (due to birthday paradox) with two inputs with lengths <= 4.

Can we artificially guarantee a minimum sum of colliding (utf8 string) inputs that is > 8 ?

I think such a hash needs to satisfy the following:

1) Number of valid utf8 strings with length <= 8 is less than (255^9 -  1)/254 which is less than 2^64
So if the hash maps these strings to distinct uint64 values, then minimum sum of colliding input lengths is at least 9. Can it be > 9 ?

2) Be non-trivial. For example, this hash satisfies 1):
if len(input) < 8, pad with zeros up to total 8 bytes and return
otherwise return first 8 bytes of input

this also satisfies 1):
Treat input as integer in base 255, return input mod 2^64

3) have equidistribution for overall inputs

Instead of statistical methods, we can maybe brute search among parametric simple hash functions for one that satisfies these 3 properties? Or explicitly design one? 

Michael Jones

unread,
Feb 8, 2019, 11:43:09 AM2/8/19
to Serhat Sevki Dincer, golang-nuts
clustering:

careful hash functions often treat short inputs specially.

iterated shift-xor alone is weak in expanding the "changed bits(s)" signal, at least by comparison to a) large prime multiply, b) good s-boxes, c) introduction of keyed material.

collisions will happen, and no general hash will attain the full period possibility. yes a perfect hash is possible but not to my knowledge for general strings.

strings tend to be very non-random (a kind of zipf's law for word spelling/product names/text...) so there is a difficult demand for the hash function. in spirit you want to encrypt and carefully fold down to tablesize.

utf-8/16/32 strings are somewhat sparse...so you don't have as many bits of input material as you may want.

searching through parameter space works very well for many hash and similar goals. important to know how to model the input and measure output quality.

Serhat Sevki Dincer

unread,
Feb 9, 2019, 9:23:14 AM2/9/19
to Michael Jones, golang-nuts
On Fri, Feb 8, 2019 at 7:42 PM Michael Jones <michae...@gmail.com> wrote:
> clustering:
> https://www.cs.cornell.edu/courses/cs3110/2014fa/lectures/13/lec13.html
>
> careful hash functions often treat short inputs specially.
>
> iterated shift-xor alone is weak in expanding the "changed bits(s)" signal, at least by comparison to a) large prime multiply, b) good s-boxes, c) introduction of keyed material.
Hm, thanks. I would like to try a particular version of this prime
multiplication idea wih utf8 strings.
I wrote for loops to find collisions (attached). It takes 3 seconds
and finds no collision for strings with length < 4.
The next step (including length=4, commented in the code) will take
13+ min and 32+ gb ram, so I appreciate if anyone with sufficient RAM
could try that and report the result ;P
Thanks..
txt2int.go

Serhat Şevki Dinçer

unread,
Feb 12, 2019, 8:23:39 AM2/12/19
to golang-nuts
On Saturday, February 9, 2019 at 5:23:14 PM UTC+3, Serhat Şevki Dinçer wrote:
I got access to a server with 64 gb ram. On it, using a Go map did not work, so I used a list and sorted it for identifying collisions.
It turned out both prime and shift versions (attached) of the simple hash turned out to be really good for small inputs. No collisions for 7_918_845_952 inputs.
This required 59 GiB of ram. For a 64-bit cryptographic hash output, the probability of a collision for that many inputs is about %82.

What I am curious about next is
- How further can this test be taken ? When are the first collisions for these simple hashes with the given code ?
- What are their "minimum sum of colliding inputs lengths" ?
- Is there a (smooth) trade-off between being cryptographic / good-bit-mixing and being nice on small inputs / high minimum sum of colliding inputs lengths ? Assume that we dont treat small inputs differently, and there is one general algorithm for all inputs..
- Can a cryptographic hash function have high minimum sum of colliding inputs lengths, or be nice on small inputs ?

Thanks..
txt2int2.go

Damian Gryski

unread,
Feb 12, 2019, 12:23:33 PM2/12/19
to golang-nuts
For more information on hash function goodness tests, check out https://github.com/demerphq/smhasher

Damian

Michael Jones

unread,
Feb 12, 2019, 1:51:17 PM2/12/19
to Damian Gryski, golang-nuts
Serhat, some ideas for you...

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Serhat Şevki Dinçer

unread,
Feb 13, 2019, 3:20:08 PM2/13/19
to golang-nuts
On Tuesday, February 12, 2019 at 9:51:17 PM UTC+3, Michael Jones wrote:
Serhat, some ideas for you...

Interestingly I found out input iteration order does matter :) 15,33 shift version yields an amazing number of collisions (7_910_886_368 collisions over 7_918_845_952 inputs) when fed with a spesific 6-byte sequence (attached).
rotate-27 version interestingly gave no collisions for this case. prime version also had no collisions. prime version with ^= and -= instead of += also gave no collision.

I want to identify a collision on the prime version (any variant ^= -= +=) with small inputs, say sum of lengths <= 12. Does anyone have any idea how to identify such inputs?
So far prime version seems to be the most promising (in terms of high minimum sum of collding inputs)..

Thanks..

Note: Thanks Damian, it looks like an advanced test suite for cryto hashes, though text input tests seems limited.
txt2int3.go

Michael Jones

unread,
Feb 13, 2019, 5:58:31 PM2/13/19
to Serhat Şevki Dinçer, golang-nuts
Serhat,

Some more ideas for you to consider: the expected number of collisions for an ideal random hash, the option of "folding in" the high bits of the hash rather than truncating, and finer control of operation.


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Serhat Sevki Dincer

unread,
Feb 14, 2019, 1:11:47 AM2/14/19
to Michael Jones, golang-nuts
14 Şub 2019 Per 01:58 tarihinde Michael Jones <michae...@gmail.com> şunu yazdı:
Serhat,

Some more ideas for you to consider: the expected number of collisions for an ideal random hash, the option of "folding in" the high bits of the hash rather than truncating, and finer control of operation.


Hi Michael,
Here with your variant:

x = uint64(1<<64-59) ^ uint64(len(s))

for i := len(s) - 1; i >= 0; i-- {
x ^= uint64(s[i])
x *= 11400714819323198549
}

// fold high bits

you have a direct collision with input length's lowest byte (which is practically the length) and input's first byte. This is better for initialization:

x = uint64(len(s))
x *= 11400714819323198549

at the cost of an extra multiplication. 

Folding high bits does not change collision characteristic, right?
Also when the last operation is a multiplication, you don't seem to need it, if at all it is useful. 

Michael Jones

unread,
Feb 14, 2019, 2:24:49 AM2/14/19
to Serhat Sevki Dincer, golang-nuts
good observations.

here is a help for you... your program modified to do any size computation in any size of memory
(so you need not look for that special machine)

Serhat Sevki Dincer

unread,
Feb 19, 2019, 6:15:52 AM2/19/19
to golang-nuts, Chris Burkert, Michael Jones
Hi,

Chris kindly searched collisions on his 6 TiB ram server and we could
not find any for more than 5 x 2^37 inputs (for both + and ^ versions)
! Final version of the hash is available at
https://github.com/jfcg/sixb

Let me know if you find one ;)

On Tue, Feb 19, 2019 at 10:44 AM Chris Burkert <burker...@gmail.com> wrote:
> Good Morning Serhat, here it is:
>
> # time ./prime2x
> prime2x xor: 0 collisions
>
> real 355m51.484s
> user 12490m36.231s
> sys 4079m42.192s
>
>>>> Am Mo., 18. Feb. 2019 um 08:47 Uhr schrieb Serhat Sevki Dincer <jfcg...@gmail.com>:
>>>>> Great, can you also try xor version?
>>>>> I wonder if it will have any collisions.
>>>>>
>>>>> I made a tiny improvement to sorting here github.com/jfcg/sorty
>>>>>
>>>>> Thanks..
>>>>>
>>>>> 18 Şub 2019 Pzt 10:42 tarihinde Chris Burkert <burker...@gmail.com> şunu yazdı:
>>>>>> Hello Serhat,
>>>>>> this time it ran fine:
>>>>>>
>>>>>> # time ./prime2m
>>>>>> prime2m add: 0 collisions
>>>>>>
>>>>>> real 350m28.446s
>>>>>> user 12765m29.456s
>>>>>> sys 5997m49.762s
>>>>>>
>>>>>> cheers
>>>>>> Chris
>>>>>>
>>>>>> Am Mo., 18. Feb. 2019 um 08:11 Uhr schrieb Serhat Sevki Dincer <jfcg...@gmail.com>:
>>>>>>> Thanks, how is the new code doing in terms of cpu / ram usage?
>>>>>>> Is it still running?
prime2x.go
Reply all
Reply to author
Forward
0 new messages