strange behaviour

11 views
Skip to first unread message

Matthias Zengler

unread,
Jul 5, 2011, 7:46:22 AM7/5/11
to High Performance Primitive Collections for Java
hello,

i've encountered a problem with HPPC today. I used the
IntIntOpenHashMap an put about 1.6 Million values.
Namely the Hashcodes of Strings using this way:

int tmp = Arrays.hashCode(stringToHash.getBytes());
hashMap.put(tmp, value);

For 1.6M values i got approx. 1400 collisions. But when i use the
string instead of it's hashcode in an ObjectIntHashMap<String> i got 0
collisions.

I'm not sure whether this is a bug or intended.

Greetings,
Matthias

Dawid Weiss

unread,
Jul 6, 2011, 3:16:48 AM7/6/11
to java-high-performance...@googlegroups.com

Can you post a junit test or code that you used to measure this? Doesnt look like a bug but i will look into it.

Matthias Zengler

unread,
Jul 6, 2011, 9:51:11 AM7/6/11
to High Performance Primitive Collections for Java
i have a file with rougly 1.6M entrys to hash. This entrys where
sorted and uniqed before. with wc -l file i got the number of lines
this number differs from actual Hash size and i write an output when
the node which is returned by the put function is not 0, which means,
that there was a node with a certain ID. atm i can't post the code
because it is in the office. i hope this will help

greetings

On 6 Jul., 09:16, Dawid Weiss <dawid.we...@gmail.com> wrote:
> Can you post a junit test or code that you used to measure this? Doesnt look
> like a bug but i will look into it.
> On Jul 5, 2011 1:56 PM, "Matthias Zengler" <matthias.zeng...@googlemail.com>

Dawid Weiss

unread,
Jul 7, 2011, 3:56:41 AM7/7/11
to java-high-performance...@googlegroups.com
I'm sorry, but I don't follow: if your input file has been uniqued,
the hash map's size should be identical to wc -l. The number of
collisions is something internal to the implementation -- equality
relationship will allow entries with a hash collision to coexist in
the map (hash function is only used to distribute entries to minimize
collisions, not avoid them).

The actual number of collisions can differ, especially if you use
strings and ints as keys because they use different hashes. To
actually calculate the number of collisions you'd need to alter or
track the internals of the implementation though (loops looking for an
empty or occupied slot).

Dawid

Dawid Weiss

unread,
Jul 10, 2011, 6:42:02 AM7/10/11
to java-high-performance...@googlegroups.com
I'm back from vacation. I think you did make a logical mistake here
and created a hash map of hashes (ints) and a hashmap of strings --
these are not the same thing. The 1400 is not the number of
"collisions" in the hashmap, but the number of pseudo-collisions in
the Arrays.hashCode(...) -- identical hashes for apparently different
strings.

The number of actual slot collisions in HPPC can be measured as well,
but you'd need to peek inside the implementation and alter slot lookup
functions.

Dawid

Matthias Zengler

unread,
Jul 12, 2011, 4:46:01 AM7/12/11
to High Performance Primitive Collections for Java
Hey Dawid,

thanks for your answers(and HPPC, we love it!).
We thought the same way, so we first write all hashcodes of
Arrays.hashCode() to a file, sort uniq -d them to identifiy
the strings where the hashCode is the same, but they where uniq for
all ints, no doublings occur to this point.
Nevertheless your are right, collisions might occur in every hash
function, so we used the string option with no collisions and it works
fine.

Matthias

Dawid Weiss

unread,
Jul 12, 2011, 5:07:34 AM7/12/11
to java-high-performance...@googlegroups.com
> thanks for your answers(and HPPC, we love it!).

Glad it's useful for someone!

> We thought the same way, so we first write all hashcodes of
> Arrays.hashCode() to a file, sort uniq -d them to identifiy
> the strings where the hashCode is the same, but they where uniq for
> all ints, no doublings occur to this point.

But they were hashcodes of the strings and when you create an Int*Map
then the input keys (ints) are re-hashed again and this may cause new
collisions. This additional rehashing is useful to randomize bad input
keys distributions and provides gains in practical applications
(experience speaking here, no proofs).

Dawid

Reply all
Reply to author
Forward
0 new messages