Migrate path for trove4j to hppc

Peter

unread,

Nov 19, 2016, 10:29:11 AM11/19/16

to High Performance Primitive Collections for Java

Hi,

we are using currently trove4j in GraphHopper and 3 years ago I measured a performance slow down (8%) when using hppc, which seems to be no longer the case :) - nice work! Now I would like to migrate away from trove4j due to license stuff, but also because of the bigger size of the project. See the issue here https://github.com/graphhopper/graphhopper/issues/34

Migration went smoothly as several concepts are very similar. Still I stumbled over the following problems:

IntArrayList has no fill, no shuffle, no binarySearch and no reverse method - is there a utility class which does this or parts of this? (should be medium work to workaround)
The method IntObjectHashMap.forEach does not return the boolean from apply but instead a reference to the Predicate. For what is this helpful? (should be easy to workaround)
The hash implementations always use 0 as default for 'empty' or not existent and we need it to be -1 as we have keys with 0.

Regarding 3: do you plan to support different default values? Or is there a workaround? I would just need it for TIntHashSet and IntObjectHashMap I think.

Regards

Peter

unread,

Nov 19, 2016, 4:44:21 PM11/19/16

to High Performance Primitive Collections for Java

Hi again,

could sorted out those issues.

re: 1. implemented fill, shuffle, reverse and used Arrays.binarySearch(array.buffer, 0, size(), value);

re: 2. counted successful returns and compared this at the end to the size of the map

re: 3. IntHashSet works without this and for IntObjectHashMap I don't need it

Some further issues I'm currently fighting with:

4. Now I stumbled over other issues when using IntObjectHashMap the iteration order seems to be randomized via the default HashOrderMixing. Now I don't like randomness as it makes testing hard and also production issues are harder to reproduce and would like to turn this off. Why is the deterministic deprecated and what does the javadocs mean with 'Use constant mixers only if you understand the potential consequences'?

5. iterator.remove is not implemented for IntHashSet - it throws UnsupportedOperationException - should be easy to workaround

6. missing method: list.toArray(indexFrom, indexTo) (I'll also create a helper method)

Regards

Peter

Dawid Weiss

unread,

Nov 21, 2016, 3:30:46 AM11/21/16

to java-high-performance-primitive-collections

Hi Peter,

> we are using currently trove4j in GraphHopper and 3 years ago I measured a
> performance slow down (8%) when using hppc, which seems to be no longer the
> case :)

You must have measured something incorrectly because there was hardly
a change in HPPC over these three years and I believe it very likely
had to be significantly (?) faster since its inception (because I knew
how trove4j was implemented). Anyway.

> IntArrayList has no fill, no shuffle, no binarySearch and no reverse method
> - is there a utility class which does this or parts of this? (should be
> medium work to workaround)

HPPC is very focused -- it provides the very basics that are used most
often. This results in smaller on-disk footprint, but of course the
drawback is that
there are missing functional elements. If you need full collection
support, try fastutil or koloboke -- they both provide them I believe.

> The method IntObjectHashMap.forEach does not return the boolean from apply
> but instead a reference to the Predicate. For what is this helpful? (should
> be easy to workaround)

It's so that you can access any field of an internal anonymous class
without giving it a name. From javac's perspective the returned type
has full signature of all fields
and methods, so you can invoke them without giving the type an explicit name.

> The hash implementations always use 0 as default for 'empty' or not existent
> and we need it to be -1 as we have keys with 0.

It's because zeroing is what JVM is doing by default. These small
differences mattered to us. Again -- if you need something different,
try fastutil or koloboke. Or use a method
that returns the default value (getOrDefault):

http://carrotsearch.github.io/hppc/releases/0.7.1/api/com/carrotsearch/hppc/IntObjectMap.html#getOrDefault-int-VType-

> 4. Now I stumbled over other issues when using IntObjectHashMap the iteration order seems to be randomized via the default HashOrderMixing. Now I don't like randomness as it makes testing hard and also production issues are harder to reproduce and would like to turn this off.

Don't. Relying on fixed hashing order for associative arrays (even in
tests) is a very weak practice. You should encounter errors in your
tests by explicitly randomizing this order, rather than fix it for
production just because you think it'll dodge the problem. It won't.
Besides, randomization serves an important purpose here (which can
result in very pool performance when merging associative arrays
key-by-key).

> Why is the deterministic deprecated and what does the javadocs mean with 'Use constant mixers only if you understand the potential consequences'?

Exactly what I mentioned above. See HPPC-115 issue entry in changes.txt.

> 5. iterator.remove is not implemented for IntHashSet - it throws UnsupportedOperationException - should be easy to workaround

It's better to use removeAll(IntPredicate predicate).

> 6. missing method: list.toArray(indexFrom, indexTo) (I'll also create a helper method)

Slice the buffer directly. We intentionally didn't implement such
convenience methods (for reasons given at the beginning).

D.

Karussell

unread,

Nov 21, 2016, 11:50:03 AM11/21/16

to java-high-performance...@googlegroups.com

Hi Dawid,

Thanks for the detailed answers! And regarding the tests: I'm pretty
sure I tested multiple times ... but maybe we changed stuff which caused
the changes ... anyway I do not care now :)

I did not choose fastutil as I would have an additional step to make it
smaller but will try koloboke again. To be honest: I just started with
hppc because I know you as an lucene contributor and also know that hppc
is used in ElasticSearch.

What do you think regarding community size and activity. Would you bet
on koloboke?

> Don't. Relying on fixed hashing order for associative arrays (even in
tests) is a very weak practice

I'm not sure I understood the issue. Will this problem not only occur if
we merge two hash/maps (we don't have this use case)? Or will it also
occur for resizing?
Does trove or koloboke also suffer under this issue? Never heard of this
before and for us reproducability is VERY important. Is there a way to
get both :) ?

Kind Regards
Peter

Dawid Weiss

unread,

Nov 22, 2016, 3:28:12 AM11/22/16

to java-high-performance-primitive-collections

Hi Peter,

> I did not choose fastutil as I would have an additional step to make it
> smaller but will try koloboke again. To be honest: I just started with
> hppc because I know you as an lucene contributor and also know that hppc
> is used in ElasticSearch.

Thanks. :) I think ES uses HPPC only in very marginal cases though. We
published HPPC because it provided a nice tradeoff between speed and the
size of the library (compared to fastutil, for example). Since then a few
other libraries showed up. Koloboke is quite nice, it even permits you
to generate
only the collection classes you're interested in. Check it out. I maintain a
list of alternatives sporadically, it's here:

https://github.com/carrotsearch/hppc/blob/master/ALTERNATIVES.txt

> What do you think regarding community size and activity. Would you bet
> on koloboke?

Oh, I don't think there is any significant community. It's not that
type of project.
Cars have communities, their nuts and bolts rarely do :)

> I'm not sure I understood the issue. Will this problem not only occur if
> we merge two hash/maps (we don't have this use case)?

It can occur when merging or when you're downsizing an existing map
(by copying to an empty one, for example). Too much to explain here,
but if you're interested
you can look at commit history and check out the tests that show you
specifically
what can happen and when.

This problem only affects associative data structures with linear
probing, so it's very specific.
But then this type of keys redistribution is also the fastest one out
there. So there's the gain
and there's the pain.

Use randomized ordering, by the way. Reproducibility and randomization
can go together if you're
using a sequencer as an initial seed for key mixers. This is the
default in HPPC:

https://github.com/carrotsearch/hppc/blob/master/hppc/src/main/java/com/carrotsearch/hppc/RandomizedHashOrderMixer.java

Obviously if there are multithreaded data races the outcome will not
be predictable, but this would be the same case
with a constant key redistribution too.

Dawid

> --
> You received this message because you are subscribed to the Google Groups "High Performance Primitive Collections for Java" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to java-high-performance-primi...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Karussell

unread,

Nov 28, 2016, 6:22:52 AM11/28/16

to java-high-performance...@googlegroups.com

Hi Dawid,

thanks for the explanations! I'll try to digg through the commits when I find some time :)

Do you think that trove4j suffers from this hash mixing problem in its latest version?

> It can occur when merging or when you're downsizing an existing map

And when I make the seed "size dependent" (like you suggested somewhere in the code) this should be no issue anymore, or did I overlook something?

> RandomizedHashOrderMixer

This still does not work. E.g. when I'm executing a single test this results in a different order compared to running this same test within a test suite where the counter was already increased.

Kind Regards
Peter

Dawid Weiss

unread,

Nov 28, 2016, 7:33:44 AM11/28/16

to java-high-performance-primitive-collections

> Do you think that trove4j suffers from this hash mixing problem in its
> latest version?

Don't know, didn't track the changes closely.

> And when I make the seed "size dependent" (like you suggested somewhere in
> the code) this should be no issue anymore, or did I overlook something?

Size-dependent mixers suffer from the same issue, unfortunately.

>> RandomizedHashOrderMixer
>
> This still does not work. E.g. when I'm executing a single test this results
> in a different order compared to running this same test within a test suite
> where the counter was already increased.

Everything is reproducible globally if you start form the same state.
So yes, a single test won't reproduce in the same way in isolation
compared to executing it from a global suite. If your suite fails
during tests you should be able to reproduce the problem by running
the entire suite (and then probably random-seed a single test to check
for regressions). You can also use any other mixer you like, but I'd
argue that enforcing any global state that can result in conflicting
key sequences is probably a very bad idea (much better to fix the code
to never, absolutely never, rely on hash map ordering).

Dawid

Karussell

unread,

Nov 28, 2016, 12:40:56 PM11/28/16

to java-high-performance...@googlegroups.com

> (much better to fix the code to never, absolutely never, rely on hash map ordering).

We do not rely on the order. So our algorithms would be still correct. But they wouldn't produce identical results. E.g. it can easily happen that two routes have exactly the same distance (even in real world!) and with one hash ordering we return one path of those optimal solution, and another hash ordering produces a different path.

To be honest: I still do not see this problem for us (and as we used trove4j which highly likely suffers from the same issue), but highly likely I have not fully understood the issue but will read about it more. Currently I see it as follows: we do not use two hash maps to create a third one and if I use the "size dependent" hash mixing strategy I can easily grow my hash maps without trapping into this problem for the same hash map.

Again, I'll investigate this more - maybe we get some performance boosts with random seeds which would show that we already have/had a problem here :)

> So yes, a single test won't reproduce in the same way in isolation compared to executing it from a global suite.

But this makes no sense to me. A test should return the same stuff in isolation, as it does for a test suite. Otherwise it is highly confusing.

Kind Regards
Peter

Dawid Weiss

unread,

Nov 28, 2016, 3:28:56 PM11/28/16

to java-high-performance-primitive-collections

> We do not rely on the order. So our algorithms would be still correct. But
> they wouldn't produce identical results. E.g. it can easily happen that two
> routes have exactly the same distance (even in real world!) and with one
> hash ordering we return one path of those optimal solution, and another hash
> ordering produces a different path.

That's exactly what I'm saying -- if this is the case then you are
relying on map ordering and your algorithm's result
is predictably non-deterministic. If the two routes have an equivalent
score then returning any of them as a result should
be an acceptable output (and the tests should take this into
account!). If consistency is part of the returned output's contract
then all pareto-optimal solutions should be sorted to a secondary
criterion (say, the ordering of node IDs or something), so that the
output is always consistent, regardless of intermediate ordering of
associative data structures.

> To be honest: I still do not see this problem for us (and as we used trove4j
> which highly likely suffers from the same issue),

This isn't something that is very likely, but if it bites, it'll hang
your application at runtime.

> Again, I'll investigate this more - maybe we get some performance boosts
> with random seeds which would show that we already have/had a problem here
> :)

It'd be more likely that you get more uniform, but a bit slower runs.

>> So yes, a single test won't reproduce in the same way in isolation
>> compared to executing it from a global suite.
>
> But this makes no sense to me. A test should return the same stuff in
> isolation, as it does for a test suite. Otherwise it is highly confusing.

A test that is independent of any global state would return the same
result. A test that relies on some global state (as in mutable static
test fixtures, system properties, file system state, you name it) may
not be reproducible with the same seed. People are often locked
in to the notion of "identical" test runs, which is a bad assumption
in general. Think of test method ordering in Java 8 that all of a
sudden stopped being predictable -- many tests crashed just because
the global state wasn't accessed in the same order.

Yes, it is true that random hash map mixers make it very difficult to
create an immutable test fixture (or a repeatable test run detached
from its global state). But then again -- I'd embrace this
randomness rather than fight with it. If your tests run with changing
map/set iteration order, they'll be more rebust.

Look at Lucene tests -- they shuffle stuff (components, locales, time
zones) underneath all tests in all sorts of ways and occasionally
things break. But it's quite rare that something is not reproducible
at all (once it's narrowed down, you can typically write a repro for
it). If something non-reproducible happens, it's typically a VM
problem...

Dawid

Reply all

Reply to author

Forward