ImmutableSet performance worse tham HashSet

Noble

unread,

Jan 11, 2010, 4:41:48 AM1/11/10

to guava-discuss

My tests show that HashSet is ~250% faster than immutableSet

and
Size:
Collections.unmodifiableMap: 7.4% bigger than HashMap
google immutable map: 22.4% bigger than HashMap

Speed:
Collections.unmodifiableMap: 4.2% slower than HashMap
google immutable map: 26.0% slower than HashMap

is it expected?

Robert Konigsberg

unread,

Jan 11, 2010, 4:49:40 AM1/11/10

to guava-...@googlegroups.com

Please show your tests! That is interesting. Also, which VM / OS are you using?

> --
> guava-...@googlegroups.com.
> http://groups.google.com/group/guava-discuss?hl=en
> unsubscribe: guava-discus...@googlegroups.com
>
> This list is for discussion; for help, post to Stack Overflow instead:
> http://stackoverflow.com/questions/ask
> Use the tag "guava".
>
>

--
Robert Konigsberg
konig...@gmail.com

Dimitris Andreou

unread,

Jan 11, 2010, 5:13:12 AM1/11/10

to guava-...@googlegroups.com

Hi

Firstly, you might also want to check this thread:
http://groups.google.com/group/google-collections-users/browse_thread/thread/9bf5db0f507f7fa7/2b8fa3f34b3e69b3

2010/1/11 Noble <noble...@gmail.com>:

> My tests show that HashSet is ~250% faster than immutableSet
>
> and
> Size:
> Collections.unmodifiableMap: 7.4% bigger than HashMap

You mean memory footprint? 7.4% doesn't make sense, unless you are
talking about very tiny HashMaps. Collections.unmodifiableMap adds a
negligible memory footprint (12 bytes from the top of my head)

> google immutable map: 22.4% bigger than HashMap
>
> Speed:
> Collections.unmodifiableMap: 4.2% slower than HashMap
> google immutable map: 26.0% slower than HashMap
>
>
> is it expected?

Not really :) but then, since you are not saying what exactly you are
measuring, all bets are off.

Dimitris

Noble

unread,

Jan 11, 2010, 6:42:34 AM1/11/10

to guava-discuss

This is a simple program http://gist.github.com/274175 for Set.

-

On Jan 11, 3:13 pm, Dimitris Andreou <jim.andr...@gmail.com> wrote:
> Hi
>

> Firstly, you might also want to check this thread:http://groups.google.com/group/google-collections-users/browse_thread...
>
> 2010/1/11 Noble <noble.p...@gmail.com>:

Dimitris Andreou

unread,

Jan 11, 2010, 7:09:57 AM1/11/10

to guava-...@googlegroups.com

It's hard to say since the test is very incomplete. Both sets have an
underlying table of length=64, and use the same different hash, but
still, in the example you picked:
- In HashSet, "StrField.class" is discovered immediately
- In ImmutableSet, it collides with others, and only with the 4th
attempt is "StrField.class" discovered.

Weird, from a cursory look, it seems that ImmutableSet ought to
simulate exactly a HashSet, using the same hash, table size and all.

Also, there are some problems like including the time to do a println
(for HashSet) in the time measured for ImmutableSet, and that
currentTimeMillis() is used instead of nanoTime(), but anyway,
something is strange here.

Dimitris

2010/1/11 Noble <noble...@gmail.com>:

Dimitris Andreou

unread,

Jan 11, 2010, 7:13:13 AM1/11/10

to guava-...@googlegroups.com

Sorry, I wrote the first sentence while I thought that it happened
that the particular example chosen was a bad one - only afterwards I
realized that these should behave equivalently since they used the
same hash, forgot to remove it.

2010/1/11 Dimitris Andreou <jim.a...@gmail.com>:

Kevin Bourrillion

unread,

Jan 11, 2010, 1:10:36 PM1/11/10

to guava-...@googlegroups.com

Noble,

I'm sorry to have to tell you the sad news that simple, common-sense test programs like this, for measuring performance of a Java library, are absolutely inconclusive. We will be publishing a statistically rigorous assessment of ImmutableSet performance shortly (for some definition of "shortly"). In fact, we did extensive testing of this before ImmutableSet was released, but rather than just refer to those ancient numbers, please be patient to hear fresh results.

Kevin Bourrillion @ Google
internal: http://go/javalibraries
external: guava-libraries.googlecode.com

fishtoprecords

unread,

Jan 11, 2010, 7:06:47 PM1/11/10

to guava-discuss

On Jan 11, 1:10 pm, Kevin Bourrillion <kev...@google.com> wrote:
> I'm sorry to have to tell you the sad news that simple, common-sense test
> programs like this, for measuring performance of a Java library, are
> absolutely inconclusive.

I assume that this is because short tests don't let the JVM hotspot
optimizer do its thing. Is there more?

From what I've seen in real world testing, the recent JVMs are quite
impressive at all sorts of optimization, from code inline-ing, loop
unrolling, constant propagation to fancier stuff.

I'm interested in the larger scale optimizations and timings, I did a
fair amount of work in that area when I was in grad school.

A related question that I don't yet have an empirical answer to is:

what rule of thumb should be used in selecting between a HashSet-style
implementation and a TreeSet implementation. While the trivial answer
is that HashSet is O(1) and TreeSet is O(ln N), with 1 < ln N for all
meaningful cases, in practice the constants for the O(1) function seem
fairly high, and since the TreeSet, as a red/black tree is optimal, I
suspect that the naive assumption is not accurate in the real work.

Noble

unread,

Jan 12, 2010, 12:50:03 AM1/12/10

to guava-discuss

Hi Kevin ,
Here is a more elaborate and thorough test for the map .
https://issues.apache.org/jira/secure/attachment/12429895/TestPerf.java
and the we were tracking it in the issue https://issues.apache.org/jira/browse/SOLR-1707

I was planning to incorporate google collections to Apache Solr
because everyone claimed it faster and more efficient. (For us
performance is everything ). If the performance is comparable it is
acceptable because google collections offer a neat API.

when you say the test is inconclusive, is there a reason to suggest
what we observe in test will not be same in production as well?

On Jan 11, 11:10 pm, Kevin Bourrillion <kev...@google.com> wrote:
> Noble,
>
> I'm sorry to have to tell you the sad news that simple, common-sense test
> programs like this, for measuring performance of a Java library, are
> absolutely inconclusive. We will be publishing a statistically rigorous
> assessment of ImmutableSet performance shortly (for some definition of
> "shortly"). In fact, we did extensive testing of this before ImmutableSet
> was released, but rather than just refer to those ancient numbers, please be
> patient to hear fresh results.
>

> On Mon, Jan 11, 2010 at 4:13 AM, Dimitris Andreou <jim.andr...@gmail.com>wrote:
>
>
>
>
>
> > Sorry, I wrote the first sentence while I thought that it happened
> > that the particular example chosen was a bad one - only afterwards I
> > realized that these should behave equivalently since they used the
> > same hash, forgot to remove it.
>

> > 2010/1/11 Dimitris Andreou <jim.andr...@gmail.com>:

> > > It's hard to say since the test is very incomplete. Both sets have an
> > > underlying table of length=64, and use the same different hash, but
> > > still, in the example you picked:
> > > - In HashSet, "StrField.class" is discovered immediately
> > > - In ImmutableSet, it collides with others, and only with the 4th
> > > attempt is "StrField.class" discovered.
>
> > > Weird, from a cursory look, it seems that ImmutableSet ought to
> > > simulate exactly a HashSet, using the same hash, table size and all.
>
> > > Also, there are some problems like including the time to do a println
> > > (for HashSet) in the time measured for ImmutableSet, and that
> > > currentTimeMillis() is used instead of nanoTime(), but anyway,
> > > something is strange here.
>
> > > Dimitris
>

> > > 2010/1/11 Noble <noble.p...@gmail.com>:
> > >> This is a simple programhttp://gist.github.com/274175for Set.

>
> > >> -
>
> > >> On Jan 11, 3:13 pm, Dimitris Andreou <jim.andr...@gmail.com> wrote:
> > >>> Hi
>
> > >>> Firstly, you might also want to check this thread:
> >http://groups.google.com/group/google-collections-users/browse_thread...
>
> > >>> 2010/1/11 Noble <noble.p...@gmail.com>:
>
> > >>> > My tests show that HashSet is ~250% faster than immutableSet
>
> > >>> > and
> > >>> > Size:
> > >>> > Collections.unmodifiableMap: 7.4% bigger than HashMap
>
> > >>> You mean memory footprint? 7.4% doesn't make sense, unless you are
> > >>> talking about very tiny HashMaps. Collections.unmodifiableMap adds a
> > >>> negligible memory footprint (12 bytes from the top of my head)
>
> > >>> > google immutable map: 22.4% bigger than HashMap
>
> > >>> > Speed:
> > >>> > Collections.unmodifiableMap: 4.2% slower than HashMap
> > >>> > google immutable map: 26.0% slower than HashMap
>
> > >>> > is it expected?
>
> > >>> Not really :) but then, since you are not saying what exactly you are
> > >>> measuring, all bets are off.
>
> > >>> Dimitris
>
> > >>> > --
> > >>> > guava-...@googlegroups.com.
> > >>> >http://groups.google.com/group/guava-discuss?hl=en

> > >>> > unsubscribe: guava-discus...@googlegroups.com<guava-discuss%2Bunsubscribe@goog legroups.com>

>
> > >>> > This list is for discussion; for help, post to Stack Overflow
> > instead:
> > >>> >http://stackoverflow.com/questions/ask
> > >>> > Use the tag "guava".
>
> > >> --
> > >> guava-...@googlegroups.com.
> > >>http://groups.google.com/group/guava-discuss?hl=en

> > >> unsubscribe: guava-discus...@googlegroups.com<guava-discuss%2Bunsubscribe@goog legroups.com>

>
> > >> This list is for discussion; for help, post to Stack Overflow instead:
> > >>http://stackoverflow.com/questions/ask
> > >> Use the tag "guava".
>
> > --
> > guava-...@googlegroups.com.
> >http://groups.google.com/group/guava-discuss?hl=en

> > unsubscribe: guava-discus...@googlegroups.com<guava-discuss%2Bunsubscribe@goog legroups.com>

Kevin Bourrillion

unread,

Jan 12, 2010, 12:20:41 PM1/12/10

to guava-...@googlegroups.com, Jesse Wilson

On Mon, Jan 11, 2010 at 9:50 PM, Noble <noble...@gmail.com> wrote:

Hi Kevin ,
Here is a more elaborate and thorough test for the map .
https://issues.apache.org/jira/secure/attachment/12429895/TestPerf.java

I'm afraid there are still major problems with the methodology -- and I mean no discredit to you when I say that. Microbenchmarking in Java is an incredibly devious, obscure, massively counterintuitive sport. It's very hard to explain exactly what all the gotchas are, so Jesse Wilson and I are building a microbenchmarking framework that we hope will be able to take care of about half of them for you. (http://caliper.googlecode.com, not that it's especially ready to receive a lot of attention yet.)

I was planning to incorporate google collections to Apache Solr
because everyone claimed it faster and more efficient.

This blanket statement is pretty meaningless. We've made no such claim. Raw performance is not even one of our primary goals yet. Of course, the fact that ImmutableSet has a substantial memory savings over HashSet is easily observable, as is the fact that many times you can achieve a better "big-O" complexity using our APIs. Execution time is a very slippery measurement though.

(For us
performance is everything ). If the performance is comparable it is
acceptable because google collections offer a neat API.

You'll have to pursue the answer to that question on a feature-by-feature basis. When Caliper's ready, I'm hopeful that it will be a huge help to you!

when you say the test is inconclusive, is there a reason to suggest
what we observe in test will not be same in production as well?

Not *a* reason -- *many* reasons! And the same is true (though to a lesser extent, of course) of even the very best microbenchmark man or woman could devise. It will NOT accurately represent what happens in production! However, a change that improves a microbenchmark in a statistically significant way often gives us *plausibility* that the change probably ought to be better in real life. That's the best information we get to act on!

Read this (from yet another Google Collections/Guava frequent advisor): http://wiki.jvmlangsummit.com/MindTheSemanticGap and especially the links at the bottom. I hope it's eye-opening.

And if you're feeling adventurous, you could try to checkout caliper, open it in IDEA and run this class:

http://code.google.com/p/caliper/source/browse/trunk/src/examples/SetContainsBenchmark.java

HTH.

Kurt Alfred Kluever

unread,

Jan 12, 2010, 12:48:57 PM1/12/10

to guava-...@googlegroups.com, Jesse Wilson

For some further reading on performance testing, I'd highly recommend section 12.3 ('Avoiding performance testing pitfalls') of Java Concurrency in Practice (Goetz et al.)

Cheers,

Kurt Alfred Kluever | k...@google.com | (617) 500-KURT

Dimitris Andreou

unread,

Jan 12, 2010, 1:59:14 PM1/12/10

to guava-...@googlegroups.com

Kevin

While I agree with what you say, could you comment on my questions
earlier in the thread? From a look, it seems that ImmutableSet ought
to have the same behavior as HashSet - same hash and same table size,
yet in the example, the same value loopup had different behavior - how
do you explain the discrepancy? Obviously some assumption I make is
wrong - but which one?

Regards,

Dimitris

2010/1/12 Kevin Bourrillion <kev...@google.com>:

Kevin Bourrillion

unread,

Jan 12, 2010, 2:22:22 PM1/12/10

to guava-...@googlegroups.com

Sorry I missed that.

ImmutableSet's implementation is fundamentally different from HashSet's; it uses open addressing (http://en.wikipedia.org/wiki/Open_addressing; aka, it's not "chained"). This saves memory and has better cache locality, which is, I believe, why I.S. does only a little bit better than HashSet for small sets but significantly better for very large sets. Closed addressing does have very important advantages of its own too (though one of these advantages is how much easier it is to implement removal, which doesn't matter for us!).

With our tables, which we only allow to be up to 50% full, having the occasional key get stuck with three reprobes will happen occasionally, not super-often, and isn't really considered that bad. size/2 reprobes, now that would be bad! In fact, many have asked us for the option to pack their tables more full than this, and reprobes be damned.

Anyway, the linear probing we currently do might not be optimal; if someone wants to run my SetContainsBenchmark and trying hacking in quadratic probing, those results would be interesting at least.

I'm surprised to hear that a key suffered three re-probes.

Dimitris Andreou

unread,

Jan 12, 2010, 2:40:23 PM1/12/10

to guava-...@googlegroups.com

Oh, alright. I only cared to see table size and hash, I should have
read that more thoroughly - indeed, open addressing is a big
difference. That's why I sometimes prefer to use IdentityHashMap, when
applicable.

Then typically ImmutableSet will have bigger tables than HashSet, so
the test (in which I think I saw the table sizes happened to be equal,
64) is completely unfair - I presume the situation would look quite
better if the ImmutableSet of the test had, say, 33 elements, rathen
than 24. Agreed though, this is still in the domain of very small
sets, and the advantages of cache locality are not obvious yet.

Thanks for explaining this through Kevin! Now I'm even more happy with
the ImmutableSet implementation we have.

Kevin Bourrillion

unread,

Jan 12, 2010, 3:54:51 PM1/12/10

to guava-...@googlegroups.com

You're welcome!

Of course, I slyly focused only on ImmutableSet in my responses. We haven't yet put the same investment into ImmutableMap's performance and footprint, so it's not going to be any better than HashMap from that perspective.

Kevin Bourrillion

unread,

Jan 12, 2010, 6:40:07 PM1/12/10

to guava-...@googlegroups.com, Jesse Wilson

On Tue, Jan 12, 2010 at 9:48 AM, Kurt Alfred Kluever <k...@google.com> wrote:

For some further reading on performance testing, I'd highly recommend section 12.3 ('Avoiding performance testing pitfalls') of Java Concurrency in Practice (Goetz et al.)

Good tip.

Two of Brian's on-line articles also serve as a good introduction and should properly scare the crap out of you:

http://www.ibm.com/developerworks/java/library/j-jtp12214/

http://www.ibm.com/developerworks/java/library/j-jtp02225.html

Dimitris Andreou

unread,

Jan 12, 2010, 6:40:27 PM1/12/10

to guava-...@googlegroups.com

Ah, this talk always reminds me the poor state of java.util.HashSet
regarding memory footprint - why couldn't we have a lean and mean
mutable version too, apart from the immutable one?

As an exercise, I created a HashSet by embedding and dumbing down
HashMap source code, but it's tricky to actually make it lighter. This
is the entry object of HashMap:

static class Entry<K,V> implements Map.Entry<K,V> {
final K key;
V value;
Entry<K,V> next;
final int hash;

For a Set, we don't need the value field. But removing that field (I
have a typical enough machine) still leaves an Entry object at 24
bytes - no reduction. If I forgo the "hash" field too, then I get a
23% reduction in memory compared to java.util.HashSet.

I should have started with IdentityHashMap as a base :(

Benjamin Manes

unread,

Jan 12, 2010, 7:01:25 PM1/12/10

to guava-...@googlegroups.com

I haven't followed it, but there was some work by Doug Lea on core-libs-dev regarding a new implementation of HashMap for OpenJDK. If I recall correctly, it used a hybrid scheme of open and closed addressing to reduce memory footprint but avoid excess collisions of duplicate hashcodes for larger maps. If you wanted to put some effort into that, then you should probably look at the latest version Lea has. Its probably fair to expect Immutable* collections to be primarily used on small data sizes, which allows for a lot of nice optimizations, but for the core libs the usages vary wildly enough to need a different performance/memory balance. The Harmony folks might also have some good performance work, since I vaguely recall them contributing an improved TreeMap to OpenJDK.

Martin Buchholz

unread,

Jan 12, 2010, 9:00:49 PM1/12/10

to guava-...@googlegroups.com

On Tue, Jan 12, 2010 at 15:40, Dimitris Andreou <jim.a...@gmail.com> wrote:
> Ah, this talk always reminds me the poor state of java.util.HashSet
> regarding memory footprint - why couldn't we have a lean and mean
> mutable version too, apart from the immutable one?

I've had the same thoughts.

> As an exercise, I created a HashSet by embedding and dumbing down
> HashMap source code, but it's tricky to actually make it lighter. This
> is the entry object of HashMap:
>
> static class Entry<K,V> implements Map.Entry<K,V> {
> final K key;
> V value;
> Entry<K,V> next;
> final int hash;
>
> For a Set, we don't need the value field. But removing that field (I
> have a typical enough machine) still leaves an Entry object at 24
> bytes - no reduction. If I forgo the "hash" field too, then I get a
> 23% reduction in memory compared to java.util.HashSet.

Removing the "value" field has some value of course,
depending on the JVM. One would expect a reduction
from 7-8 words per key to 6-7 words per key.
I expect that the "hash" field has to stay, especially
if our aim is to actually replace java.util.HashSet.

The maintenance cost of duplicated code keeps one
from starting on this project, especially while other folks
still have grand plans for improving HashMap.

I believe IdentityHashMap uses open addressing
because the cost of a few extra comparisons is much lower
when comparison involves "==" instead of ".equals".

But I am still tempted to explore an implementation
based on a pair of arrays with open addressing,
one to store elements and one to store hashes,
for a space cost of about 4 words per key.

Martin

Dimitris Andreou

unread,

Jan 12, 2010, 10:17:48 PM1/12/10

to guava-...@googlegroups.com

Thanks for the interesting input, Martin!

I trusted too much Instrumentation#getObjectSize(), which shown that
the object remains at 24 bytes with or without a single field, but as
I read more carefully the javadocs, this is an approximation.

I was more or less following the discussion on a new HashMap, but I
haven't heard news on that for months. If HashMap itself changes, the
current HashSet will evolve automatically, but a HashSet replacement
would not do so as easily, and it would have to accommodate
LinkedHashSet somehow too. I was thinking more in terms of replacing
the implementation used in Sets#newHashSet() methods of guava (for
just the benefit of new code, ok).

I would think that open addressing does not mean more collisions if
the trade-off is striked at the right balance. If my rudimentary
calculations are right, linear probing would requite a load factor of
35% to have the same expected collisions as a chained hashtable with
load factor 75% (i.e. the default of HashMap). If one gets rids of
those Entry objects, it seems that the memory reduction will be just
about enough to compensate for the longer required tables (which are
going to be about twice the size), and then some more, so it should be
at least a slight win.

The easy solution obviously would be to ping Doug at some appropriate time. :)

Regards,

Dimitris

2010/1/13 Martin Buchholz <mart...@google.com>:

Martin Buchholz

unread,

Jan 20, 2010, 7:56:15 PM1/20/10

to guava-...@googlegroups.com

On Tue, Jan 12, 2010 at 19:17, Dimitris Andreou <jim.a...@gmail.com> wrote:
> Thanks for the interesting input, Martin!
>
> I trusted too much Instrumentation#getObjectSize(), which shown that
> the object remains at 24 bytes with or without a single field, but as
> I read more carefully the javadocs, this is an approximation.
>
> I was more or less following the discussion on a new HashMap, but I
> haven't heard news on that for months. If HashMap itself changes, the
> current HashSet will evolve automatically, but a HashSet replacement
> would not do so as easily, and it would have to accommodate
> LinkedHashSet somehow too. I was thinking more in terms of replacing
> the implementation used in Sets#newHashSet() methods of guava (for
> just the benefit of new code, ok).
>
> I would think that open addressing does not mean more collisions if
> the trade-off is striked at the right balance. If my rudimentary
> calculations are right, linear probing would requite a load factor of
> 35% to have the same expected collisions as a chained hashtable with
> load factor 75% (i.e. the default of HashMap). If one gets rids of
> those Entry objects, it seems that the memory reduction will be just
> about enough to compensate for the longer required tables (which are
> going to be about twice the size), and then some more, so it should be
> at least a slight win.
>
> The easy solution obviously would be to ping Doug at some appropriate time. :)

I did chat with Doug last week.
He has made some effort in the past to
make a much improved HashSet,
but he also did not come up with anything
sufficiently good to replace the current HashSet.