A Pauseless HashMap

1,629 views
Skip to first unread message

Gil Tene

unread,
Mar 30, 2014, 2:15:13 PM3/30/14
to
Some background: As those of you who have read my various rants may have noticed, I spend a lot of my time thinking about the behavior of latency/response-time/reaction-time. In addition to trying to understand and teach about the behavior better (with monitoring and measurement tools like HdrHistogram, LatencyUtils, and jHiccup), I actually work on things that makes that try to improve bad behavior (for some definitions of "improve" and "bad"). Eliminating pausing behavior in GC was the lowest hanging fruit, but more recent work has focused on eliminating pauses due to things like at-market-open deoptimzations, lock deflation, lock de-basing, and all sorts of TTSP (time to safepoint) issues. I've also learned a lot about how to bring down Linux's contribution to latency spikes.

But the JVM and the OS are not the only things that causes latency spikes. Sometimes your code is just "spiky". In my day job, I keep running into in actual, real-world low latency system code that is typically super-fast, but occasionally spikes in actual work latency due to some rare but huge piece of work that needs to be done, most often due to some state accumulation. After we eliminate GC pauses (which tend to dominate latency spikes, and which simply disappear immediately when Zing is applied), we often see this nice pattern of growing latency spikes at growing intervals, with a near-perfect doubling in both magnitude and interval between the spikes. This happens so often that we've studied the common causes, and (by far) the most common culprits are HashMaps used to accumulate something during the trading day. And it's all about resizing work.

I've had "build a Pauseless HashMap" on my weekend project list for over a year now, but finally got around to actually building it (at the request of someone on this list). There are probably at least 17 ways to skin a HashMap so it won't stall puts and gets when it resizes, but this is my simple take on it: 


Keep in mind that this is a "probably-working draft" that's gone through some bench testing, but is not yet battle hardened (scrutiny is welcome).

I intentionally based this version on Apache Harmony's version of HashMap, and not on OpenJDK's, in order to make it available without GPLv2 license restrictions (for those who may want to include it in non-GPL products). The background resizing concept itself is simple, and can be applied just as easily to the OpenJDK version (e.g. if some future Java SE version wants to use it). You can use (https://svn.apache.org/repos/asf/harmony/enhanced/java/trunk/classlib/modules/luni/src/main/java/java/util/HashMap.javaas a baseline comparison for the code I started with.

This is also a classic example of how GC makes this sort of concurrent programming thing both fast and simple. This is a classic case of an asymmetric speed need between two concurrent actors that share mutable state. I worked hard to make the fast path get() and put() cheap, and managed (I think) to not even use volatiles in the fast path. In doing this, I shifted all the cost I could think of to the background work, where latency doesn't matter nearly as much. This sort of trick would be much harder (or slower) to do if GC wasn't there to safely pick up the junk behind us, as it would (invariably, I think) add a need for additional synchronizing work in the fast path.





 

Roman Leventov

unread,
Mar 30, 2014, 7:16:25 AM3/30/14
to mechanica...@googlegroups.com
On the key thing, background resizing:
Comments in the code don't sufficiently describe relations between all 7 involved fields.
- Why sometimes pending resize is checked as if (!pendingResize), sometimes as if (resizingIntoElementData != null)?
- observedResizingIntoTable is set to false only in finishResizing(), finishResizing() is called only if backgroundResizeComplete is true,
backgroundResizeComplete is set to true at the end of Resizer runnable, spin-wait on observedResizingIntoTable to become false is in the beginning of Resizer -- how doesn't this lead to dead lock?

Misc:
1. Don't repeat mistakes of HashMap, class should be parameterized with expectedSize and loadFactor instead of capacity and loadFactor.
2. Increment modCount before structural modifications, compare yours clearImpl() with http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java#l859 and other places where modCount is incremented.
3. hashCode is computed twice in putImpl

воскресенье, 30 марта 2014 г., 8:11:10 UTC+4 пользователь Gil Tene написал:
Some background: As those of you who have read my various rants may have noticed, I spend a lot of my time thinking about the behavior of latency/response-time/reaction-time. In addition to trying to understand and teach about the behavior better (with monitoring and measurement tools like HdrHistogram, LatencyUtils, and jHiccup), I actually work on things that makes that try to improve bad behavior (for some definitions of "improve" and "bad"). Eliminating pausing behavior in GC was the lowest hanging fruit, but more recent work has focused on eliminating pauses due to things like at-market-open deoptimzations, lock deflation, lock de-basing, and all sorts of TTSP (time to safepoint) issues. I've also learned a lot about how to bring down Linux's contribution to latency spikes.

But the JVM and the OS are not the only things that causes latency spikes. Sometimes your code is just "spiky". In my day job, I keep running into in actual, real-world low latency system code that is typically super-fast, but occasionally spikes in actual work latency due to some rare but huge piece of work that needs to be done, most often due to some state accumulation. After we eliminate GC pauses (which tend to dominate latency spikes, and which simply disappear immediately when Zing is applied), we often see this nice pattern of growing latency spikes at growing intervals, with a near-perfect doubling in both magnitude and interval between the spikes. This happens so often that we've studied the common causes, and (by far) the most common culprits are HashMaps used to accumulate something during the trading day. And it's all about resizing work.

I've had "build a Pauseless HashMap" on my weekend project list for over a year now, but finally got around to actually building it (at the request of someone on this list). There are probably at least 17 ways to skin a HashMap so it won't stall puts and gets when it resizes, but this is my simple take on it: 


Keep in mind that this is a "probably-working draft" that's gone through some bench testing, but is not yet battle hardened (scrutiny is welcome).

I intentionally based this version on Apache Harmony's version of HashMap, and not on OpenJDK's, in order to make it available without GPLv2 license restrictions (for those who may want to include it in non-GPL products). The background resizing concept itself is simple, and can be applied just as easily to the OpenJDK version (e.g. if some future Java SE version wants to use it). You can use (https://svn.apache.org/repos/asf/harmony/enhanced/java/trunk/classlib/modules/luni/src/test/api/common/org/apache/harmony/luni/tests/java/util/HashMapTest.java) as a baseline comparison for the code I started with.

Vladimir Sitnikov

unread,
Mar 30, 2014, 7:27:17 AM3/30/14
to mechanica...@googlegroups.com
It is interesting how PauselessHashMap compares to http://javolution.org/target/site/apidocs/javolution/util/FastMap.html

Michael Hamrick

unread,
Mar 30, 2014, 8:38:39 AM3/30/14
to mechanica...@googlegroups.com

Before my starting into research PHM details and its source-code, could you confirm:

1. "Pauseless"  means less, not zero, stop-the-world GC pauses.  STW can still happen, right?
2.  runs on HotSpot with any specialized Collector/SurvivorRatio/etc JVM boot parameters?
3.  PHM stays  entirely on-heap and benefits from longterm background GC ?

Gil Tene

unread,
Mar 30, 2014, 1:10:07 PM3/30/14
to mechanica...@googlegroups.com
Answers inline


On Sunday, March 30, 2014 5:38:39 AM UTC-7, Michael Hamrick wrote:

Before my starting into research PHM details and its source-code, could you confirm:

1. "Pauseless"  means less, not zero, stop-the-world GC pauses.  STW can still happen, right?

Nothing to do with GC pauses. This is about the common HashMap operations (put and get) not stalling when a resize is triggered.
 
2.  runs on HotSpot with any specialized Collector/SurvivorRatio/etc JVM boot parameters?

Runs on any JVM. Can drop in anywhere HashMap is used and vice versa (no missing or added methods)
 
3.  PHM stays  entirely on-heap and benefits from longterm background GC ?

The 2-way concurrency (between the mutator and the background resizer) relies on GC to take care of things, and thereby gains speed for the mutator (no synchronization or volatile access needed in mutator fast/common path, even during resizing). 

Gil Tene

unread,
Mar 30, 2014, 1:20:51 PM3/30/14
to mechanica...@googlegroups.com
I haven't looked at FastMap before, but here are a few difference I see:

1. FastMap is thread safe. PauselessHashMap isn't (just like HashMap). This makes PauselessHashMap inherently faster for use in places where HashMap his currently being used.

2. FastMap Resizing is (apparently, per note quoted below) not done in the background. It is done "only when the map's size is small", and after that trades off mutator speed (in all access cases compared to HashMap) for avoiding large resizing stalls. In contrast, PauselessHashMap does an actual resize, and ends up with a fast flat resized hash map (which means speed is very similar to HashMap). Even during resize, PauselessHashMap access (which costs about double what it does outside of resize) is will be faster than a recursive sub-map access.

3. Think of HashMaps with *lots* of entries, that grow all day. That's an example of what PauselessHashMap was built to deal with, without slowing down or pausing.

Implementation Note: To maintain time-determinism, rehash/resize is performed only when the map's size is small (see chart). For large maps (size > 512), the map is divided recursively into (64) smaller sub-maps. The cost of the dispatching (based upon hashcode value) has been measured to be at most 20% of the access time (and most often way less).

Manik Surtani

unread,
Mar 30, 2014, 1:25:44 PM3/30/14
to mechanica...@googlegroups.com
I'm not sure if this works as a direct drop-in replacement.  Each instance kicks off 2 threads for background resizing.  An application with a moderate to large number of HashMap instances could quickly end up spawning many more threads than otherwise anticipated, causing issues elsewhere - in ways you wouldn't expect of "just a map".  I think any usage of a map like this should be with fully understanding the resource usage implications.

Also, I think it might be useful to be able to pass in a ScheduledExecutorService when constructing, so that they could be potentially shared among several such maps.

- Manik

Gil Tene

unread,
Mar 30, 2014, 1:35:11 PM3/30/14
to mechanica...@googlegroups.com


On Sunday, March 30, 2014 10:25:44 AM UTC-7, Manik Surtani wrote:
I'm not sure if this works as a direct drop-in replacement.  Each instance kicks off 2 threads for background resizing.  An application with a moderate to large number of HashMap instances could quickly end up spawning many more threads than otherwise anticipated, causing issues elsewhere - in ways you wouldn't expect of "just a map".  I think any usage of a map like this should be with fully understanding the resource usage implications.

It's not every instance:

static final ScheduledExecutorService backgroundResizers =
            Executors.newScheduledThreadPool(DEFAULT_NUMBER_OF_BACKGROUND_RESIZER_EXECUTOR_THREADS);
 

Also, I think it might be useful to be able to pass in a ScheduledExecutorService when constructing, so that they could be potentially shared among several such maps.

Would be a nice enhancement if drop-in replacement (in both direction) is not needed, and you didn't want to have any threads dedicate dot this class. Same for externally controlling the size of the background resizing pool that is already there. But 2 static & dormant threads for the whole class is not a big deal in most cases.

Ben Cotton

unread,
Mar 30, 2014, 1:40:07 PM3/30/14
to mechanica...@googlegroups.com
3. Think of HashMaps with *lots* of entries, that grow all day. That's an example of what PauselessHashMap was built to deal with, without slowing down or pausing.

Another case to think of is the HM  that "super sizes" everything upfront to delay as much as possible any downstream (operator triggered) re-size pauses.  The new PHM approach may help equip these cases with higher resolution tactics for managing their re-size pause stress tolerances.


There are probably at least 17 ways to skin a HashMap [...]

At least, indeed!  Funny how we always seem to re-visit HM with hopes to teach it new tricks.  Your approach here is very cool, Gil. 

Manik Surtani

unread,
Mar 30, 2014, 1:51:20 PM3/30/14
to mechanica...@googlegroups.com
On 30 March 2014 10:35, Gil Tene <g...@azulsystems.com> wrote:

static final ScheduledExecutorService backgroundResizers =
            Executors.newScheduledThreadPool(DEFAULT_NUMBER_OF_BACKGROUND_RESIZER_EXECUTOR_THREADS);

Whoops, I missed the static modifier.  Very different now... :) 


Gil Tene

unread,
Mar 30, 2014, 2:11:44 PM3/30/14
to mechanica...@googlegroups.com
Thanks for the input and comments. I'm glad to see such a detailed review. 


On Sunday, March 30, 2014 4:16:25 AM UTC-7, Roman Leventov wrote:
On the key thing, background resizing:
Comments in the code don't sufficiently describe relations between all 7 involved fields.

I'll add comments to clarify.
 
- Why sometimes pending resize is checked as if (!pendingResize), sometimes as if (resizingIntoElementData != null)?

Those two conditions are not equivalent. Operations that depend on a resize being in flight check pendingResize. Operations that depend on a resizingIntoElementData existing check for it. There is a (potentially long) period of time between pendingResize being set and resizingIntoElementData being populated. 

- observedResizingIntoTable is set to false only in finishResizing(), finishResizing() is called only if backgroundResizeComplete is true,
backgroundResizeComplete is set to true at the end of Resizer runnable, spin-wait on observedResizingIntoTable to become false is in the beginning of Resizer -- how doesn't this lead to dead lock?

It's not a dead-lock, but a potentially never-done resize, and a potential memory leak as a result.

If a resize is triggered and no further put access into the hash map is done after that, the background resizer will never progress past the spin-wait loop that waits for the mutator to indicate observedResizingIntoTable. This doesn't stop anyone else from doing anything, but the poor resizer task will keep polling on the observed condition every millisecond. A back off (to longer waits) in the wait loop is probably appropriate here.

The memory leak problem this raises would occur on the edge case where the very last put access to a hash map instanced triggered a resize, and the instance then "dies" from the mutator's point of view. In this situation, the background resizer will keep the instance alive while waiting for an observed indication that will never come... Fixing this probably requires the background resizer to "give up" on resizing on some timeout condition. A bit tricky (need to avoid deadlocking with mutator operations that block waiting for resize to complete), but I think it's doable...
 

Misc:
1. Don't repeat mistakes of HashMap, class should be parameterized with expectedSize and loadFactor instead of capacity and loadFactor.

PauselessHashMap is meant to be a drop-in replacement for HashMap. so I intentionally repeat all of it's externally visible interface mistakes ;-)
 
2. Increment modCount before structural modifications, compare yours clearImpl() with http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java#l859 and other places where modCount is incremented.

I started with Harmony's HashMap, and avoided trying to optimize anything in it that didn't have to do with the problem I was working on. I included a wrong link in my original posting above (now fixed). The baseline code is at:
and as you can see in clear() there, the modCount change in the loop comes from how Harmony chose to do this. The tradeoff here appears to be between how frequently you touch modCount in structural changes, and how lax your concurrent modification detection would be. I just wanted to avoid stepping on some assumption Harmony has in the code.

 
3. hashCode is computed twice in putImpl

Good catch. Thanks. I dropped it and replaced with comment about why it's not necessary.

Роман Левентов

unread,
Mar 30, 2014, 3:00:33 PM3/30/14
to mechanica...@googlegroups.com
On Sun, Mar 30, 2014 at 10:11 PM, Gil Tene <g...@azulsystems.com> wrote:
 
- Why sometimes pending resize is checked as if (!pendingResize), sometimes as if (resizingIntoElementData != null)?

Those two conditions are not equivalent. Operations that depend on a resize being in flight check pendingResize. Operations that depend on a resizingIntoElementData existing check for it. There is a (potentially long) period of time between pendingResize being set and resizingIntoElementData being populated. 

- observedResizingIntoTable is set to false only in finishResizing(), finishResizing() is called only if backgroundResizeComplete is true,
backgroundResizeComplete is set to true at the end of Resizer runnable, spin-wait on observedResizingIntoTable to become false is in the beginning of Resizer -- how doesn't this lead to dead lock?

It's not a dead-lock, but a potentially never-done resize, and a potential memory leak as a result.

If a resize is triggered and no further put access into the hash map is done after that, the background resizer will never progress past the spin-wait loop that waits for the mutator to indicate observedResizingIntoTable. This doesn't stop anyone else from doing anything, but the poor resizer task will keep polling on the observed condition every millisecond. A back off (to longer waits) in the wait loop is probably appropriate here.

The memory leak problem this raises would occur on the edge case where the very last put access to a hash map instanced triggered a resize, and the instance then "dies" from the mutator's point of view. In this situation, the background resizer will keep the instance alive while waiting for an observed indication that will never come... Fixing this probably requires the background resizer to "give up" on resizing on some timeout condition. A bit tricky (need to avoid deadlocking with mutator operations that block waiting for resize to complete), but I think it's doable...

Thanks for explanation, it is interesting.
 
 

Misc:
1. Don't repeat mistakes of HashMap, class should be parameterized with expectedSize and loadFactor instead of capacity and loadFactor.

PauselessHashMap is meant to be a drop-in replacement for HashMap. so I intentionally repeat all of it's externally visible interface mistakes ;-)

Probably no one ever used HashMap(int) constructor correctly. Everyone type `new HashMap(100)` when needs a HashMap with 100 entries. While `new HashMap((int) (100 / 0.75) + 1)` is correct. Therefore a switch to expectedSize could even improve someone's code. Moreover, I think it is not a big deal to change construction contract a bit, anyway class name should be replaced. Josh Bloch and Guava authors actively promote static factories, for example.
 
 
2. Increment modCount before structural modifications, compare yours clearImpl() with http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java#l859 and other places where modCount is incremented.

I started with Harmony's HashMap, and avoided trying to optimize anything in it that didn't have to do with the problem I was working on. I included a wrong link in my original posting above (now fixed). The baseline code is at:
and as you can see in clear() there, the modCount change in the loop comes from how Harmony chose to do this. The tradeoff here appears to be between how frequently you touch modCount in structural changes, and how lax your concurrent modification detection would be. I just wanted to avoid stepping on some assumption Harmony has in the code.

 As far as I understand Harmony and OpenJDK collection implementations originate from common source (Doug Lea's collections?). Now Harmony is abandoned a bit, OpenJDK is still improved. I remember commits when modCount increment movements, fixes of issues with int overflow on collection sizes near to Integer.MAX_VALUE. That is why Harmony implementations is not an "alternative view", but just slightly less robust, older version.

--  
You received this message because you are subscribed to a topic in the Google Groups "mechanical-sympathy" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mechanical-sympathy/DY8vysxdmj4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Georges Gomes

unread,
Mar 30, 2014, 6:14:53 PM3/30/14
to mechanical-sympathy
Gil was gentle enoug to share PauselessHashMap with us a few week ago and I have been comparing it to FastMap which was my low-pause benchmark.
Here is some finding:


FastMap starts at capacity 16.
Then resize (*2) up to 1024.
(load factor is 0.5 hard)

This is simply because resizing a 1024 element is not long.

After 512 elements (0.5 of 1024) then 64 sub-FastMaps are created and all entries are re-dispatched.
And so on...

Very clever. I have been a fan of this Map for years.
Concurrent, low pause, lock-free iterators, iterable like a LinkedHashMap and perform pretty well thanks to it's open addressing and 0.5 loadfactor I guess.
The Swiss Army Map.


Some resize figures:

FastMap(16) and put() 100K elements:
~630 resizes
Average 1us each

(Pretty good I would say)

FastMap(16) and put() 1M elements:
~14000 resizes
Averaging 10us each
Resizes start at 1us and end at 20-25us.


JDK HashMap in comparison:

HashMap(16) and put() 100K elements:
#1 resize : 1us
#2 resize : 1us
#3 resize : 0us
#4 resize : 2us
#5 resize : 3us
#6 resize : 5us
#7 resize : 13us
#8 resize : 24us
#9 resize : 50us
#10 resize : 116us
#11 resize : 199us
#12 resize : 405us
#13 resize : 856us
#14 resize : 2312us
PUT 9618us Size: 100000

HashMap(16) and put() 1M elements:
...
#10 resize : 137us
#11 resize : 288us
#12 resize : 497us
#13 resize : 888us
#14 resize : 2879us
#15 resize : 10346us
#16 resize : 34031us
#17 resize : 75884us
PUT 211627us Size: 1000000

100K to 1M elements only adds 3 resizes but 10ms, 34ms and 75ms for these!
(keep in mind FastMap is 0.5 loadfactor and HashMap is 0.75)


The PauselessHashMap of Gil: None of these pauses.
Very nice!
Put()s are obviously a little bit slower during the resize, but in real life, the put()s will be spread over time and only a few of them will really be impacted by the resize kicking in background.
The engineering put in place to have the fast path quick is very nice and works very well.
There is a lot of potential here.

Thanks for sharing this Gil!

Cheers
Georges


Note: I also tested NonBlockingHashMap from Cliff Click that does "collaborative" resize but it doesn't like 1M put()s :) I think it needs some get()s to get the resize work properly... I didn't dig at all.




--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

ymo

unread,
Mar 31, 2014, 12:46:21 AM3/31/14
to mechanica...@googlegroups.com
1) Can you share the test code for these results?

2) Did i miss the performance tests in the source  ? or was it left on purpose ?

3) Did you consider latency/throughput for inserts and/or remove ? or was it left on purpose ?

In  a real world scenario no one would be inserting millions of items in a batch normally you would do a mix of inserts and removals. But I think the how and why you (or Gil) tested this in a particular way would be "more" interesting to me :-)

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Sand Stone

unread,
Mar 31, 2014, 12:49:05 AM3/31/14
to mechanica...@googlegroups.com
I have to say the title of this email thread of great interests to me as well. 

The following test code on my Mac took about 165 seconds. I can see visible pauses in between the "millions" print after 10M. Also there is a minor issue that the process could not quit due to background threads. 

====

java -version

java version "1.7.0_51"

Java(TM) SE Runtime Environment (build 1.7.0_51-b13)

Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


 java -cp -server -Xmx20g -cp .:./PauselessHashMap-1.0-SNAPSHOT.jar IntTest


import org.giltene.*; 

public class IntTest {

  public static void main(String[] args) {
    
    PauselessHashMap<Integer,Integer> intmap = new PauselessHashMap<Integer, Integer>(4096);
    int count = 150000000;
    System.out.println("adding");
    long t1 = System.nanoTime();
    for (int i=0; i<count;i++) {
      if (i%1000000==0) 
        System.out.println(i);
      intmap.put(i,i);
    }
    long t2 = System.nanoTime();
    System.out.println("took(secs):" + (t2-t1)/1e9);
  }
}

P.S: As a comparison to another hashtable implementation I could not share, the similar test took 16 seconds on the same machine, with one thread total. And no, the implementation is not tuned for the "int" type. 

Sand Stone

unread,
Mar 31, 2014, 12:51:16 AM3/31/14
to mechanica...@googlegroups.com

>In  a real world scenario no one would be inserting millions of items in a batch normally you would do a mix of inserts and removals. 
That's not true if one intends to use the hash map to cache data on the server side. 


To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

ymo

unread,
Mar 31, 2014, 1:26:02 AM3/31/14
to mechanica...@googlegroups.com


On Monday, March 31, 2014 12:51:16 AM UTC-4, Sand Stone wrote:

>In  a real world scenario no one would be inserting millions of items in a batch normally you would do a mix of inserts and removals. 
That's not true if one intends to use the hash map to cache data on the server side. 


My question is more like "1) do you need to test when multiple operations are involved and 2) how would you test that "?

 

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Sand Stone

unread,
Mar 31, 2014, 2:11:58 AM3/31/14
to mechanica...@googlegroups.com
Got you. 

For the scenarios I care mostly, they are dominated by inserts and lookups (and mutate the values). So maybe 10s of millions inserts and lookups, and with batch removal of 10k-100k every once a while (say 10s of seconds).  

My scenarios are not the "typical" JDK use cases, and the CPU time needs to be regulated at the system level (concurrent libraries tend to eat a lot of extra CPU time). Unlike various sorting benchmarks (tera-sort, minute-sort, joule sort), I am not aware of any standalone standardized benchmarks. TPC-C etc. are database focused. 

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Gil Tene

unread,
Mar 31, 2014, 4:38:44 AM3/31/14
to mechanica...@googlegroups.com

An all out test of puts per second is not how I'd look at this sort of thing. For a few reasons:

- First: Since this is in no way an attempt at making a faster HashMap, it's an attempt to make one that doesn't glitch when you resize. So unless you are measuring the glitches, you aren't measuring what it does.

- Second: Because it's not an attempt at a speedier has map, comparing it's common-case speed (right now) to anything other than the baseline code it was based on (Harmony's java.util.HashMap, see link in original posting above) is also useless. It's no different than comparing Harmony's HashMap against your inetersting-but-different-hashmap that we don't have source for. You speed differences will more likely come from different internal layout (e.g. open addressing vs. separate chaining with linked lists vs. some other intrerna form) that anything else.

- Third, an all-out insert-as-fast-as-possible test is really not the best way to test a background resizer, unless that's what you actually expect your application to be doing, in which case I doubt you need a background resizer... The reason is that you are likely to outrun the resizer with your puts, leading to both inefficient puts (the hash map is going to be smaller than it should be most of the time) and to back-to-back and wasteful resizing. The typical application for this sort of thing would accumulate and grow the hash map at some rate, but not at an all-out-all-the-time rate. E.g. on my Mac, PauselessHashMap will happily take ~10Million new entries a second. But I wouldn't test it under insert rates of more than ~100K to 1M per second if I wanted to see how it resizes without glitching.

- GC. GC. GC. My guess is that GC behavior dominates your test. It certainly did when I ran it on my Mac (which doesn't have room for a 20GB heap). puts were going at >10M per second between GC pauses. If you turn on verbose gc, I bet you'll see those glitches, and if you size and tune your heap and your GC setting for the test you are doing so that it won't thrash, you may be able to sustain that for a while.

I'll try to put together some simple perf tests that will run at a given rate, measure operation latencies, record them in a histogram, and report the outcome comparing Harmony's HashMap with PauselessHashMap. (Already added OriginalHashMap to tests in prep for this)...
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Kirk Pepperdine

unread,
Mar 31, 2014, 4:53:30 AM3/31/14
to mechanica...@googlegroups.com

On Mar 31, 2014, at 10:38 AM, Gil Tene <g...@azulsystems.com> wrote:

>
> An all out test of puts per second is not how I'd look at this sort of thing. For a few reasons:
>
> - First: Since this is in no way an attempt at making a faster HashMap, it's an attempt to make one that doesn't glitch when you resize. So unless you are measuring the glitches, you aren't measuring what it does.
>
> - Second: Because it's not an attempt at a speedier has map, comparing it's common-case speed (right now) to anything other than the baseline code it was based on (Harmony's java.util.HashMap, see link in original posting above) is also useless. It's no different than comparing Harmony's HashMap against your inetersting-but-different-hashmap that we don't have source for. You speed differences will more likely come from different internal layout (e.g. open addressing vs. separate chaining with linked lists vs. some other intrerna form) that anything else.
>
> - Third, an all-out insert-as-fast-as-possible test is really not the best way to test a background resizer, unless that's what you actually expect your application to be doing, in which case I doubt you need a background resizer…

This is a case that I’ve seen with the Scala parser. It’s name spaces are implemented as a HashMap and for large compiles hash map resizing dominates.. emasculates everything else in the profile. I’d almost suggest that this might be a good test case (in a suite of test cases)… Even if the background resizer is over run in this situation it would be helpful to understand how it might improve the performance of a compile. In my last test of the Scala compiler (which was a while ago), I’d estimate that 3-4 minutes of the 7 minute compile was due to symbol table resizing. You might want to talk to Bill Venners. Also, you might want to try to setup the bench using JMH (http://openjdk.java.net/projects/code-tools/jmh/).

Regards,
Kirk

Gil Tene

unread,
Mar 31, 2014, 6:21:44 AM3/31/14
to <mechanical-sympathy@googlegroups.com>
PauselessHashMap doesn't make resize cheaper, and it doesn't make hashmaps faster. It doesn't make any averages better, either. It makes outliers go away by moving their cost sideways, to a background operation, so that get and put latency can remain consistent. The total cost (including the resize) will likely remain the same or grow.

As such, I don't expect it to score better on any throughput metric. Including "how long does it take to put 150m entries in a hashmap", and "how long does it take to parse X". It's all about "what is the longest put operation the hashmap will show?", and "what is the reaction time behavior of code that uses this hashmap?" (Where "behavior" describes the entire percentile spectrum for reaction time to a random, uncoordinated event).

Georges's posting shows the sort of thing a pauseless hashmap is good for. Without measuring the timing of individual operations, it a pointless to discuss the other metrics.

Sent from Gil's iPhone

> On Mar 31, 2014, at 1:53 AM, "Kirk Pepperdine" <ki...@kodewerk.com> wrote:
>
>
>> On Mar 31, 2014, at 10:38 AM, Gil Tene <g...@azulsystems.com> wrote:
>>
>>
>> An all out test of puts per second is not how I'd look at this sort of thing. For a few reasons:
>>
>> - First: Since this is in no way an attempt at making a faster HashMap, it's an attempt to make one that doesn't glitch when you resize. So unless you are measuring the glitches, you aren't measuring what it does.
>>
>> - Second: Because it's not an attempt at a speedier has map, comparing it's common-case speed (right now) to anything other than the baseline code it was based on (Harmony's java.util.HashMap, see link in original posting above) is also useless. It's no different than comparing Harmony's HashMap against your inetersting-but-different-hashmap that we don't have source for. You speed differences will more likely come from different internal layout (e.g. open addressing vs. separate chaining with linked lists vs. some other intrerna form) that anything else.
>>
>> - Third, an all-out insert-as-fast-as-possible test is really not the best way to test a background resizer, unless that's what you actually expect your application to be doing, in which case I doubt you need a background resizer...
>
> This is a case that I've seen with the Scala parser. It's name spaces are implemented as a HashMap and for large compiles hash map resizing dominates.. emasculates everything else in the profile. I'd almost suggest that this might be a good test case (in a suite of test cases)... Even if the background resizer is over run in this situation it would be helpful to understand how it might improve the performance of a compile. In my last test of the Scala compiler (which was a while ago), I'd estimate that 3-4 minutes of the 7 minute compile was due to symbol table resizing. You might want to talk to Bill Venners. Also, you might want to try to setup the bench using JMH (http://openjdk.java.net/projects/code-tools/jmh/).
>
> Regards,
> Kirk
>
> --
> You received this message because you are subscribed to a topic in the Google Groups "mechanical-sympathy" group.
> To unsubscribe from this topic, visit https://groups.google.com/d/topic/mechanical-sympathy/DY8vysxdmj4/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to mechanical-symp...@googlegroups.com.

Benedict Elliott Smith

unread,
Mar 31, 2014, 8:03:30 AM3/31/14
to mechanica...@googlegroups.com
So, following this discussion it wasn't clear to me why this approach would be better than simply spreading the cost of the migration over the inserts that follow its initiation (and possibly gets/removes), so I hacked up an alternative "pauseless" hashmap that does this. It still uses a single separate thread to off-load the allocation of large arrays, which are still costly operations. I have done some very simple (hacky) benchmarking with a best effort to eliminate GC (making YG larger than the space necessary for the test) and throwing as many put operations at each map as I could, and found that generally this approach seems to keep the worst latency experienced lower than Gil's, whereas the <99%ile is better with Gil's approach. Which is hardly a massive surprise. The advantage is that fewer total system resources are required (10-40% fewer cycles across all threads)

Anyone who is interested / fancies doing more extensive comparisons: https://github.com/belliottsmith/scratch/blob/master/src/org/belliottsmith/PauselessHashMap.java

Note it only supports put/get, as that's all I was interested in for the moment. Only puts share the burden of migration, but it would be simple to make it configurable for lookups to also get involved to ensure it always completes in a reasonably timely fashion.


You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Vladimir Sitnikov

unread,
Mar 31, 2014, 9:52:52 AM3/31/14
to mechanica...@googlegroups.com
Benedict Elliott Smith: spreading the cost of the migration over the inserts that follow its initiation (and possibly gets/removes)

This approach looks like a lot easier to program: you do not need to synchronize concurrent read/writes from multiple threads (the background resizer vs mutator).
In azul you do not even require that background allocator thread.
The drawback of that "pauseless" is that you'll keep old Object[] array longer (no problem for azul, a promotion problem for OpenJDK)

Gil,
Here are some more questions:
1) Did you consider using less non-static fields? I understand you might want to leak "the first dose of pauseless", then suggest azul when users realize all their PauselessMaps consume all the RAM :)
As far as I understand, reading volatile fields does not involve much overhead, thus there should be no much improvement out of using non-volatile + volatile pair. That is volatile update that require overhead of false sharing, cache line contention, etc.

2) Entry#isvalid is used inconsistent in findNullKeyEntryInChain vs findNonNullKeyEntryInChain. Is that intentional?

3) Reader thread has absolutely no synchronization in respect to values read out of newly created Entry objects in background. There is no synchronization when accessing resizingIntoElementData either.
Typical JDK implementation allows to reason on the concurrency correctness in terms of JLS, thus it is correct for a valid JMM implementation. Your code is written in terms of memory barriers, thus it is not clear if it is bullet-proof.

4) It is very hard to trace if if (result == null) {result = resizingIntoEntry.value;} that overrides the previous value of  result = entry.value; is fine.
On the top of my head I can't find a case when this would result in invalid "result". However, if both those values are always the same, then why bother with if (result == null) {result = resizingIntoEntry.value;}?
The fact that I especially do not like is that result==null is a valid value (i.e. null was stored in the Map), then you might rewrite that result=null with some unknown resizingIntoEntry.value.

Vladimir

Benedict Elliott Smith

unread,
Mar 31, 2014, 10:05:06 AM3/31/14
to mechanica...@googlegroups.com
This approach looks like a lot easier to program: you do not need to synchronize concurrent read/writes from multiple threads

Exactly why I wondered why it wasn't the initially addressed approach. It's a pretty common pattern for reducing worst case behaviour. It also can happily (and more easily) co-exist with other optimisations, for exactly the same reason of not worrying about synchronisation (e.g. managing a tree of buckets to mitigate worst case complexity). This was just a very initial hack to check it behaved as it should.
 
The drawback of that "pauseless" is that you'll keep old Object[] array longer (no problem for azul, a promotion problem for OpenJDK)

It's debatable if the promotion cost is likely to be an issue even on an Oracle VM: The table is going to be a small percentage of garbage production, and the larger tables will get allocated in tenured anyway.

I think this is a better price to pay; memory is cheaper than CPU time, and all it really does is affect your load factor. There are optimisations that could be done to mitigate this effect as well.




--

Kirk Pepperdine

unread,
Mar 31, 2014, 10:52:23 AM3/31/14
to mechanica...@googlegroups.com
I take your point s but what looking for is to side step the resize costs so I don’t take a direct hit for a resize. Which means the parser keeps going while the hashmap executes the maintenance step of resizing.

— Kirk
> You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Sand Stone

unread,
Mar 31, 2014, 11:20:28 AM3/31/14
to mechanica...@googlegroups.com
Gil, it's not my intent to diminish your code. I am more excited about this topic of "pauseless hash map". You helped me to confirm my hypothesis that use concurrent approach is not a good way to reduce the hash map resizing pause (let's call it H-pause). Indeed I wouldn't do a better job at all. 

From the test code, you should see/feel the large variance of pause time and jittery in between the prints. I picked 150M for a few reasons:
     *  it's big enough (my real scenario is actually much bigger), 
     * it's small enough to fit within the memory of my workstation instead of running on some server. 
Based on my experience I would claim that people who really need pauseless hashmaps are doing millions of inserts/puts constantly. 

Given this is a mechanical-sympathy group, I would say single thread still has a lot of potential: low variance of H-pause time, all achieved with a fraction of overall CPU time and 1/3 of memory budget. 


To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Georges Gomes

unread,
Mar 31, 2014, 12:35:24 PM3/31/14
to mechanical-sympathy

In a world with more and more cores that are difficult to use efficiently: I think this off-load-to-thread approach very interesting.

And I think this technic is more cache friendly because fast-path threads are doing the minimum and background threads concentrate around the resize.

Still, the single-thread is quite nice.
I would happily enjoy a head-to-head pauselessHM in this group to find the best possible implementation for a few scenarios/benchmarks :)

@Gil would you be OK to accept requests of alternatives impl / benchmarks to your github project?
I would happily contribute!

Would be a nice playground for everybody I think :)

Cheers
GG

Rüdiger Möller

unread,
Mar 31, 2014, 1:27:15 PM3/31/14
to mechanica...@googlegroups.com


Am Montag, 31. März 2014 18:35:24 UTC+2 schrieb Georges Gomes:

In a world with more and more cores that are difficult to use efficiently: I think this off-load-to-thread approach very interesting

^ This. Its actually an easy strategy to improve throughput/reduce latency, while still maintaining a single threaded processing logic. Similar strategies could be applied to other foundation classes, e.g. growing ArrayList's, sorting, en-/decoding (not sure if break even can be achieved always in these examples ..).

Benedict Elliott Smith

unread,
Mar 31, 2014, 1:27:33 PM3/31/14
to mechanica...@googlegroups.com
And I think this technic is more cache friendly because fast-path threads are doing the minimum and background threads concentrate around the resize.

Well, I'm not really convinced by this position. There are all kinds of things you can tweak with a progressive migration approach: if you decide to always force all put operations to hit the old table if they hit indexes after the migration position, then only the used portion of either table will be in cache, so it will behave no worse than a regular hashmap in this regard. By contrast the multi-threaded approach will have cache coherency traffic to contend with. FTR I made and uploaded this modification to the map I created - it hits put throughput but presumably due to increased collisions, but this could be a trivially configurable option for those who care about cache behaviour. If that's a goal, it's probably better to use open-address hashing though, and this may be more difficult to do in steps.

Gil: not sure if there's a bug in your hashmap, but once I started including GC cleanup costs as a separate metric, I found that your map was producing a metric ton of GC work. 10M inserts produces ~15s flat-out-all-CPUs GC activity for me. So whilst the map itself has slightly higher throughput than the progressive approach when GC is ignored and there is no other load, the total cost is several magnitudes higher. I assume this must be a bug though.


Kirk Pepperdine

unread,
Mar 31, 2014, 1:41:27 PM3/31/14
to mechanica...@googlegroups.com
Hi Gil,


>
> Georges's posting shows the sort of thing a pauseless hashmap is good for. Without measuring the timing of individual operations, it a pointless to discuss the other metrics.

I’ve a workload to do and unless this offers me some advantage over HashMap to reduce the overall time it takes to crunch through that workload I’m wondering, what’s the point? After all isn’t jitter another form of dead time and shouldn’t reducing dead time equate to faster processing of a workload?

Regards,
Kirk

>
> Sent from Gil's iPhone
>
>> On Mar 31, 2014, at 1:53 AM, "Kirk Pepperdine" <ki...@kodewerk.com> wrote:
>>
>>
>>> On Mar 31, 2014, at 10:38 AM, Gil Tene <g...@azulsystems.com> wrote:
>>>
>>>
>>> An all out test of puts per second is not how I'd look at this sort of thing. For a few reasons:
>>>
>>> - First: Since this is in no way an attempt at making a faster HashMap, it's an attempt to make one that doesn't glitch when you resize. So unless you are measuring the glitches, you aren't measuring what it does.
>>>
>>> - Second: Because it's not an attempt at a speedier has map, comparing it's common-case speed (right now) to anything other than the baseline code it was based on (Harmony's java.util.HashMap, see link in original posting above) is also useless. It's no different than comparing Harmony's HashMap against your inetersting-but-different-hashmap that we don't have source for. You speed differences will more likely come from different internal layout (e.g. open addressing vs. separate chaining with linked lists vs. some other intrerna form) that anything else.
>>>
>>> - Third, an all-out insert-as-fast-as-possible test is really not the best way to test a background resizer, unless that's what you actually expect your application to be doing, in which case I doubt you need a background resizer...
>>
>> This is a case that I've seen with the Scala parser. It's name spaces are implemented as a HashMap and for large compiles hash map resizing dominates.. emasculates everything else in the profile. I'd almost suggest that this might be a good test case (in a suite of test cases)... Even if the background resizer is over run in this situation it would be helpful to understand how it might improve the performance of a compile. In my last test of the Scala compiler (which was a while ago), I'd estimate that 3-4 minutes of the 7 minute compile was due to symbol table resizing. You might want to talk to Bill Venners. Also, you might want to try to setup the bench using JMH (http://openjdk.java.net/projects/code-tools/jmh/).
>>
>> Regards,
>> Kirk
>>
>> --
>> You received this message because you are subscribed to a topic in the Google Groups "mechanical-sympathy" group.
>> To unsubscribe from this topic, visit https://groups.google.com/d/topic/mechanical-sympathy/DY8vysxdmj4/unsubscribe.
>> To unsubscribe from this group and all its topics, send an email to mechanical-symp...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
> --
> You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Benedict Elliott Smith

unread,
Mar 31, 2014, 1:49:39 PM3/31/14
to mechanica...@googlegroups.com
 I've a workload to do and unless this offers me some advantage over HashMap to reduce the overall time it takes to crunch through that workload I'm wondering, what's the point? After all isn't jitter another form of dead time and shouldn't reducing dead time equate to faster processing of a workload?

This is a different problem. You're addressing total throughput; jitter is important for users with tight SLAs on latency for an operation. These users are usually quite willing to accept lower throughput in order to meet their SLAs.

Georges Gomes

unread,
Mar 31, 2014, 2:28:16 PM3/31/14
to mechanical-sympathy

That's my case.

On Mar 31, 2014 7:49 PM, "Benedict Elliott Smith" <bellio...@datastax.com> wrote:
 I've a workload to do and unless this offers me some advantage over HashMap to reduce the overall time it takes to crunch through that workload I'm wondering, what's the point? After all isn't jitter another form of dead time and shouldn't reducing dead time equate to faster processing of a workload?

This is a different problem. You're addressing total throughput; jitter is important for users with tight SLAs on latency for an operation. These users are usually quite willing to accept lower throughput in order to meet their SLAs.

--

Gil Tene

unread,
Mar 31, 2014, 7:02:15 PM3/31/14
to mechanica...@googlegroups.com
As another way of skinning a HashMap to avoid pauses, this incremental approach is a good alternative. But I don't think there will/should be any better in either the worst latencies or the common case speed. A side-by-side comparison of the approaches on the (otherwise) same get/put logic is probably a good way to check that.

Comparing the two approached at the high level:

- Both approaches use a background thread for the allocation of the target table.

Allocation:
- The incremental copy approach does not allocate new entries during the copy.
- The background copy approach allocates new entries for the copy. Discards old ones when copy is done.

- The current org.giltene version allocates an additional array of a similar size to the target array in order to minimize the need for uniqueness search within buckets. This allocation is not absolutely necessary, and can be avoided at the cost of forcing each background copy to scan each bucket chain for uniqueness in every case (so it's a background cpu vs. background allocation and collection cost tradeoff).

Put:
- The incremental copy approach does a put only in the target table
- The background copy approach updates both the source and target tables, inserts only in the target

- The incremental approach bears the cost of doing multiple copy operations for each put
- The background approach takes no copy cost in the put, other than forcing the existence of a copy of the updated entry into the target

Get:
- The incremental copy approach looks up in both source and target tables (see note below)
- The background copy approach looks up in both source and target tables 

Other:
- The (current) incremental copy approach is more susceptible to increased latency in bad hash codes situations, since it copies 4 complete buckets for each put. This can be mitigated by limiting the number of entries (rather than buckets) copied per put, without incurring much added cost (since the get needs to scan both source and target anyway to be correct).

I expect for the GC pressure to be high with the background approach. I'd expect the common case (not during resize) costs to be the same for both approaches, but I would expect the speed of put() during resizing to be faster in the background copy case, while the cost of a get() during a resize is similar in both (at about 2x the cost of a get when no resize is happening). However, since the background copy operations will complete faster, the number of get and put operations impacted by the resize should be significantly lower in the background case, and this relative benefit grows as the rate of access drops (e.g. at 50K ops/sec the relative benefit of background copy will be much bigger difference than it would be at 5M ops/sec).

Notes for the code:
- Your get is currently doing a lookup either in the source table or in the target table, which I think is a bug. At least for the >= migrated range, you need to look up in both, since your put implementation forces whatever bucket it hits to migrate to the target table.
- Your code does not handle null keys right now, which I think is the main cause of code complexity difference.



> To unsubscribe from this group and all its topics, send an email to mechanical-sympathy+unsub...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Benedict Elliott Smith

unread,
Mar 31, 2014, 7:35:44 PM3/31/14
to mechanica...@googlegroups.com
- Your get is currently doing a lookup either in the source table or in the target table, which I think is a bug. At least for the >= migrated range, you need to look up in both, since your put implementation forces whatever bucket it hits to migrate to the target table.
- The incremental copy approach looks up in both source and target tables (see note below)
while the cost of a get() during a resize is similar in both (at about 2x the cost of a get when no resize is happening). However, since the background copy operations will complete faster, the number of get and put operations impacted by the resize should be significantly lower in the background case

None of the above are true in the current version, but I agree there was an oversight (bug) when I hacked it together in the first version. The current version knows exactly which table to look in, so never incurs a higher get cost, which I think is probably the better option, at the expense of slightly more expensive puts.

This can be mitigated by limiting the number of entries (rather than buckets) copied per put, without incurring much added cost (since the get needs to scan both source and target anyway to be correct).

I did consider this, but ultimately decided if you're susceptible to bad hash functions, this is going to be of minimal benefit since you'll incur those linear costs every time you insert to the bucket anyway, so incurring it during grow isn't much worse - and the code complexity is non-trivial for minimal gain. The correct solution for this in my opinion is to maintain a tree (split-ordered on hash, then by some comparator) for conflict resolution - this makes migration of a bucket an O(1) event as well as bounding lookups.

Your code does not handle null keys right now, which I think is the main cause of code complexity difference.

Well, null keys are not very difficult to handle in the single-threaded case, but I'm not convinced supporting them is a good thing anyway. Null values are another matter (I also don't handle them) which are arguably more useful, but I'm also not at all convinced they should be supported.

However, since the background copy operations will complete faster, the number of get and put operations impacted by the resize should be significantly lower in the background case, and this relative benefit grows as the rate of access drops

Absolutely. But also on how many spare cores you have going (predictably) idle: As it stands the higher CPU cost from GC burden is very substantial (>100x GC burden, or about 10-15x total CPU-time burden). On the topic of which, I would expect a higher cost of GC for the background approach, but ~100x cost is surprising - perhaps the background threads could dismantle the old tables once they're known expired to make GC's life easier? Otherwise, as it stands, given the similar overall latency characteristics of the incremental approach, it doesn't seem a reasonable tradeoff for most use cases. 

In general I would expect that keeping more cores idle for other work is likely to be of greater benefit to keeping the overall system latency down. However if you have only a handful of maps to maintain, plenty of spare cores, and you care both about the worst latency and maintaining the lowest possible latency for any individual operation, then the background approach is certainly preferable - even if the rate of message arrival is high.



 


To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Gil Tene

unread,
Mar 31, 2014, 9:08:05 PM3/31/14
to mechanica...@googlegroups.com


On Monday, March 31, 2014 6:52:52 AM UTC-7, Vladimir Sitnikov wrote:
Benedict Elliott Smith: spreading the cost of the migration over the inserts that follow its initiation (and possibly gets/removes)

This approach looks like a lot easier to program: you do not need to synchronize concurrent read/writes from multiple threads (the background resizer vs mutator).
In azul you do not even require that background allocator thread.
The drawback of that "pauseless" is that you'll keep old Object[] array longer (no problem for azul, a promotion problem for OpenJDK)

Gil,
Here are some more questions:
1) Did you consider using less non-static fields? I understand you might want to leak "the first dose of pauseless", then suggest azul when users realize all their PauselessMaps consume all the RAM :)

Not really (for the non-static field question). The vast bulk of the storage volume is in the table array and the entries, not in the HashMap object body, so the number of fields is not critical there. Especially for booleans.

As for the relation to Zing, and whether this is part of an agenda: This is a different sort of pauseless operation, and has nothing to do with GC. As such, it relates to all JVMs. Ask anyone who is experiencing a 120msec synchronous resize in a low latency system... On Zing such resize-indiced outliers are more visible because we've already taken care of the GC related glitches, which makes these remaining ones stand out, and make us (and out customers) pay more attention to them. 

But since the background copy does rely on GC for making the fast path faster that it would be with foreground copy approaches, I guess this approach does show Zing off, since Zing makes GC pressure a non-issue. It's a classic case where good GC capability (where you stop worrying about pauses and the impact of GC pressure) allows you to make your fast code faster, and all you pay for that extra speed is some extra background CPU cycles (which low latency systems tend to have plenty to spare).
 
As far as I understand, reading volatile fields does not involve much overhead, thus there should be no much improvement out of using non-volatile + volatile pair. That is volatile update that require overhead of false sharing, cache line contention, etc.

Reading volatile fields doesn't force the CPU to fence in x86 in itself, but it does force the JIT compiler to avoid reordering other read across the volatile read. This can lead to some significant costs, especially in forcing the re-reading of otherwise register-cachable values. See Nitsan's blog entry demonstrating this in generated code for details of an example, showing how even instance-final fields get re-read after a volatile read.

So it's the potential cost to the surrounding code (which may inline the put and get code) that I'm avoiding but not using volatiles here...
You are right. Fixed it. Thanks!
 

3) Reader thread has absolutely no synchronization in respect to values read out of newly created Entry objects in background. There is no synchronization when accessing resizingIntoElementData either.

That is intentional in both cases.

For access to resizingIntoElementData, the observation acknowledgement (and the wait loop awaiting it's indication in the background resizer) allows us to avoid synchronizing on the array itself. Chains that hang off of the array are synchronized with a CAS of the chain head, used by both the put and the background copy (for inserts into the target array). Since during background resizing the chains can only grow, and only by adding to their heads, there is not need to synchronize on anything other than the head.

Newly created Entry objects don't need explicit synchronization, because:

1. Entries (newly) created by the mutator are never read by the background resizer.

and 

2. Entries (newly) created by the background resizer will only be visible after a CAS on the chain head in resizingIntoElementData. A successful  CAS has the same memory effects of both reading and writing a volatile variable (per java.util.concurrent.atomic). On the receiving side, you will either see the ref to the newly allocated Entry, or you won't, and the CAS means that the entry's contents are guaranteed to be made visible before the reference to it is. Since the Entry is newly allocated, there is no question about it's not-fully-backed contents being potentially cached by the reading side.
  
Typical JDK implementation allows to reason on the concurrency correctness in terms of JLS, thus it is correct for a valid JMM implementation. Your code is written in terms of memory barriers, thus it is not clear if it is bullet-proof.

If you think relying on lazySet() and CAS for barriers and ordering is wrong, look under the hood at the various java.util.concurrent implementations... ;-). Yes, the strict JMM spec doesn't say *anything* about lazySet, but there is lots of stuff that would break in concurrent collections if a JVM were to mess with the expected StoreStore ordering ahead of a lazySet, and with it's visibility implications to code that reads stuff on the other end of such barriers.

Still, from an algorithm point of view, there may well be visibility ordering bugs lurking in there, and that I just *think* I covered them all.
 
4) It is very hard to trace if if (result == null) {result = resizingIntoEntry.value;} that overrides the previous value of  result = entry.value; is fine.
On the top of my head I can't find a case when this would result in invalid "result". However, if both those values are always the same, then why bother with if (result == null) {result = resizingIntoEntry.value;}?

The background resizer clears each bucket chain in the source table after it completes copying it. An example case where result would be null when looked up in the source and would need to be replaced by whatever the target table's old value is this: The entry already exists in the target table, but is no longer there in the source table (because the bucket chain was cleared).
 
The fact that I especially do not like is that result==null is a valid value (i.e. null was stored in the Map), then you might rewrite that result=null with some unknown resizingIntoEntry.value.

As you note, the result == null situation is normal for an inserting put (HashMap allows for both null keys and null values). But it should only be null for a non-inserting put if it replaces a value of null in an existing entry.

Note that the { result = resizingIntoEntry.value;  } part occurs before resizingIntoEntry.value = value; , so you would get a null unless there was already an entry in the target array, and that entry had a non-null value, in which case you are replacing the old value there, and that's the value that should be (and is) returned by the code.



Vladimir

Kirk Pepperdine

unread,
Apr 1, 2014, 2:22:38 AM4/1/14
to mechanica...@googlegroups.com
>
> As for the relation to Zing, and whether this is part of an agenda: This is a different sort of pauseless operation, and has nothing to do with GC. As such, it relates to all JVMs. Ask anyone who is experiencing a 120msec synchronous resize in a low latency system... On Zing such resize-indiced outliers are more visible because we've already taken care of the GC related glitches, which makes these remaining ones stand out, and make us (and out customers) pay more attention to them.

In fact I believe that as we tackle the GC pause time problem we’re going to see TTSP for other VM maintenance to trickle up the performance stack. IMHO not enough thought has been put into avoiding safe pointing in order for the JVM to clear out some built up state. You don’t need much safe-pointing to occur before you seriously limit your ability to scale across a minimal number of cores.

I don’t see this HashMap as addressing this particular problem. IME, I don’t need a faster HashMap (though I suppose one would be nice), in fact most of the data structures are plenty fast enough for the vast majority of things we do. The problem is, maintenance work gets in the way. I see this as pushing maintenance work out of the processing threads way so that thread can just get on with it. That, in it’s self, is a huge win.

Regards,
Kirk

Josh Humphries

unread,
Apr 14, 2014, 12:09:55 AM4/14/14
to mechanica...@googlegroups.com
Hello, all,
In a side thread (Manik Surtani shared the link to this discussion with co-workers the other week, and we had our own discussion), a couple of other options were posited for high throughput map data structures that don't experience a pause due to re-hashing: hash-array mapped tries and linear hashing.

Hash-array mapped tries use a tree structure to avoid having to double an underlying array and re-hash. But they are much more cache-friendly than BSTs and have constant-time access (queries are O(k) where k is the number of bits in the hash code and unrelated to the size of the table). It also offers deterministic iteration order (sorted by hash code, insertion order breaks hash code collisions) without having to keep an orthogonal linked list of entries (though iteration is much slower since it's more complicated -- it's a tree structure with no back-pointers to parent nodes, so requires the iterator to maintain a stack).

Linear hashing is more like a traditional hash table, but uses a pretty clever trick to grow incrementally (one bin at a time), so that increasing the size of the map only needs to re-hash one bin, not the whole table. This table must still be backed by a growable array. The typical growable array would be something like ArrayList, which must double and copy periodically. Though significantly faster than re-hashing the whole table, it would still cause a noticeable latency spike when a very large table needed its underlying array to be doubled. There exists growable array structures that can grow by one element in constant time and also provide constant time random access, which would be more suitable.

I implemented both, HamtMap and LinearHashingMap, and then compared their performance and throughput against standard JRE maps: HashMap, LinkedHashMap, and TreeMap.

As might be expected, HashMap and LinkedHashMap were the kings of throughput, and TreeMap was the dog of the group. These two new implementations are actually reasonably close in performance to the top two, especially LinearHashingMap which ends up having the better throughput (though I think there is a little more room for optimization in the HamtMap implementation). Also as expected, despite having the best average speed for insertions, HashMap and LinkedHashMap also had the slowest worst case, due to the re-hashing. None of the other three implementations had as much variation in put operations. Also as might be expected, HamtMap had slightly better worst-case get operations, no doubt since two objects with different hash codes can never hash to the same bin so navigating a linked list to find a key in a bin doesn't happen as often (actually, for the performance test I wrote, it would never happen since it uses numeric keys, which only collide on hash code for values > 2^32).

For your consideration, here are the implementations:

And the harness I used to play around with benchmarking these implementations:


Vladimir Sitnikov

unread,
Apr 14, 2014, 12:35:26 AM4/14/14
to mechanica...@googlegroups.com

Can you please use JMH as a benchmark harness?

Hint: JMH is much likely to measure the right thing than other harness.
Hint2: when using JMH, Alexey validates the benchmark for free. That is what I see.
Hint3: when you hand-roll a benchmark harness God kills a kitten.

Vladimir

ymo

unread,
Jun 4, 2014, 8:00:10 AM6/4/14
to mechanica...@googlegroups.com
Hi Gil.

I tried to play with this Pauseless HashMap and its barking at me (. It looks like there is no way to cleanly stop those 2 threads running in the background and maybe JMH does not like that very well ???

jmh based benchmark http://pastebin.com/NUN4vDet

If i replace line 34 with java.util.HashMap everything works fine.

Thanks.

Gil Tene

unread,
Jun 15, 2014, 11:30:40 AM6/15/14
to mechanica...@googlegroups.com
The backgroundResizers threads were not daemons, so they keep the JVM alive unless it explicitly exits (which I guess junit does but jmh doesn't). I now made backgroundResizers into daemons with change 4b57b502d0 . Can you try it again?
Reply all
Reply to author
Forward
0 new messages