jmh benchmark for (pauseless) hashmap


ymo

Jun 23, 2014, 8:48:03 PM
to mechanica...@googlegroups.com

Unless I am doing something wrong with the benchmark (which is very likely), the difference in latency only shows up above the 99.9th percentile ... so only in worst-case scenarios, which was expected.






ymo

Jun 23, 2014, 8:58:31 PM
to mechanica...@googlegroups.com
Forgot to ask ...

1) Am I doing something wrong in how I set up the test?
2) Is java.util.HashMap supposed to be better in the get case?

p.s.
all units are in us/op

Benedict Elliott Smith

Jun 24, 2014, 5:05:28 AM
to mechanica...@googlegroups.com

You're presizing your hash tables by performing 10% of the inserts in the setup method, so you're not benchmarking much of the 'pauselessness'. Only 3 method calls in the java.util.HashMap test will involve a resize (10M inserts = ~16M capacity; *2^3 = ~128M, which can accommodate 100M elements), so it is not surprising you only see the difference in the very long tail.
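Benedict's capacity arithmetic can be checked with a small sketch (the class and method names here are illustrative, not from the thread; it assumes java.util.HashMap's default load factor of 0.75, default initial capacity of 16, and power-of-two table doubling):

```java
// Sketch of the resize arithmetic for java.util.HashMap-style tables.
public class ResizeMath {

    // Smallest power-of-two capacity whose 0.75 threshold holds `size` entries.
    static long capacityFor(long size) {
        long cap = 16; // HashMap's default initial capacity
        while (cap * 3 / 4 < size) cap <<= 1;
        return cap;
    }

    // Number of doublings needed to grow from holding `presized` entries
    // to holding `total` entries.
    static int resizesBetween(long presized, long total) {
        int n = 0;
        for (long cap = capacityFor(presized); cap * 3 / 4 < total; cap <<= 1) n++;
        return n;
    }

    public static void main(String[] args) {
        // Presizing with 10M inserts leaves ~16M capacity; only 3 doublings
        // (16M -> 32M -> 64M -> 128M) remain before 100M entries fit.
        System.out.println(capacityFor(10_000_000));
        System.out.println(resizesBetween(10_000_000, 100_000_000));
    }
}
```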


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ymo

Jun 26, 2014, 2:27:27 PM
to mechanica...@googlegroups.com
Thank you Benedict for looking over the test.

I redid the test and now I am not presizing the map. What I found is that:

1) The differences only start at the 99.9th percentile.
2) The difference is not very big.
3) In the put case, where the resizing is exercised, java.util.HashMap wins, but again not by much.

So what gives? Anything else I am doing wrong here?


Georges Gomes

Jun 26, 2014, 2:36:00 PM
to mechanical-sympathy

Hi

If the number of entries is large then yes, only the higher percentiles are affected, but the larger the map, the longer the pause: it can jump into the seconds range for a resize. That is quite a long pause for processes where latency is critical.

The other aspect to take into account is that if you have multiple maps, you multiply the occurrence of these pauses, impacting a bigger percentile.

My 2 cents
GG


Gil Tene

Jun 27, 2014, 10:14:48 PM
to mechanica...@googlegroups.com
My guess is that your benchmark doesn't do any resizes at all during the measured iterations, and that only the first iteration increases the hash map size. Subsequent runs probably just replace entries in place.

This is based on a few observations:

1. There is no way that a blocking resize on a 10M entry hash map can complete in tens of usec.

2. The first iteration of each of your java.util.HashMap runs shows a large max latency (>10 msec), but none of the rest do.

3. The first iteration of the pauseless hash map run shows a slightly larger max than the rest (120 usec), but that max is ~100x smaller than in the java.util.HashMap case.

E.g.:
PauselessHashMap:
# Warmup Iteration 1: n = 21133, mean = 1 us/op, p{0.00, 0.50, 0.90, 0.95, 0.99, 0.999, 0.9999, 1.00} = 0, 1, 1, 1, 4, 21, 109, 120 us/op
...
and:
# Warmup Iteration 1: n = 33701, mean = 1 us/op, p{0.00, 0.50, 0.90, 0.95, 0.99, 0.999, 0.9999, 1.00} = 0, 0, 1, 1, 2, 4, 61, 74 us/op
...

java.util.hashmap:
# Warmup Iteration 1: n = 26027, mean = 1 us/op, p{0.00, 0.50, 0.90, 0.95, 0.99, 0.999, 0.9999, 1.00} = 0, 1, 1, 1, 3, 17, 71, 11534 us/op
...
and:
# Warmup Iteration 1: n = 31790, mean = 1 us/op, p{0.00, 0.50, 0.90, 0.95, 0.99, 0.999, 0.9999, 1.00} = 0, 0, 1, 1, 5, 17, 61, 12272 us/op
...

You need to figure out how to make each benchmark iteration encounter a resize of significant size (10M entries).

ymo

Jun 30, 2014, 8:29:12 AM
to mechanica...@googlegroups.com
Hi Gil.

I had to fix the test because my map was getting filled before I started the actual measurement. I still think it's a valid case, especially if you know how many items you will have as a minimum. However, as I said, I modified the tests so that they start with an empty map and then keep putting items into it until the test finishes. I also ran multiple nobrOfItems values, from 10000 to 10 million, to see how each one behaved. On the read side I have to populate before I can do a get, so I filled the map before doing 10 million sequential gets.

I also checked the repetitions on my random long generator, and it seems there are none until at least 19 million values!

Here is a chart showing the results ... http://jsfiddle.net/PU45W/2/embedded/result/

p.s.
the actual data is embedded in the HTML. If you click on edit you can take a look.

ymo

Jun 30, 2014, 4:06:35 PM
to mechanica...@googlegroups.com
Just to clarify: all the units are microseconds. The test ran on a Debian machine with 4G RAM and 2 CPUs. The CPUs were not isolated for this test. The JDK that was used was jdk1.7.0_60.64

If you concentrate on the onlyOnePut results, which basically means adding items to an empty map until the required count is reached, you will see that:

1) java.util.HashMap is doing better in the 100k and 1mil cases.
2) In the other cases the latency at the 99.9999th percentile goes through the roof ... in a big way!! But all the lower percentiles are still OK.

Is this what I am supposed to observe?

Gil Tene

Jun 30, 2014, 8:51:53 PM
to mechanica...@googlegroups.com
There are a few things to note here:

1. In your tests, the ONLY metric that I expect the pauseless hash map to do better on is the worst case. When you omit that from the data sets, you are throwing away the main (and only) reason for the pauseless hash map's existence.

2. Percentiles computed based on the timing of successful serialized results in a coordinated test (time op1, and when it is done, time op2, ...) are meaningless unless you adjust them for the expected throughput (and the resulting expected interval between requests) in your actual system.

To elaborate on #2, let me use numbers from your previous posting. In that posting, you demonstrated a max latency for a java.util.HashMap based OnlyOneGet of ~12 msec with an average latency of ~88 nsec.

The percentiles you compute for the test would be valid for throughputs below about 83 operations per second (1000 msec / 12 msec), but above that, the percentiles lie, because the test harness is coordinating with the serially measured operation. How much they lie by depends on the throughput the system will be used at. For the rate at which you ran the test (~11.36M hashmap ops/sec), each of those max-time stalls would have also stalled ~136.4K other operations, none of which were measured by the test harness, but every one of which would see an increased latency in a real-world system that actually tried to deal with this sort of arrival rate. This means that all your percentile reports (for a system running at this throughput) are off by roughly five orders of magnitude (i.e. your 99%'ile will be similar to the level you measured as the 99.99999%'ile).

But 11.36M ops/sec is probably higher than reality. For a system running at a more reasonable rate, e.g. 100K or 1M hashmap ops/sec, the correction wouldn't be as big. Your numbers would "only" be off by 3 or 4 orders of magnitude ;-).
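The arithmetic above can be sketched in a few lines (the class and method names are made up for illustration; the numbers are the ones from the thread):

```java
// Sketch of coordinated-omission arithmetic: how many operations a single
// stall silently delays at a given arrival rate, i.e. how many latency
// samples a coordinated (serial) test harness never sees.
public class StallMath {

    static long hiddenOps(long stallNsec, long opsPerSec) {
        return stallNsec * opsPerSec / 1_000_000_000L;
    }

    public static void main(String[] args) {
        // A ~12 msec stall at ~11.36M ops/sec hides ~136K operations;
        // at a more modest 100K ops/sec it still hides ~1.2K operations.
        System.out.println(hiddenOps(12_000_000L, 11_360_000L));
        System.out.println(hiddenOps(12_000_000L, 100_000L));
    }
}
```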

To be preemptively fair to JMH: to my knowledge, nobody in the JMH sphere of influence ever claimed that the test harness is uncoordinated, or that the percentile numbers reported can be used to deduce anything about latency behavior (percentile or otherwise) in the real world. To do anything like that (provide %'ile behaviors for latency on the test at a given throughput), the test harness would need to actually execute the test at a given throughput, and measure latency behavior at that throughput (adjusting for any "late" injection by counting the time spent waiting for injection as part of the measured latency). You wouldn't even need HdrHistogram for this (although it would be useful for reporting the percentiles with good dynamic accuracy without having to retain all that data for sorting at the end). I'd love to see a JMH version that does that (hint hint). Aleksey had mentioned the possibility before on this group, here.

ymo

Jul 1, 2014, 2:58:04 AM
to mechanica...@googlegroups.com
Thank you Gil for shattering all my beliefs in JMH ... I was candidly thinking that this would be something already taken care of by the framework ((

Now the alternatives in front of me are:
1) Wait until this is properly implemented in JMH.
2) Make something based on LatencyUtils and hope that I don't reinvent things that are already solved in JMH.

All bleak if you ask me ((( I will try to use LatencyUtils and report back, to make sure I am not doing something wrong again.

p.s.
I remember that thread since I started it ;)

ymo

Jul 2, 2014, 3:54:32 AM
to mechanica...@googlegroups.com
Hi Gil. One more question ... )

I was pointed to this link* (by none else than Mr. Shipilev) on the JMH list before I posted my last results, and it is sinking in ... somewhat! Let us say that I put the call to the put function in a loop that is bracketed by calls to System.nanoTime. The idea is to pass this time to an HdrHistogram to compensate for coordinated omission. Now what would you call a good number of times to run this loop? Should it be zero, meaning time every call, or 2, 4, 8, 16 times??? The point is to make sure you are not hitting the granularity of System.nanoTime and still get meaningful results.

Regards.




Gil Tene

Jul 2, 2014, 11:14:43 AM
to <mechanical-sympathy@googlegroups.com>
The only way to study the latency behavior of an operation (as opposed to the throughput across an aggregate number of operations) is to measure each and every operation individually.

E.g. it would be valid to measure latency across 8 HashMap operations if the question you are asking is "what does the latency of making 8 consecutive HashMap operations behave like?" But the answer to that question lacks some key information for answering "what does the latency of an individual HashMap operation behave like?", in exactly the same mathematical sense that only measuring across 1,000,000 operations would. Your notion of "max" and of the higher percentiles will "only" be one order of magnitude off instead of six, but it is still misrepresenting the useful information, especially since the only purpose of a pauseless HashMap is to avoid or minimize outliers.
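The order-of-magnitude point can be seen with simple arithmetic (a sketch; the class, method names, and base latency are illustrative, with the ~12 msec outlier taken from earlier in the thread):

```java
// Sketch: timing across a batch of n operations dilutes a single outlier by
// roughly a factor of n in the reported per-op time, hiding it from the max
// and the high percentiles.
public class BatchDilution {

    static long reportedPerOpNsec(long baseNsec, long outlierNsec, int n) {
        // One op stalls for outlierNsec; the other n - 1 take baseNsec each.
        return ((n - 1) * baseNsec + outlierNsec) / n;
    }

    public static void main(String[] args) {
        // A 12 msec outlier among 8 ops of ~100 ns reports as ~1.5 msec per op,
        // an order of magnitude smaller than the true max.
        System.out.println(reportedPerOpNsec(100L, 12_000_000L, 8));
    }
}
```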

So yes, you'll have to live with the granularity problems of measuring time. The good news is that they will affect your estimates of the lower percentiles, but have no significant impact on the higher ones. You can still use a separate throughput test to establish the common-case speed of the operation at hand without time-measurement overhead.

If you use HdrHistogram for studying the latency behavior, you will still need to decide what expected interval to tell it to correct for. An acceptable approximation is to use (1/throughput) as the expected interval time in an all-out run. Another thing you may want to do is to collect one uncorrected histogram, and then report multiple different percentile scenarios, one per expected interval (1/expected_throughput), by using copyCorrectedForCoordinatedOmission to create a corrected histogram for each reporting scenario. This has the added benefit of moving the correction logic cost outside of your measurement windows (which doesn't matter much with multi-usec measurements, but does when they are this short).
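As a rough illustration of what that correction does (a stdlib-only sketch: HdrHistogram's copyCorrectedForCoordinatedOmission operates on its internal bucketed counts, and the class and method names below are made up for the example):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stdlib-only sketch of coordinated-omission correction: for every recorded
// latency larger than the expected interval between samples, back-fill the
// (shorter) latencies the stalled-out requests would have observed.
public class CoCorrection {

    static List<Long> correct(List<Long> rawLatencies, long expectedIntervalNsec) {
        List<Long> corrected = new ArrayList<>();
        for (long v : rawLatencies) {
            corrected.add(v);
            // A stall of length v hides requests that would have started every
            // expectedIntervalNsec during it; add their latencies back.
            for (long missed = v - expectedIntervalNsec; missed > 0; missed -= expectedIntervalNsec) {
                corrected.add(missed);
            }
        }
        Collections.sort(corrected);
        return corrected;
    }

    static long valueAtPercentile(List<Long> sorted, double percentile) {
        int idx = (int) Math.ceil(percentile / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }

    public static void main(String[] args) {
        // Three fast ops and one 5-unit stall, at a 1-unit expected interval:
        System.out.println(correct(List.of(1L, 1L, 1L, 5L), 1L));
    }
}
```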

And yes, the real-world percentiles should be different at different throughputs for the same outlier behavior.

Sent from my iPad

Aleksey Shipilev

Jul 2, 2014, 11:42:21 AM
to mechanica...@googlegroups.com
On 07/01/2014 04:51 AM, Gil Tene wrote:
> To be preemptively fair to JMH: to my knowledge nobody in the JMH sphere
> of influence ever claimed that the test harness is uncoordinated, or that
> the percentile numbers reported can be used to deduce anything about
> latency behavior (percentile or otherwise) in the real world.

That's true. JMH does, however, show a better picture on beefier
workloads, mostly because the percentile spread is lower, and
1/throughput is closer to the max latency.

> To do
> anything like that (provide %'lie behaviors for latency on the test at a
> given throughout), the test harness would need to actually execute the
> test a given throughput, and measure latency behavior at that throughput
> (adjusting for any "late" injection by counting the time waiting for
> injection as part of the measured latency). [...] Aleksey had mentioned the possibility before on this
> group, here
> <https://groups.google.com/d/msg/mechanical-sympathy/yfClI3yLOdY/rgKIRQiBKXIJ>.

Yeah, and speaking from the experience of (half-)implementing this in
SPECjbb2013, I'm not sure it would be done in the next few months. Drafts
and patches for JMH are welcome, of course. :) Ping me if you want more
gory details on where to start hacking.

Assuming the driver threads are the worker threads, i.e. the benchmark
now turns into something with adaptive backoff (think time) before each
operation, we would still need to solve quite a few nasty problems:

a) Busy-waiting or blocking during think-time? Both are bad:
- busy-waiting burns the cycles in a pattern very different from the
measured code; and also requires calibration if you want throughput
measured in absolute time units;
- blocking has the lower limit on the duration you can block for: the
best I vaguely remember doing is tens of usecs with timed
LockSupport.park. And yes, that's platform-specific.

b) Maintaining constant throughput would mean maintaining the average
time between two consecutive calls into the operation. That would mean
measuring the time spent in each operation to figure out how much to back
off now. In SampleTime mode, we try to balance the timer scalability costs
with a lower sampling frequency, but in this new mode, we would have to
measure each operation. The rule of thumb for me is that 50K samples per sec
is the upper bound on this. Therefore, the lower limit for system-wide
throughput in this scenario is 50K/sec. This is very far from the
maximum throughput the system can handle without locking the throughput
in -- some clear guidance for benchmarks would be needed.

c) Quickly sampling normal, Poisson, or exponential distributions to
get the current think time would take some creativity. At a rate of
50K/sec we would need a random value every 20 us. If we want to keep
PRNGs below 1% of the overall profile, that translates to 200 ns per
random sample -- on the border of what is possible with iterative and/or
transcendental generators. Note you can't pre-cache PRNG sequences,
because short-period PRNGs will introduce resonant oscillations in the
total submission rate.

d) The submission-rate-maintaining code is usually very fragile, and
generally untestable because you have only a vague idea of all the
timing shenanigans the infrastructure can experience.

I don't think these problems are unsolvable, but I do know they require
*A LOT* of time to get right. I may be wrong. I sincerely hope I am wrong.

Thanks,
-Aleksey.

ymo

Jul 2, 2014, 1:17:11 PM
to mechanica...@googlegroups.com
Can I ask how LatencyUtils is able to produce (semi-)deterministic throughput rates? Is it accurate enough, or does it just not matter?

Gil Tene

Jul 2, 2014, 1:25:49 PM
to mechanica...@googlegroups.com
Aleksey, I wish I had time to hack on JMH for this. But maybe someone else can. I was thinking of a simpler starting point that hopefully circumvents most of the nasty problems you mention: a "maintain the requested throughput throughout the test" mode, rather than a "do each request at the exact time it was supposed to" one. Most constant-throughput load generators do this without doing any think-time estimation, and the math for that is quite simple; it's just too bad they forget to do the right math on latency reporting. Here is the simple pseudo-logic, with both correct and "incorrect" (coordinated) latency measurement being done in a sustained-throughput test loop (the coordinated latency logic is included mostly to show what most constant-throughput load-testing tools I see seem to do):

    long expectedStartTimeNsecDeltaPerOp = 1000000000L / requestedOpsPerSec;
    long expectedStartTimeNsec = System.nanoTime();

    while (testNotComplete()) {
        long actualStartTimeNsec;

        while (expectedStartTimeNsec <= (actualStartTimeNsec = System.nanoTime())) {
            doOperation();
            long measuredEndTimeNsec = System.nanoTime();

            long coordinatedOperationLatencyNsec = measuredEndTimeNsec - actualStartTimeNsec;
            long operationLatencyNsec = measuredEndTimeNsec - expectedStartTimeNsec;

            coordinatedLatencyHistogram.recordValue(coordinatedOperationLatencyNsec);
            latencyHistogram.recordValue(operationLatencyNsec);

            expectedStartTimeNsec += expectedStartTimeNsecDeltaPerOp;
        }

        TimeUnit.NANOSECONDS.sleep(someSmallIntervalNsec); // Optional. Can spin instead.
    }

With this logic, latency is measured from when an operation was supposed to start to the time it actually completed. Any time delay in starting the operation (whether it is due to a blocking on a long operation or not) is counted as operation latency, correcting for coordinated omission with no need for post-correction or estimation of intervals.

A somewhat more sophisticated version would avoid "blaming" the operations for the test harness's sleep time (but does "blame" them for their own time holding back the test harness):

    long expectedStartTimeNsecDeltaPerOp = 1000000000L / requestedOpsPerSec;
    long testStartTime = System.nanoTime();
    long elapsedTime = 0;
    long nOps = 0;

    while (testNotComplete()) {
        long expectedStartTimeNsec = System.nanoTime(); // Avoid counting sleep time against ops.
        elapsedTime = expectedStartTimeNsec - testStartTime;

        while (nOps * 1000000000L < requestedOpsPerSec * elapsedTime) { // elapsedTime is in nsec
            long actualStartTimeNsec = System.nanoTime();
            doOperation();
            long measuredEndTimeNsec = System.nanoTime();

            long coordinatedOperationLatencyNsec = measuredEndTimeNsec - actualStartTimeNsec;
            long operationLatencyNsec = measuredEndTimeNsec - expectedStartTimeNsec;

            coordinatedLatencyHistogram.recordValue(coordinatedOperationLatencyNsec);
            latencyHistogram.recordValue(operationLatencyNsec);

            expectedStartTimeNsec += expectedStartTimeNsecDeltaPerOp;
            expectedStartTimeNsec = Math.min(expectedStartTimeNsec, measuredEndTimeNsec); // needed to avoid accumulating gifts

            elapsedTime = measuredEndTimeNsec - testStartTime;
            nOps++;
        }

        TimeUnit.NANOSECONDS.sleep(someSmallIntervalNsec); // Optional. Can spin instead.
    }

[Note: this is pseudo logic meant for demonstrating the concept. It is not tested code and may fail a reality test].

Gil Tene

Jul 2, 2014, 3:48:54 PM
to <mechanical-sympathy@googlegroups.com>
LatencyUtils is meant for measuring situations where no "known stable rate" exists, which makes it useful for real-world measurement, and for measurements that are orthogonal to some unknown or unrelated test harness. But LatencyUtils uses/depends on a pause detector, and its default pause detector only considers things that stall the whole process to be a pause. You can plug in your own detector, but you'd need to determine what a pause is. In the case of a benchmark harness with some requested execution throughput, you know what a pause is (any situation where you fall behind the required request count for the elapsed time), so you can precisely correct for it with logic similar to my other posting, which will be better than the LatencyUtils estimators.

Sent from my iPad

Alex Bagehot

Jul 6, 2014, 6:13:15 AM
to mechanica...@googlegroups.com

Hi, a minor note: Poisson processes have exponentially distributed inter-arrival times. There is some discussion here of candidate inter-arrival time distributions: http://perfdynamics.blogspot.co.uk/2010/05/load-testing-think-time-distributions.html?m=1

Given that I have an interest in open-workload simulations, I would be interested if you could send the gory details of where to start hacking from.
Like most people my free time is variable, but I would have thought Gil's method should be achievable to start with.

Presumably there is also a difference between one thread driving the load while calculating adjustments, and requests overlapping once the mean latency starts to exceed the mean inter-arrival time? Open workloads can have worse latency due to the lack of back-off. I am not yet thinking about whether it is possible to implement.

Thanks
Alex


Aleksey Shipilev

Jul 6, 2014, 9:10:43 AM
to mechanica...@googlegroups.com
Hi Alex,

On 07/06/2014 02:13 PM, Alex Bagehot wrote:
> Given I have an interest in open workload simulations I would be
> interested if you could send the gory details of where to start hacking
> from?

Sure, check out the current source here (follow "Bleeding Edge"):
http://openjdk.java.net/projects/code-tools/jmh/

Patch the source with this, and fill in the blanks:
http://cr.openjdk.java.net/~shade/jmh/stub-throughput-latency.patch

JMH contributions are governed by OpenJDK rules, which means we need a
signed OCA to look into your code (do submit this early, there is a
significant lag in OCA acceptance process):
http://openjdk.java.net/contribute/

If you have problems/questions, shoot them at jmh-dev@:
http://mail.openjdk.java.net/mailman/listinfo/jmh-dev

> Like most people my free time is variable but I would have thought Gil's
> method should be achievable to start with.

Good, thank you!

> Presumably there is also a difference between one thread driving the
> load with calculating adjustments, and requests overlapping where the
> mean latency starts to exceed the mean inter arrival time? Open
> workloads can have worse latency due to the lack of back off. Not
> thinking about whether it is possible to implement.

Well, we should support multiple threads sharing the throughput, and
assume the threads are not symmetric. This complicates the throughput
budget calculation logic a bit, but it can still be done scalably.

-Aleksey.
