New version of HdrHistogram on Github.


Gil Tene

Dec 24, 2012, 3:46:41 AM
to mechanica...@googlegroups.com
I just finished pushing a new version of HdrHistogram to Github. After various tweaks and optimizations, its constant-time value recording call clocks in at 3-6 nsec (!!) on modern x86 CPUs. Obviously, this is for hot, cache-hitting recording calls. Colder recording calls (where work between recordings thrashes the cache) can expect to see 1-3 cache misses per recording. Either way, this is now plenty good enough for all sorts of ultra-high-speed value stats collection at high fidelity, and without loss of accuracy.




Wojciech Kudla

Dec 24, 2012, 5:46:38 AM
to mechanica...@googlegroups.com
Gil,

Thank you for sharing. That's a great piece of very useful software.
Some time ago I was working on measuring low latency applications and found that the act of measurement usually distorts the results.
In my case it was all about minor gc pauses and storing individual transaction times in a large array. As the array was allocated in the heap, individual gc times were heavily affected by the fact that tenured space was mainly polluted with live data from the measurement.
My solution to the problem was moving the array to an off-heap area.
Do you think it would make sense to do something similar with your implementation?

Gil Tene

Dec 24, 2012, 12:33:57 PM
to mechanica...@googlegroups.com
No need for off-heap storage. Using off-heap storage would not reduce any pressure in this case, and compared to the on-heap implementation it would slow the hot code down by adding both an additional dereferencing step and a necessary range check operation (with an associated cache line access to the array length) that the JIT compilers in modern server-class JVMs are able to completely eliminate for on-heap arrays.

HdrHistogram records high fidelity value statistics (to an arbitrarily selected decimal point accuracy) for an unlimited number of samples in a fixed memory footprint. Once created (on heap), a Histogram does literally Zero allocation. That's one of the main goals in the whole design. I built it that way despite the fact that the Zing JVM (my baby) couldn't care less about GC pressure (since it is concurrent in both oldgen and newgen), mostly so that people will be able to use it on any JVM, and in ultra-low-latency environments.
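For orientation, basic use looks something like this (a sketch, assuming the org.HdrHistogram package name; the constructor and recording forms shown are the ones discussed later in this thread):

import org.HdrHistogram.Histogram;

public class RecordingSketch {
    // Track values from 1 usec to 1 hour, at 3 decimal points of resolution.
    // The counts array is allocated once, here, and never again:
    static final Histogram HISTOGRAM = new Histogram(3600L * 1000 * 1000, 3);

    // Hot path: constant-time, allocation-free recording.
    static void record(long latencyUsec) {
        HISTOGRAM.recordValue(latencyUsec);
    }
}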

At first, I used Zero allocation because I wanted people to be able to use high fidelity response time and latency stats gathering on any and all JVMs. Good measurement technique and percentile reporting, especially in the presence of outliers caused by minor GC and such, always make us look good. But then the challenge of making the measurement code as short and fast as possible made me go farther and farther, cutting out storage hierarchies, dereferencing steps for the hot recording code paths, and making sure everything gets inlined properly and optimized away on good JITs. In this latest code iteration, for example, I took on the challenge of making the footprint of additive "incremental histograms" (with capped count level per increment) as small as possible so that I could practically transmit them over the wire in a future streaming version of jHiccup that I'm working on. That's what the smaller-count-type variations (ShortHistogram and IntHistogram) are for.

In the current implementation, when run on server-class JITs that do stuff like range check elimination in hot code, a recording of a value into the histogram accesses at most 2-3 cache lines (depending on whether or not the Histogram object's main fields span cache lines). Only one of those cache line accesses is "random" in the sense that its location depends on the recorded value, and even that one will typically hit in the cache in high frequency recording situations.

At 3 nanoseconds per recorded value and fixed time/space for practically infinite recording lengths, this code is already pretty tight, and a good example of putting mechanical sympathy to work (IMO) while maintaining clean semantics and functionality. But the same way some people like spending time polishing the paint on vintage cars, I keep looking for ways to make the hot path here shorter and shorter. Any comments and under-the-hood suggestions on making it even faster are welcome!

Timothy Chen

Dec 25, 2012, 12:54:57 PM
to mechanica...@googlegroups.com
Hi Gil,

I wonder how you measured your footprint and the other numbers during your development?

Thanks,

Tim

Gil Tene

Dec 26, 2012, 2:12:29 AM
to mechanica...@googlegroups.com
On footprint: it's pretty straightforward, especially once you know the basics of how JVMs lay out objects in memory. I currently use a very conservative (erring on the large side) estimate of the total footprint of a Histogram: 512 bytes + the array elements involved in actual count storage. The array elements will tend to dominate this number in most use cases, and 512 bytes is an over-estimate of the field and object header storage taken by the various parts of a Histogram.

Here is a back-of-the-envelope calculation example (you should probably glance at the code as you read this):

- HdrHistogram's current Histogram class has 11 fields in it, two of which are references to other objects (a HistogramData and a counts array of longs). The non-reference fields total 6 ints and 3 longs, and together with the two references the fields take up 64 bytes on 64-bit JVMs. The object header is either 8 or 16 bytes, depending on the JVM used (e.g. Oracle HotSpot and OpenJDK use 16-byte headers; Zing uses 8-byte headers). So a total of 72-80 bytes.

- The counts array object header is either 24 or 16 bytes, depending on JVM. The space used by counts array elements themselves will vary by Histogram configuration (a formula is provided in the JavaDoc and comments).

- The current HistogramData object has two ints and three references (to the containing histogram, and to internally cached PercentileIterator and RecordedValuesIterator instances). Total of 32 bytes for fields, and with the header that's 40-48 bytes.

You get the picture by now. It's easy to see that PercentileIterator and RecordedValuesIterator will each fall under 128 bytes... The total footprint of all that is involved in a Histogram object (not counting the array elements storage) is well under 512 bytes.

On "other numbers":

- The Zero allocation part is mostly careful design (no new objects being created in either value recording or iteration loops). It is also validated by running and watching the heap via either an MXBean or GC logs. Note that creating any sort of output usually does allocate objects (the various common ways to print stuff in Java do plenty of internal allocation). When trying to measure the allocation behavior of specific code, you need to be careful that your observation and reporting code doesn't falsely make you think the observed code is allocating more than it really does.
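A rough way to do that MXBean-style validation (a sketch only; the numbers are indicative, since the JVM and other threads may also allocate, and MemoryMXBean.gc() is best-effort, with the same caveats as System.gc()):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import org.HdrHistogram.Histogram;

public class AllocationCheck {
    public static void main(String[] args) {
        Histogram histogram = new Histogram(3600L * 1000 * 1000, 2);
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        memory.gc();
        long before = memory.getHeapMemoryUsage().getUsed();

        for (int i = 0; i < 10_000_000; i++) {
            histogram.recordValue(i % 1000); // the code under observation
        }

        memory.gc();
        long after = memory.getHeapMemoryUsage().getUsed();
        // Print only after both snapshots are taken; printing itself allocates.
        System.out.println("Approx. retained bytes: " + (after - before));
    }
}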
 
- The timing observation is measured. See perf.org.Histogram.HistogramPerfTest for a sample micro-benchmark. It clocks in at just over 3 nanoseconds per value recording on my Ivy Bridge-powered MacBook Pro laptop.

Figuring out what the JITs will do to the code in the hot paths:

- This one takes some experience, and is probably the harder one to "teach". Understanding what the JIT compiler and the runtime are actually capable of certainly influences how I would write Java code that I consider performance-critical. I try to only spend time on this in what I think of as "hot paths", leaving non-hot and non-performance-critical code alone in this respect (but good habits are always good to use). I also try hard to keep functionality and abstraction from being hurt by coding-for-optimization. The HdrHistogram implementation is a good example of this sort of exercise. 

- The JITs get to do quite a bit of cool stuff. Some of it is surprising, and they will sometimes eliminate things that people may have assumed to be "impossible" to eliminate. One of my favorite examples is how null checks can often be eliminated if the JVM handles the SEGVs caused by (untested) null references and knows how to throw the right exception in the right context (the one that the null check would have thrown if it were there).

- Looking at the generated code that comes out of the JVM's JIT is very, very useful. I usually use Zing's built-in ZVision tool, which lets me zoom into a hot method's generated code and even see which instructions are hot. But you can also get details of what the HotSpot-generated code looks like on OpenJDK by using a debug build and turning on the -XX:+PrintOptoAssembly flag.

Jean-Philippe BEMPEL

Dec 26, 2012, 4:38:46 AM
to mechanica...@googlegroups.com
Hello,

A better way to inspect JIT-generated code is to use -XX:+PrintAssembly (see my blog post on it: http://jpbempel.blogspot.com/2012/10/how-to-print-dissasembly-from-jit-code.html & also this: https://wikis.oracle.com/display/HotSpotInternals/PrintAssembly)

Jean-Philippe BEMPEL

Timothy Chen

Dec 26, 2012, 6:18:36 PM
to mechanica...@googlegroups.com
Hi Gil,

Thanks for the tips; after reading the code and your comments it makes sense now how you got to 512.

I've recently started writing a high-performance data analytics system, and I don't have much insight into how the JIT and runtime internals work (just a fairly high-level view).

I wonder how one goes about starting to learn about these things (even though it's hard, as you mentioned)?

Thanks!

Tim

Gil Tene

Dec 26, 2012, 9:09:56 PM
to mechanica...@googlegroups.com
> I wonder how one goes about starting to learn about these things (even though it's hard, as you mentioned)?

Well, read a lot. And ask a lot of questions. And make lots of mistakes - those are the best ways to learn.

Here is a short reading list of useful terms to learn about. Google and read/understand, and follow what you find to other related stuff, and you'll probably learn a lot about what JITs can do to your code:
Inlining. Constant Propagation. Dead code elimination. Loop unrolling. Inline caching. Monomorphic and MegaMorphic. Uncommon Traps. Tiered compilation. Null check elimination. Bounds-check elimination. Lock coarsening. Escape analysis. 

As a general starting point for guessing at what a JIT compiler could figure out and do to code, assume the compiler is really, really, really smart. Smarter than you. Smarter than you think is possible. I like to look at the code and assume that whatever I can deduce and pre-compute (or eliminate) myself, the compiler can also deduce and do. If you can take a code sequence (across multiple methods and values), think about it hard, and simplify it to a short set of operations that will perform the same work, assume the compiler can do the same.

Then, look at the code the compiler actually generates, and try to figure out why it didn't do quite as well as you had hoped. It is far better to be disappointed by the compiler, and to address that disappointment by making it "easier" for it to detect the efficiencies you want it to find, than to assume the compiler is stupid. Assuming the compiler is stupid wastes lots of time in "helping" it do stuff it can already do in its sleep, and often makes code ugly for no reason.

For an example of this sort of thinking in practice, I often use micro-benchmarking. Whenever I write a micro-benchmark loop and want to make sure the compiler doesn't optimize the loop away, I make sure that:
1) The loop has side effects that cannot be eliminated (the best way to do that is to output something based on an accumulated result of some sort).
2) The side effects depend on the code I want to make sure runs.
3) The work is not being repeated on constant values.
4) The side-effect behavior is complex enough that it will not be practical to pre-derive.

And don't go thinking compilers don't try to figure out what you are doing. Compilers know how to compute series, for example. Many compilers out there are smart enough to replace the following code:

static int doMyComputation(int i) { return i; }
...
int sum = 0;
for (int i = 0; i < 10000; i++) {
   sum += doMyComputation(i);
}
System.out.println(sum);

with:
System.out.println(49995000);

But most compilers today don't know how to do the same to:

int sum = 0;
for (int i = 0; i < 10000; i++) {
   sum += doMyComputation(i % 79);
}
System.out.println(sum);

(Note that compilers could get smart enough to beat this too; they just aren't there yet in most cases, so this is "good enough" for current micro-benchmarking practices.)

Peter Lawrey

Dec 27, 2012, 4:00:47 AM
to Gil Tene, mechanica...@googlegroups.com
You have reminded me of an article I wrote showing that the JIT can determine which classes can be called at a given point in the code and optimise for those, even if other classes are called in the same loop.

http://vanillajava.blogspot.co.uk/2012/12/performance-of-inlined-virtual-method.html



Jean-Philippe BEMPEL

Dec 27, 2012, 4:54:19 AM
to mechanica...@googlegroups.com
Hello Timothy,

There is a useful site where you can find a lot of information about HotSpot internals: https://wikis.oracle.com/display/HotSpotInternals/Home

I have also blogged on how to examine the code the JIT produces (see the links in my earlier message).
Hope you will find these links interesting.

Jean-Philippe BEMPEL

Wojciech Kudla

Dec 27, 2012, 5:11:19 AM
to mechanica...@googlegroups.com
Gil,

Thank you for your answer. I probably didn't explain my solution in enough detail. By moving the array off-heap I meant completely replacing the array with off-heap storage, such as allocating a contiguous area for the samples using sun.misc.Unsafe. The same goes for storing and fetching individual values from that area. That way you don't deal with bounds checking or the dereferencing step. Am I right?

Thanks,

Wojtek

Peter Lawrey

Dec 27, 2012, 5:40:41 AM
to Wojciech Kudla, mechanica...@googlegroups.com
Hello Wojtek,

If you are considering using Unsafe-allocated memory, I suggest you do this via an interface, with an implementation that does the same thing using ByteBuffers. (It is not too tedious to implement, as the method names are mostly the same.) ByteBuffers can give you all the functionality you can get with Unsafe, with additional checks (ByteBuffers are limited to a size of Integer.MAX_VALUE, however). Once you are comfortable that your program works correctly, that the checks are not needed, and that you need the extra performance boost, you can switch in the Unsafe implementation.
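A minimal sketch of that pattern (the interface and class names here are illustrative, not taken from any of Peter's libraries):

import java.lang.reflect.Field;
import java.nio.ByteBuffer;

interface LongStore {
    long get(long index);
    void put(long index, long value);
}

// Checked implementation: the ByteBuffer performs the bounds checking.
final class ByteBufferLongStore implements LongStore {
    private final ByteBuffer buffer;
    ByteBufferLongStore(int capacity) {
        // capacity is limited to Integer.MAX_VALUE / 8 longs
        buffer = ByteBuffer.allocateDirect(capacity * 8);
    }
    public long get(long index) { return buffer.getLong((int) (index * 8)); }
    public void put(long index, long value) { buffer.putLong((int) (index * 8), value); }
}

// Unchecked implementation: swap in only once the checked version is proven correct.
final class UnsafeLongStore implements LongStore {
    private static final sun.misc.Unsafe UNSAFE = getUnsafe();
    private final long address;
    UnsafeLongStore(long capacity) { address = UNSAFE.allocateMemory(capacity * 8); }
    public long get(long index) { return UNSAFE.getLong(address + index * 8); }
    public void put(long index, long value) { UNSAFE.putLong(address + index * 8, value); }
    private static sun.misc.Unsafe getUnsafe() {
        try {
            Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (sun.misc.Unsafe) f.get(null);
        } catch (Exception e) { throw new AssertionError(e); }
    }
}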

I have done this in one of my libraries and found I only use the Unsafe version for benchmarks; I have never needed it in production ;) This is largely due to the fact that you often max out the cache-to-main-memory bridge with a few threads, even with bounds checking, and if you grow large enough, the throughput of your disk subsystem becomes your limiting factor (which is when you get the fastest SSD you can ;)

Peter.


These access memory-mapped byte buffers, so the space used is only limited by disk space and you get persistence, the latter using the raw address().

Wojciech Kudla

Dec 27, 2012, 5:48:51 AM
to mechanica...@googlegroups.com, Wojciech Kudla
Peter,

Thank you for your input. Of course I know your libraries :) They have always been a source of inspiration and knowledge for me. My first approach used ByteBuffers as you suggested but I really needed the additional boost of Unsafe intrinsics.
The issue here is the impact of having a large portion of the old gen polluted with a data structure that is not an integral part of the application.
When tuning for the shortest minor gc pauses (we don't experience major gc for weeks) I noticed that old gen occupancy and size affect minor gc times, hence my question to Gil about the impact of storing all the samples in the heap.

Thanks,

Wojtek 

Peter Lawrey

Dec 27, 2012, 6:11:02 AM
to Wojciech Kudla, mechanica...@googlegroups.com
In terms of what you can do with HotSpot, you are limited by how much you can put in the old gen without your GC time becoming too large.  Azul has a concurrent collector which doesn't have the same limitations.

For the trading systems I work on we don't need 24h service. For this reason we can run a very large Eden space, large enough for a day of trading, which is only cleaned up each night as a maintenance task (a full GC is triggered deliberately at, say, 5 AM).

The bulk of the data is stored off heap, in ways which minimise the garbage produced in doing so and have only a notional impact on a GC.

When I use Unsafe in benchmarks I see a 30% performance improvement, mostly in reading and writing bytes, e.g. parsing text. When it comes to longs or doubles the difference is less than 5%.

IMHO, how much difference you will see in a real program depends on how much text you are reading and writing. If your application is sufficiently multi-threaded, you might find that CPU is not your bottleneck, so reducing it doesn't help much.





Wojciech Kudla

Dec 27, 2012, 6:21:51 AM
to mechanica...@googlegroups.com, Wojciech Kudla
That's the very same experience I had from working on low latency trading systems. Most of them can be freely maintained within windows that span minutes or even hours, as there are not too many markets open 24/7 or even 24/5.
I'll try to compare Gil's original heap-based Histogram with my Unsafe-based implementation in terms of minor gc impact, and share the results.
Thanks,

Wojtek

Jean-Philippe BEMPEL

Dec 27, 2012, 8:01:32 AM
to mechanica...@googlegroups.com
Hello Wojciech,

We have considered off-heap storage for our application too, for the same reason: reducing minor GC time.
One of the major factors in minor GC duration is card scanning. So if the old gen contains a large portion of allocated memory, your card scanning can be significant.
To reduce the quantity of old gen consumed, we store some caches in an off-heap structure (basically a byte array; it is easier).
For info, the size of the old gen is not related to the time spent on card scanning, only the part of the old gen that is allocated. So you can set a size of 32GB and only consume 2GB of your old gen, with short minor GCs.

Some info about card scanning/marking:
http://www.ibm.com/developerworks/library/j-jtp11253/

As Peter Lawrey suggests, I strongly advise testing the Zing JVM from Azul Systems, which dramatically reduces minor GC time (for our application, down to 1-2 ms max, but it can be less depending on your app).
But I am sure that Gil will enjoy helping you with that! ;-)

Jean-Philippe BEMPEL

Wojciech Kudla

Dec 27, 2012, 8:12:17 AM
to mechanica...@googlegroups.com
Jean-Philippe,

Unfortunately, Zing is not an option for us at the moment even though it's superior to other JVM implementations in terms of low latency.
I managed to reduce minor gc pauses on HotSpot down to 600-700 microseconds, which is a satisfactory result for my client (for my best-performing trading app).
You are completely right that it's allocated space rather than the size of old gen that affects card scanning. But that still means a huge array in tenured may badly influence minor gc pauses. I hope Gil will find time to elaborate on this.

Wojtek

Gil Tene

Dec 27, 2012, 12:15:37 PM
to mechanica...@googlegroups.com
Leaving aside generic questions about where to put dynamically allocated stuff (oldgen or off heap), and focusing on the specific case of HdrHistogram, the biggest step was achieving constant footprint, such that no dynamic allocation of any type was needed. Logging values does not allocate anything, ever. So all we are talking about is the size of the fixed space footprint and where it should be held.

The size of the internal counts array is a function of the highestTrackableValue and numberOfSignificantValueDigits settings chosen for a histogram. For example, the fixed size counts array needed to track response times in the dynamic range between 1 microsecond and 1 hour, maintaining a 3 decimal point resolution across the entire range, and using 64 bit (long) count values is "only" 184KB in size. The same dynamic range, maintaining only 2 decimal points of resolution will be 26KB. And the same thing using 16 bit counts (ShortHistogram) will be 6.5KB. [This last one is useful for incremental additive histograms, e.g. characterizing a 1 second interval of events that can only happen 1,000 times per second].
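For the curious, here is how those three configurations look in code (a sketch, assuming the org.HdrHistogram package name and the Histogram(highestTrackableValue, numberOfSignificantValueDigits) constructor form; getEstimatedFootprintInBytes() is discussed below):

import org.HdrHistogram.Histogram;
import org.HdrHistogram.ShortHistogram;

public class FootprintSketch {
    public static void main(String[] args) {
        long oneHourInUsec = 3600L * 1000 * 1000; // 1 usec to 1 hour dynamic range

        Histogram fine = new Histogram(oneHourInUsec, 3);               // ~184KB of long counts
        Histogram coarse = new Histogram(oneHourInUsec, 2);             // ~26KB
        ShortHistogram interval = new ShortHistogram(oneHourInUsec, 2); // ~6.5KB, 16-bit counts

        System.out.println(fine.getEstimatedFootprintInBytes());
    }
}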

These are semi-trivial sizes when it comes to oldgen footprint, even for stop-the-world newgen collectors. Given that these are arrays of scalar values (no refs in them), the only real impact of holding this stuff in oldgen is the card scanning cost during newgen collections. On Oracle and OpenJDK HotSpot, each 512 bytes of heap is tracked with a 1-byte card value. That means the 184KB array in old gen takes an extra 368 bytes in the card table. Another way of thinking of it is that this will not raise the newgen pause times due to card scanning by a factor larger than 184KB/largest-size-of-oldgen-scanned-without-this-in-it, which is presumably a very small factor. On Zing, of course, this has no impact at all, since newgen is concurrent (including the card scanning part).

So, since the footprint is not "huge", the alternative of holding the same array in off-heap storage (or in direct mapped ByteBuffers) will simply not make any real difference in pause times on any JVM in this case. I doubt the difference will even be measurable.

And Wojtek, may I ask what makes Zing a non-option for you? Is your client running their low latency stuff on something other than Linux/x86?

Kirk Pepperdine

Dec 27, 2012, 12:25:44 PM
to Gil Tene, mechanica...@googlegroups.com
Hi Gil,

The footprint in question here doesn't even rise anywhere near the noise floor produced by even the most memory-quiet of applications.

Regards,
Kirk


Gil Tene

Dec 27, 2012, 12:28:38 PM
to mechanica...@googlegroups.com
Some sort of bounds check is needed regardless of where you store things. HdrHistogram's code doesn't directly waste any effort on bounds checking. Its internal math is set up such that it will let an ArrayIndexOutOfBoundsException happen if you try to log a value larger than the configured highestTrackableValue, or a negative value, but it does not compute or check these boundaries directly.

In the on-heap, actual-Java-array implementation, this leaves the actual bounds checking up to the Java runtime, and the runtime can often eliminate bounds checking through various optimizations. In the various off-heap and Unsafe cases, your code will be responsible for performing bounds checks. If you don't, values greater than highestTrackableValue and negative values will result in heap corruption - not a good behavior choice for a library.
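To illustrate the difference (hypothetical helper methods, not HdrHistogram's actual internals):

// On-heap: the runtime enforces the bounds check, and the JIT can often
// eliminate it entirely in hot loops.
static void recordOnHeap(long[] counts, int index) {
    counts[index]++; // out-of-range index -> ArrayIndexOutOfBoundsException
}

// Off-heap via Unsafe: nothing checks unless your code does.
// (UNSAFE obtained reflectively, as in the earlier sketch.)
static void recordOffHeap(long baseAddress, long length, long index) {
    if (index < 0 || index >= length) {             // an explicit check your code must pay for
        throw new ArrayIndexOutOfBoundsException();
    }
    long addr = baseAddress + index * 8;
    UNSAFE.putLong(addr, UNSAFE.getLong(addr) + 1); // without the check: silent heap corruption
}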

-- Gil.

Kirk Pepperdine

Dec 27, 2012, 12:47:37 PM
to Gil Tene, mechanica...@googlegroups.com

Which brings up another question: is moving stuff off-heap really all that useful?

-- Kirk


Wojciech Kudla

Dec 27, 2012, 1:43:03 PM
to mechanica...@googlegroups.com, Gil Tene
Kirk,

My experience shows it is, as long as you're dealing with large datasets. The footprint Gil's HdrHistogram introduces is too small to make any real difference.

Wojciech Kudla

Dec 27, 2012, 1:54:00 PM
to mechanica...@googlegroups.com
Gil,

That's the clarification I was looking for. Thank you for the detailed explanation backed up with some calculations. Given the numbers, it's obvious now that moving the samples off-heap wouldn't make any observable difference. My client is a big institution and as such suffers from the inertia characteristic of large organisations. I heard two explanations for why not Zing:
1. HotSpot has been "good enough", so there's no point in bearing additional costs
or
2. This kind of change would require a large-scale evaluation and approval process, which would take ages
Zing is a great piece of software and I still hope we'll get a chance to deploy it in production eventually. I'll try to initiate a second take and see how it goes.

Thanks,

Wojtek

Timothy Chen

Dec 29, 2012, 3:14:20 AM
to Jean-Philippe BEMPEL, mechanica...@googlegroups.com
Hi Jean/Gil,

Thanks for the links and info. I do understand at a high level what the JIT can do, but I'm not sure what specific optimizations each JVM performs; as Gil said, I do need to start experimenting to understand it more.

And the HotSpot internals link is definitely very useful!

Tim



Joachim Dorn

Jan 2, 2013, 8:54:26 AM
to mechanica...@googlegroups.com, Jean-Philippe BEMPEL
Hi Tim,

I have not used it, nor do I directly know anybody who has, but from the slides of the last JVM Language Summit I was pretty impressed by Graal (http://wiki.jvmlangsummit.com/images/c/c4/Simon_Graal_Presentation.pdf), which offers you, among other things, a Java API to HotSpot statistics like this (copy & pasted from the slides):


class ProfileDemo {
  public static void main(String[] args) throws Exception {
    MetaAccessProvider meta = Graal.getRuntime().getCapability(MetaAccessProvider.class);
    Method m = String.class.getMethod("valueOf", Object.class);
    ResolvedJavaMethod rjm = meta.getResolvedJavaMethod(m);
    ProfilingInfo info = rjm.profilingInfo();
    System.out.println(MetaUtil.profileToString(info, rjm, "\n"));
  }
}

$ mx vm -XX:-BootstrapGraal ProfileDemo
canBeStaticallyBound: true
invocationCount: 923
executionCount@1: 2600
branchProbability@1: 1.000
executionCount@6: 0
branchProbability@6: 0.000
executionCount@10: 2609
types@10:

Cheers

Kamil Demecki

Jan 20, 2013, 11:42:07 AM
to mechanica...@googlegroups.com
Let me ask: how do you normally gather histogram data? This kind of data has value under conditions close to the production environment. For similar problems I've used three approaches:

- a deployment approach with something like System.getProperty: set a static final IS_HISTOGRAM from the passed property, and call the histogram only if IS_HISTOGRAM is true

- a deployment approach with AspectJ: restart the JVM with the agent and aspect files

- a programming approach: add the histogram on a separate branch, configure CI for this branch, build and deploy

Regards

Kamil

Gil Tene

Jan 20, 2013, 12:59:24 PM
to mechanica...@googlegroups.com
If you are asking how I go about turning data collection and histogram logging code on/off for production runs:

1. I am a big believer in collecting this sort of data in production. A production system with no monitoring information is a black box that is hard to manage, and HdrHistogram's accurate percentile reporting capability is very useful for monitoring how production environments actually perform. The cost is so low that logging every single operation/transaction/message in a histogram (and even steps within a single operation) is very practical.

2. If there are aspects of data collection or logging that you do with such density and cost that you want to avoid them in production, there is no need for separate code builds or code weaving tricks like using AspectJ. All you need to do is perform the work involved (the work you want to avoid in production mode) under a conditional block controlled by a class static final variable that depends on some startup parameter (command line, property, config file, etc.). As long as the condition is based on a class static final value, the JIT compilers will be able to propagate the constant and eliminate the condition (either including or excluding the code in the generated code that is actually run). The result is indistinguishable (from a performance point of view) from not having the code there at all. This is how assert() works...
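A minimal sketch of that pattern (the property name and class names here are illustrative):

import org.HdrHistogram.Histogram;

public class Stats {
    // Resolved once during class initialization; the JIT then treats it as a
    // constant and either fully includes or fully eliminates the guarded block:
    static final boolean HISTOGRAM_ENABLED = Boolean.getBoolean("myapp.histogram");

    static final Histogram LATENCIES = new Histogram(3600L * 1000 * 1000, 2);

    static void recordLatency(long latencyUsec) {
        if (HISTOGRAM_ENABLED) {
            LATENCIES.recordValue(latencyUsec);
        }
    }
}

// Enabled at startup with: java -Dmyapp.histogram=true ...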

Kamil Demecki

Jan 20, 2013, 1:48:36 PM
to mechanica...@googlegroups.com
Thanks for the info. I agree, and I'm aware of the JIT optimizations for static final variables - my first approach is also based on a static final IS_HISTOGRAM.
I think the AspectJ case is useful for making quick measurements in a test environment, for places that don't need to be monitored in production - it is enough to restart the JVM with the agent and the new method names in the XML. However, I'm not sure how AspectJ will impact JIT optimizations like inlining.

Regards

Kamil

Ariel Weisberg

Jan 26, 2013, 5:27:27 PM
to mechanica...@googlegroups.com
Hi,

Thank you for contributing this.

I am trying this out right now to track execution time of stored procedures. Execution time is usually tens to hundreds of microseconds, but it can be minutes or hours at times.

Space utilization is a concern for me, as I believe that in the worst case I will have to maintain tens of thousands of histograms. I don't need much in the way of accuracy; I really just need to know what order of magnitude I am looking at for different percentiles. I am not clear on how to set numberOfSignificantDigits. The terms "value resolution" and "separation" are new to me, and I don't know what the impact of different values will be.

I am also not clear on how to provide an appropriate value for expectedIntervalBetweenValueSamples. I know that I will have issues with outliers, where a procedure can execute in a few hundred microseconds most of the time but run for tens of seconds when there is some kind of skew. However, there are also procedures that are expected to execute in hundreds of millis or seconds, so I can't hard-code a one-size-fits-all value.

Thanks,
Ariel

Gil Tene

Jan 27, 2013, 1:31:25 AM
to mechanica...@googlegroups.com
There are three questions in here: one on numberOfSignificantDigits, one on memory footprint, and one on expectedIntervalBetweenValueSamples:

On numberOfSignificantDigits:

numberOfSignificantDigits is [supposed to be] pretty straightforward. It establishes the resolution and accuracy of your tracked values (not the counts for those values; the value [ranges] themselves). You can think of it simplistically this way: a Histogram with 2 decimal points of value resolution will track counts in value windows that are within 1% of the values actually measured. The values it produces for the various percentile, average, and standard deviation calls will also be roughly within 1% of exact reality. A Histogram with 1 decimal point of value resolution will give values within 10%, and one with 3 decimal points of resolution will be within 0.1%.

To give you a feel for how this compares with either linear or logarithmic histograms (the most commonly used histogram forms out there): a logarithmic histogram provides values that are up to 50% off (assuming power-of-2 buckets), and a linear histogram requires an enormous amount of storage if it wants to cover a good dynamic range. E.g. tracking values between 1 microsecond and 1 hour with at least 1% value accuracy across the range requires 1 microsecond buckets (otherwise accuracy will fall below 1% at the 100usec mark), which would mean 3.6 billion buckets...

On footprint:

You can get a pretty clear estimate of a Histogram's footprint using getEstimatedFootprintInBytes(). Assuming (from your comment below) that you want to track values between 1 microsecond and 24 hours, a Histogram that maintains 2 decimal points of value resolution across that entire range would be created with Histogram(24 * 3600L * 1000 * 1000, 2), and have a footprint of about 32KB. The same range covered with a single decimal point of value resolution would have a footprint of less than 5KB. Furthermore, these are the sizes of Histograms that use long values for counters (so overflows are unlikely to happen within a practical program's lifetime). If you know that your counts are more limited (e.g. fewer than 2B samples during your runtime), you can nearly halve your footprint by using IntHistogram [I'd recommend against doing that unless you really worry about footprint, or are planning to move this stuff over a wire].

I'm not sure what sort of scenario would require you to track tens of thousands of histograms. Are there really tens of thousands of different stored procedures (each with multiple executions at varying time lengths) in your system? If so, that's ok, as tens of thousands of histograms, each with ~+/- 1% value accuracy, would still only translate to hundreds of megabytes of heap space. [And remember, this space remains static once allocated, and is composed of mostly scalar values, so it should not burden GC much at all.]

On expectedIntervalBetweenValueSamples:
This mode of recording is optional. It is intended to help correct for Coordinated Omission. Beyond the explanations in the JavaDocs, you can see some discussion of the problem in my slide deck on "How Not To Measure Latency" (http://qconsf.com/dl/qcon-sanfran-2012/slides/GilTene_HowNotToMeasureLatency.pdf ; coordinated omission is discussed starting at slide 27). Recording with expectedIntervalBetweenValueSamples is [VERY] useful when your measurements have a known pace or interval (e.g. when a load generator is known to be sending 1,000 requests per second, or an incoming tick or data stream of some sort has a known frequency). It is not useful or practical when your input has unpredictable intervals. For your situation, I suspect that you cannot predict the interval between value samples, and that you should just use the recordValue(long value) form.
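In code, the two forms look like this (a sketch; the 1,000-requests-per-second pacing is just the example from above, with values in microseconds):

// Paced measurements (1,000 requests per second => 1,000 usec expected interval):
static void onPacedResponse(Histogram histogram, long responseTimeUsec) {
    histogram.recordValue(responseTimeUsec, 1000); // corrects for coordinated omission
}

// Unpredictable arrivals (e.g. the stored-procedure case): the plain form.
static void onResponse(Histogram histogram, long responseTimeUsec) {
    histogram.recordValue(responseTimeUsec);
}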

You can see an example of using Histogram in a real-world thing with my jHiccup code (also public domain under CC0; it can be found at http://www.azulsystems.com/jHiccup). jHiccup uses a 2 decimal point value accuracy histogram tracking latency values between a microsecond and an hour in its reporting of measured JVM "hiccup" behavior (basically any situation where the JVM was not executing for some amount of time, regardless of cause). jHiccup samples this behavior 1,000 times per second by sleeping for one msec and recording the length of time the sleep actually took. jHiccup makes good use of the recordValue(long value, long expectedIntervalBetweenValueSamples) form: its expected interval is 1 msec, so if it sees a sample that took 350 msec, for example, that means it "missed" at least 348 samples, which Histogram will auto-recreate given the expected interval.
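A minimal sketch of that measurement loop (illustrative only, not jHiccup's actual code; values are recorded in microseconds):

import org.HdrHistogram.Histogram;

public class HiccupMeterSketch {
    public static void main(String[] args) throws InterruptedException {
        Histogram hiccups = new Histogram(3600L * 1000 * 1000, 2); // 1 usec to 1 hour
        final long expectedIntervalUsec = 1000; // sample 1,000 times per second

        while (true) {
            long start = System.nanoTime();
            Thread.sleep(1); // intend to pause for 1 msec
            long actualUsec = (System.nanoTime() - start) / 1000;
            long hiccupUsec = Math.max(0, actualUsec - expectedIntervalUsec);
            // If the sleep took 350 msec, this form also back-fills the ~348
            // samples that were "missed" during the stall:
            hiccups.recordValue(hiccupUsec, expectedIntervalUsec);
        }
    }
}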