Percentile calculation

John Cohen

unread,

Jan 22, 2010, 8:25:44 PM1/22/10

to project-voldemort

Hi All,

I would like to calculate percentiles (very interested in 99.9). In
order to do that I'm thinking to add a "int[] latencies" inside
"Accumulator" class (only for PUT operations). Where index of this
array "latencies" is the "totalTimeNS" value that represents the time
in NS that a put operation took. (from StatTrackingStore ) The "value"
will be a counter that keeps track of the numbers of put's that took
the same amout of time , for example:

if 2 puts took the same amount of time (75 NS) each then the array
will have
latencies[75] = 2

I'll have to keep the latencies otherwise when the timeout (5 minutes)
arrays the Accumulator will be recreated.
I'll have to passed in the same way "total" does as in

<code>
private static class Accumulator {
...
...
public Accumulator newWithTotal() {
return new Accumulator(System.currentTimeMillis(), 0, 0,
total, latencies);
// return new Accumulator(System.currentTimeMillis(), 0,
0, total);
}

</code>

Any ideas? Is this the right approach?

Thanks you all,
John

Jay Kreps

unread,

Jan 25, 2010, 11:54:39 PM1/25/10

to project-...@googlegroups.com

Hey John,

This would be a very welcome feature. Your approach of storing the
histogram is good but storing an entry for each possible number of
nanoseconds up to a second would mean an array of 1,000,000,000 ints
which would require 3.8 GB of heap space per histogram! But that is
easily fixed, you just need to make the bin size configurable (default
of 1ms should be fine) and make the maximum configurable so that all
requests outside the range just go to a single bucket (say the last
bucket). Then to calculate the correct bin, rather than doing
bins[time] you would do bins[Math.round(time/binsize_in_ns)]. I doubt
anyone cares to be more accurate than 1ms, so that would be a
reasonable default bin size and would lead to a reasonably sized
array. The arrays could be reused to avoid continuously reallocating
them. We should definitely do it for all the request types GET, PUT,
GET_ALL, DELETE.

A slick implementation would correctly zero out any insignificant
digits (i.e. less than the precision of 1ms) and return positive
infinity if the quantile is found to be in the last bucket (the
overflow bucket).

-Jay

> --
> You received this message because you are subscribed to the Google Groups "project-voldemort" group.
> To post to this group, send email to project-...@googlegroups.com.
> To unsubscribe from this group, send email to project-voldem...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/project-voldemort?hl=en.
>
>

Martin Traverso

unread,

Jan 26, 2010, 12:42:38 AM1/26/10

to project-...@googlegroups.com

I've used DescriptiveStatistics from Apache Commons Math for this kind of stuff before. It keeps track of the last N values and can compute stats such as max, min, avg, stddev and percentiles over the values in that window: http://commons.apache.org/math/apidocs/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html

If you go with the bin-based approach, I would suggest using variable-sized bins. The larger the value, the larger the bin (you can use log() to achieve this). That way you get the best of both worlds: the ability to discriminate between small values (0.1 ms vs 1 ms is arguably more interesting than 1000.1 ms vs 1001 ms) and much lower memory usage.

Martin

Jay Kreps

unread,

Jan 26, 2010, 1:01:44 AM1/26/10

to project-...@googlegroups.com

Ah, good idea about the variable length bins, one could precompute the
sizes to ensure a minimum of N significant digits which would be
fairly sane, this would allow for a very large upper bound (say 60
seconds instead of 1 second).

-Jay

John Cohen

unread,

Jan 27, 2010, 3:13:25 PM1/27/10

to project-voldemort

the problem that I see using the "Math.log()" approach is the
distribution on the transaction time being accumulated in very small
number of buckets (or cell in the array).
For example, let's say that you have a range from 0ns to 1 sec. (using
the log to calculate the index of the cell) then you will have a max.
of 21 cells

index - [Math.log(timeNS) (i have put only the starting range
number for simplicity)
1 - [2.7..
2 - [7.29
3 - [19.68
4 -[53,14..
5 - [143.48
6 - [387.42
7 - [1046.03
8 - [2824
9 - [7625
10 - [20589
11 - [55590
12 - [150094
13 - [405255
14 - [1094189
15 - [2954312
16 - [7976644
17 - [21536939
18 - [58149737
19 - [157004289
20 - [423911582
21 - [1144561273

When calculating the percentile I'll lose accuracy. For example,
index 13 and 14 they have a big gap I won't know if I'm looking at
transaction that took 0.4 or 1.0 ms.
I think that we can have 2 dimensional array. I'll code this and
expose to you all for review.

Thanks,
John Cohen

Rob Adams

unread,

Jan 27, 2010, 4:50:53 PM1/27/10

to project-...@googlegroups.com

You can use a smaller base in your logarithm, or use a piecewise function.

John Cohen

unread,

Jan 27, 2010, 6:21:31 PM1/27/10

to project-voldemort

I have looked this in different directions, I have considered
performance and simplicity. I would like a second opinion.

This is the easiest approach:

<pre>
public class StatTrackingStore<K, V> extends DelegatingStore<K, V> {
...
...
@JmxGetter(name = "writeLatencyInMs", description = "Write
Latencies for: 50, 90, 99, 99.5 and 99.9 percentile")
public String getWriteLatencyInMs() {
return stats.getLatencyStats(Tracked.PUT);
}
}

public class RequestCounter {

/**
* Used to divide the TimeNS that a transaction took. The result
will be
* used as index into an array that keeps track of the total
number of tx.
* For example a value of:
* 34000 ns represents index=3 (int)(34000/10^4)
* 38000 ns represents index=3 (int)(38000/10^4)
* 375000 ns represents index=37 (int)(375000/10^4)
*
* Basically this will knock down 5 digits from time NS.
* I doubt anyone cares to be more accurate than 1ms, so that
would be a
* reasonable default.
*
*/
private final static int DISTRIBUTION_FACTOR = 100000;

/**
* Normalize the index back to millis.
*/
private final static float NORMILIZE_TO_MILLIS =
DISTRIBUTION_FACTOR / 1000000.0F;

public void addRequest(long timeNS) {

int index = (int) (timeNS / Accumulator.MAX_ARRAY);

for(int i = 0; i < 3; i++) {
Accumulator oldv = getValidAccumulator();

long startTimeMS = oldv.startTimeMS;
long count = oldv.count + 1;
long totalTimeNS = oldv.totalTimeNS + timeNS;
long total = oldv.total + 1;
int[] latencies = oldv.latencies;

// Updates the counter that keeps track of the number of
// transactions that took the same amount of time.
//
//
// If the index is bigger than Accumulator.MAX_ARRAY it
means that the
// transaction took such a big time that can be considered
a rare case,
// For example, let's consider:
//
// timeNS = 10.000.000.000 (10 seconds)
// index = timeNS / Accumulator.MAX_ARRAY
// index = 10.000.000.000 / 100.000
// index = 100.000
if(index < Accumulator.MAX_ARRAY) {
latencies[index] = getSafeNextCounter(latencies
[index]);
}else{
// Put this value in the last bucket.
latencies[Accumulator.MAX_ARRAY] = getSafeNextCounter
(latencies[Accumulator.MAX_ARRAY]);
}
//System.out.println("timeNS:" + timeNS + ", latencies["
+index + "]:" + latencies[index]);

if(values.compareAndSet(oldv, new Accumulator(startTimeMS,
count, totalTimeNS, total, latencies))) {
return;
}
}
}

private int getSafeNextCounter(int value) {
// reset the counter if maximum value is reached, unlikely to
// happen...
int nextValue = value+1;
if(nextValue < 2147483640)
return nextValue;
else
return 0;
}

private int findIndexByPercentage(final float percentage, final
long totalTx, int[] latencies) {
float countTx = (totalTx * percentage);
long found = 0;
for(int index = 0; index < Accumulator.MAX_ARRAY; index++) {
found = latencies[index] + found;
if(found > countTx) {
return index;
}
}

return 0;
}

public String getLatency() {
int[] latencies = getValidAccumulator().latencies;
long totalTx = 0;
for(int val: latencies) {
totalTx = val + totalTx;
}

int index50 = findIndexByPercentage(0.50F, totalTx,
latencies);
int index90 = findIndexByPercentage(0.90F, totalTx,
latencies);
int index99 = findIndexByPercentage(0.99F, totalTx,
latencies);
int index995 = findIndexByPercentage(0.995F, totalTx,
latencies);
int index999 = findIndexByPercentage(0.999F, totalTx,
latencies);

// Calculate the percentiles in Millis.
float p50Ms = index50 * NORMILIZE_TO_MILLIS;
float p90MS = index90 * NORMILIZE_TO_MILLIS;
float p99MS = index99 * NORMILIZE_TO_MILLIS;
float p995MS = index995 * NORMILIZE_TO_MILLIS;
float p999MS = index999 * NORMILIZE_TO_MILLIS;

return "Latency: 50=" + String.format("%.2f", p50Ms) +
", 90=" + String.format("%.2f", p90MS) +
", 99=" + String.format("%.2f", p99MS) +
", 99.5=" + String.format("%.2f", p995MS) +
", 99.9=" + String.format("%.2f", p999MS);

}
...
...
...
}

private static class Accumulator {
public final static int MAX_ARRAY = 100000;
private final static int MAX_ARRAY_PLUS_ONE = MAX_ARRAY + 1;
final int[] latencies;

public Accumulator() {
this(System.currentTimeMillis(), 0, 0, 0, new int
[MAX_ARRAY_PLUS_ONE]);
}

...
...

</pre>

Welcome your comments,
Thanks,
John Cohen

Jay Kreps

unread,

Jan 28, 2010, 11:07:02 PM1/28/10

to project-...@googlegroups.com

Hey John,

Looks great. Could you send a patch, so we could apply it and play
with it a bit.

-Jay

John Cohen

unread,

Jan 28, 2010, 11:15:05 PM1/28/10

to project-...@googlegroups.com, jay....@gmail.com

It would be an honor to contribute to this project.
Just let me know how to build a patch, I'm working with SVN (svn diff ?),

John Cohen

unread,

Mar 11, 2010, 11:28:52 AM3/11/10

to project-voldemort

Hi Jay,

I have done modification in code so RequestCounter so it can send a
JMX notification when this instance is about the reset its stats.
counters.
This is a different approach but complementary of the current source
code. I like the notification based approach because an external
program simply will be listening for notifications event, instead of
the standalone program poll in a fix intervals. Also the notification
is accurate in this way because if the standalone program polls every
1 minutes I could be that the counters got reset in between this
minute so all the stats counters won't reflect what happened during
this minute:

"|" start and end of the 5 min. interval
"^" when the standalone program poll

|-----+-----+-----+-----+-----|------+
...............................^....^

On Jan 28, 11:07 pm, Jay Kreps <jay.kr...@gmail.com> wrote:
> HeyJohn,

>
> Looks great. Could you send a patch, so we could apply it and play
> with it a bit.
>
> -Jay
>

Bruce Ritchie

unread,

Mar 11, 2010, 1:37:52 PM3/11/10

to project-voldemort

Note that this patch might conflict with the change for making stats
be sliding window based (http://code.google.com/p/project-voldemort/
issues/detail?id=221). If they both look good we should likely roll
them up together when applying to trunk

Bruce

On Jan 28, 11:07 pm, Jay Kreps <jay.kr...@gmail.com> wrote:

John Cohen

unread,

Mar 11, 2010, 1:54:20 PM3/11/10

to project-...@googlegroups.com

Hi,

I wonder if the windows approach is really need it. If you have a notification based mechanism you don't have to keep a window at all.
You are getting the information out of Voldemort anyway using a jmx client application, the idea of the window is to "smooth" out the data. why to go throw the calculation of keeping a sliding window?
With a notification based approach you receive the information as soon as its about the be reset it.

Bruce Ritchie

unread,

Mar 11, 2010, 4:53:28 PM3/11/10

to project-voldemort

Jorhn,

It really depends on your use case. If your client/monitoring solution
is always connected to the voldemort server via jmx then the
notification may be sufficient. However, if your client only connects
occasionally then having stats that actually reflect a given period of
time (i.e. a sliding window) is actually very useful. In our case our
application only connects to show jmx stats to administrators on
demand. Given that usage scenario the current mechanism would actually
be harmful for us to display to administrators simply because it's
sure to elicit bug reports.

There are downsides to the sliding window approach though ... the main
one being that you have a choice between memory usage vs how many
operations you track.

Regards,

Bruce Ritchie

On Mar 11, 1:54 pm, John Cohen <john.java.w...@gmail.com> wrote:
> Hi,
>
> I wonder if the windows approach is really need it. If you have a
> notification based mechanism you don't have to keep a window at all.
> You are getting the information out of Voldemort anyway using a jmx client
> application, the idea of the window is to "smooth" out the data. why to go
> throw the calculation of keeping a sliding window?
> With a notification based approach you receive the information as soon as
> its about the be reset it.
>

> On Thu, Mar 11, 2010 at 1:37 PM, Bruce Ritchie <bruce.ritc...@gmail.com>wrote:
>
>
>
> > Note that this patch might conflict with the change for making stats
> > be sliding window based (http://code.google.com/p/project-voldemort/

> > issues/detail?id=221<http://code.google.com/p/project-voldemort/%0Aissues/detail?id=221>).

> > project-voldem...@googlegroups.com<project-voldemort%2Bunsubscr i...@googlegroups.com>

> > .
> > > > For more options, visit this group athttp://
> > groups.google.com/group/project-voldemort?hl=en.
>
> > --
> > You received this message because you are subscribed to the Google Groups
> > "project-voldemort" group.
> > To post to this group, send email to project-...@googlegroups.com.
> > To unsubscribe from this group, send email to

> > project-voldem...@googlegroups.com<project-voldemort%2Bunsubscr i...@googlegroups.com>

John Cohen

unread,

Mar 15, 2010, 9:15:47 AM3/15/10

to project-...@googlegroups.com

Hi Bruce,

The JMX client application is always running (also it's capable of handling reconnection automatically)
The main purposes of this jmx-client application is to log the stats into a flat-file (i.e. log4j rolling appenders).
The programs that crush the number (for different purposes), consume the information found in these logs.
All the "moving" parts have a separated and unique concern from the point of view of the design.
I have also put a "watchdog" program checking for jmx-client, in case it stops working, jmx-client will be brought back to a running state.

-john

To unsubscribe from this group, send email to project-voldem...@googlegroups.com.

Reply all

Reply to author

Forward