I just finished a first cut at LatencyUtils (which includes the LatencyStats class). LatencyStats is my attempt at a clean API for a magic, auto-coordinated-omission-correcting thing that can be dropped into the classic (and broken) way most in-process monitoring of latency is done. You can find it at
http://latencyutils.github.io/LatencyUtils/
The problem LatencyStats is meant to solve is the following:
When processes internally measure and collect stats on operation latency, the code usually looks like some variant of the following:
long startTime = System.nanoTime();
doTheThingWeWantToMeasureLatencyFor();
statsBucket.recordLatency(System.nanoTime() - startTime);
There are two big issues with the above pattern:
1. When a large stall occurs during the measured operation, the resulting large latency will be recorded only once, ignoring the fact that other operations are likely pending outside of the timing window. As a result, the stats that come out of the statsBucket will be highly skewed. This is classic coordinated omission.
2. If a stall occurs outside of the timing window, even the single large latency will not be recorded.
While #1 can potentially be corrected (e.g. a stats bucket that uses HdrHistogram can correct for it given a known or estimated expected interval), #2 is harder to deal with: missing the large result altogether leaves us with no indication that there is anything to correct.
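To illustrate the #1-style correction: given an expected interval between operations, HdrHistogram can back-fill the measurements that coordinated omission dropped. A rough sketch (the histogram range and the 1 msec expected interval here are just illustrative assumptions):

import org.HdrHistogram.Histogram;

// Track values from 1 nsec to 1 hour, with 2 significant decimal digits:
Histogram histogram = new Histogram(3600L * 1000 * 1000 * 1000, 2);
long expectedIntervalNsec = 1000000L; // assumed: one operation expected per msec

long startTime = System.nanoTime();
doTheThingWeWantToMeasureLatencyFor();
// When the recorded latency far exceeds the expected interval, this also
// records the latencies that operations stuck behind it would have seen:
histogram.recordValueWithExpectedInterval(System.nanoTime() - startTime, expectedIntervalNsec);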
LatencyStats deals with these two problems simultaneously by employing a pause detector and an interval estimator. It uses both under the hood, such that a LatencyStats object can simply fit right into the code pattern above and produce significantly more accurate/realistic histograms that account for stalls experienced by the system.
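In other words, the drop-in usage looks roughly like this (the method names are from the current cut of the API and may well shift as the code matures):

import org.HdrHistogram.Histogram;
import org.LatencyUtils.LatencyStats;

// Pause detection and interval estimation run under the hood:
LatencyStats myOpStats = new LatencyStats();

long startTime = System.nanoTime();
doTheThingWeWantToMeasureLatencyFor();
myOpStats.recordLatency(System.nanoTime() - startTime);

// Later, e.g. from a reporting thread, pull a histogram that has been
// corrected for any stalls detected during the interval:
Histogram intervalHistogram = myOpStats.getIntervalHistogram();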
The code is fairly young (a week or so), and has not been beaten up much yet, so approach with caution. Feedback and flames are welcome.
Enjoy.