> On Mon, Oct 3, 2011 at 8:30 AM, netshade <netsh...@gmail.com> wrote:
> > Currently the aggregators count on a streaming arrival of data from
> > a SpanGroup / DataPoints object, and for the metrics provided
> > (max/min/sum/avg), data ordering isn't important. For the percentile
> > functions I'm aware of, the way aggregators work would have to be
> > fundamentally changed to do a pre-run to generate the range of numbers
> > to sort to ascertain the percentile. The naive way would be to keep
> > the set in memory, which really just seems terrible to me. The less
> > naive way (though complicated) would seem to be to add a secondary
> > storage for metrics as received to bucket them accordingly and then
> > make the percentile calculation at call time by inspecting bucket
> > contents. Is there any work already ongoing here? Before I jump in,
> > would just be curious to know.
> Two things you need to be aware of:
> (1) As soon as the compaction branch is merged, my next big code
> change will be to completely rewrite the read path to be fully
> asynchronous / non-blocking, and work in a streaming fashion such that
> the TSD will not have to hold all the data points of a query in
> memory.
> (2) As a consequence of (1), all the algorithms used in the
> read-path have to be single-pass algorithms. In other words, they get
> to look at each data point once and can only afford modest amounts of
> extra memory storage.
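
To illustrate what "single-pass with modest extra memory" means for the existing aggregators, here is a minimal sketch in Python. The class name `StreamingStats` and its methods are hypothetical and are not part of OpenTSDB; the point is only that max/min/sum/avg need a handful of counters, no matter how many data points stream through.

```python
class StreamingStats:
    """Hypothetical single-pass aggregator: constant memory, one look per point."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def add(self, value):
        # Each data point is seen exactly once and then discarded.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def average(self):
        if self.count == 0:
            raise ValueError("no data points seen")
        return self.total / self.count


stats = StreamingStats()
for point in [3, 1, 4, 1, 5]:
    stats.add(point)
```

An exact percentile has no equivalent constant-size state, which is why it does not fit this model without an approximation.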
> To compute a percentile, the naive approach is, as you said, to sort
> all the data points and then find the percentile from there. This
> approach cannot be used in OpenTSDB for the two reasons above. The
> second naive approach is to create a number of buckets and keep counts
> of how many data points fall in each bucket. This approach only works
> if you have a pretty good idea of what the distribution looks like
> before you start, which isn't the case in OpenTSDB since the data is
> arbitrary.
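
The bucket-counting approach described above can be sketched as follows. This is an illustration only, with a hypothetical class `BucketPercentile` and bucket boundaries chosen by hand; nothing here is OpenTSDB code. It shows why the result is only as good as the boundaries picked in advance.

```python
import bisect
import math


class BucketPercentile:
    """Hypothetical fixed-bucket percentile estimator (counts only, no raw data)."""

    def __init__(self, upper_bounds):
        # Bucket i covers (upper_bounds[i-1], upper_bounds[i]];
        # one extra bucket at the end catches values above every bound.
        self.upper_bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0

    def add(self, value):
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.counts[idx] += 1
        self.total += 1

    def percentile(self, p):
        """Return the upper bound of the bucket holding the p-th percentile."""
        if self.total == 0:
            raise ValueError("no data points seen")
        if not 0 < p <= 100:
            raise ValueError("p must be in (0, 100]")
        rank = math.ceil(p * self.total / 100)
        seen = 0
        for idx, count in enumerate(self.counts):
            seen += count
            if seen >= rank:
                if idx < len(self.upper_bounds):
                    return self.upper_bounds[idx]
                return float("inf")  # fell into the overflow bucket


hist = BucketPercentile([10, 100, 1000])
for value in range(1, 101):
    hist.add(value)
median_estimate = hist.percentile(50)
```

With these boundaries the values 1 through 100 land in only two buckets, so the estimated median is the bucket bound 100 rather than a value near 50. Boundaries that suit one metric's distribution will be useless for another, which is the problem with arbitrary data.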
> I too want to have percentile functions, but they need to be
> implemented using state-of-the-art streaming percentile methods. A
> good starting point is to read "Quantiles on Streams" by Chiranjeeb
> Buragohain and Subhash Suri, as I believe it does a good job of
> summarizing the state of the art:
> http://www.cs.ucsb.edu/~suri/psdir/ency.pdf
> > Also, one of the other messages in the mailing list seemed to indicate
> > that out of order data arrival may be something OpenTSDB may not
> > support in the future. That is to say, doing:
> > put <the-metric> <2011 timestamp> <value>
> > put <the-metric> <2010 timestamp> <other-value>
> > would be invalid at some near point in OpenTSDB's life due to planned
> It's already "invalid" in the sense that if you send both data points
> to the same TSD, the 2nd one will be rejected. If you send them to
> different TSDs, it'll work, simply because the TSDs (to keep things
> simple) don't talk to one another, and because technically
> right now you can store out-of-order data if you're somewhat careful
> and you know what you're doing. What I mean by that is that there are
> a couple of things to be aware of that can cause data problems when you
> start storing data out of order. Most problems can be fixed
> automatically by the "fsck" command, others might require you to go
> and perform some manual surgery on the data table.
> > future optimizations. ( The discussion that led me to believe this
> > may be the case spoke of data-ordering; if I infer the point
> > incorrectly, I apologize ) Is this still the case?
> Yes.
> --
> Benoit "tsuna" Sigoure
> Software Engineer @www.StumbleUpon.com