Two things you need to be aware of:
(1) As soon as the compaction branch is merged, my next big code
change will be to completely rewrite the read path to be fully
asynchronous / non-blocking, and work in a streaming fashion such that
the TSD will not have to hold all the data points of a query in
memory.
(2) As a consequence of (1), all the algorithms used in the
read-path have to be single-pass algorithms. In other words, they get
to look at each data point once and can only afford modest amounts of
extra memory storage.
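
To make "single-pass with modest extra memory" concrete, here is a
minimal sketch (Python for brevity, not OpenTSDB code) of an aggregator
that sees each data point once and keeps only three numbers of state.
The class name RunningStats is made up for illustration, and the update
rule is Welford's method for running mean and variance.

```python
class RunningStats:
    """Running mean/variance over a stream, O(1) memory (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, value):
        # Each data point is looked at exactly once and then discarded.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for v in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(v)
```

Sums, min/max and averages fit this shape easily. Exact percentiles do
not, because they depend on the rank of every point.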
To compute a percentile, the naive approach is, as you said, to sort
all the data points and then find the percentile from there. This
approach cannot be used in OpenTSDB for the two reasons above. The
second naive approach is to create a number of buckets and keep counts
of how many data points fall in each bucket. This approach only works
if you have a pretty good idea of what the distribution looks like
before you start, which isn't the case in OpenTSDB since the data is
arbitrary.
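
The bucket approach can be sketched as follows, assuming the bucket
boundaries are chosen up front. This is an illustration in Python, not
OpenTSDB code, and BucketPercentile and its bounds are hypothetical
names. It shows both why the method is single-pass and why it breaks
down when the data doesn't match the boundaries you guessed.

```python
from bisect import bisect_left


class BucketPercentile:
    """Percentile estimate from counts in fixed, pre-chosen buckets."""

    def __init__(self, upper_bounds):
        # upper_bounds must be sorted ascending; one extra overflow
        # bucket catches every value above the last bound.
        self.upper_bounds = list(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0

    def add(self, value):
        # bisect_left puts a value equal to a bound into that bound's bucket.
        self.counts[bisect_left(self.upper_bounds, value)] += 1
        self.total += 1

    def percentile(self, p):
        """Upper bound of the bucket holding the p-th percentile."""
        if self.total == 0:
            raise ValueError("no data points")
        rank = p * self.total / 100.0
        seen = 0
        for idx, count in enumerate(self.counts):
            seen += count
            if count > 0 and seen >= rank:
                if idx < len(self.upper_bounds):
                    return self.upper_bounds[idx]
                break
        return float("inf")  # the answer fell into the overflow bucket


good = BucketPercentile([10, 20, 50, 100])
for v in range(1, 101):
    good.add(v)

bad = BucketPercentile([10, 20, 50, 100])
for v in range(1000, 1101):  # data far outside the guessed range
    bad.add(v)
```

With data in 1..100 the estimate is only as fine as the bucket width.
With data in 1000..1100 every point lands in the overflow bucket and
the estimate carries no information at all.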
I too want to have percentile functions, but they need to be
implemented using state-of-the-art streaming percentile methods. A
good starting point is to read "Quantiles on Streams" by Chiranjeeb
Buragohain and Subhash Suri, as I believe it does a good job of
summarizing the state of the art.
http://www.cs.ucsb.edu/~suri/psdir/ency.pdf
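
To give a feel for what a streaming quantile estimator looks like, here
is a minimal sketch based on reservoir sampling. This is one simple
technique picked for illustration, not necessarily what the paper
recommends and not what OpenTSDB does. ReservoirQuantile, capacity and
seed are names made up for the example.

```python
import random


class ReservoirQuantile:
    """Approximate quantiles from a fixed-size uniform sample of a stream."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, value):
        # Reservoir sampling: after n points, each one has had the same
        # capacity/n chance of being in the sample.
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(value)
        else:
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = value

    def quantile(self, q):
        """Estimate the q-quantile (0 <= q <= 1) from the sample."""
        if not self.sample:
            raise ValueError("no data points")
        ordered = sorted(self.sample)  # sorts only the bounded sample
        idx = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[idx]


est = ReservoirQuantile(capacity=1000, seed=42)
for v in range(100000):
    est.add(v)
approx_median = est.quantile(0.5)
```

Memory is bounded by the capacity no matter how long the stream is, but
the answer is only probabilistic. The deterministic summaries surveyed
in the paper give explicit error bounds instead.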
> Also, one of the other messages in the mailing list seemed to indicate
> that out of order data arrival may be something OpenTSDB may not
> support in the future. That is to say, doing:
>
> put <the-metric> <2011 timestamp> <value>
> put <the-metric> <2010 timestamp> <other-value>
>
> would be invalid at some near point in OpenTSDB's life due to planned
It's already "invalid" in the sense that if you send both data points
to the same TSD, the 2nd one will be rejected. If you send them to two
different TSDs, it'll work, simply because the TSDs don't talk to one
another (to keep things simple), and because technically right now you
can store out-of-order data if you're somewhat careful and you know
what you're doing. What I mean by that is that there are a couple of
things to be aware of that can cause data problems when you start
storing data out of order. Most problems can be fixed automatically by
the "fsck" command; others might require you to go and perform some
manual surgery on the data table.
> future optimizations. ( The discussion that led me to believe this
> may be the case spoke of data-ordering; if I infer the point
> incorrectly, I apologize ) Is this still the case?
Yes.
--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com