Time series, best practise for queries between two points in time?

76 views
Skip to first unread message

Timo Pulkkinen

unread,
Sep 16, 2015, 10:26:24 AM9/16/15
to OrientDB
Hello,

we have created a graph (or a tree) for time series data as described in this article (  https://github.com/orientechnologies/orientdb-docs/blob/master/Time-series-use-case.md ). The actual tree is as follows:

Year -> month (map) -> Month -> day (map) -> Day -> hour (map) -> Hour -> minute (map) -> Minute -> second (map) -> Second -> millisecond (map) -> Millisecond -> Document 

Document contains also the timestamp as a property alongside the actual data values.

Now we are wondering if this tree is actually suitable for one our use cases that require a query which "returns all available documents between timestamp1 and timestamp2  where the timestamps are epoch timestamps in milliseconds and can span several years"? And if it does, what would be the most performant query these documents from the tree?

br,
Timo

Timo Pulkkinen

unread,
Sep 16, 2015, 4:42:19 PM9/16/15
to orient-...@googlegroups.com
In other words, should we have links between the adjacent time units or documents ("next") to be able to traverse the graph, or is the "time span" query somehow doable otherwise? This most certainly is a beginner (FAQ) level question, but didn't find any exact question or slideset / doc that covers this use case?

br,
Timo

Curtis Stanford

unread,
Sep 16, 2015, 6:40:09 PM9/16/15
to OrientDB
What I've done is this:

1. Find the vertex for the start date (or closest one after it).
2. Increment your finest time unit (in my case seconds, but it may be days) until you find a match in the map.
3. When you get past the maximum value for that unit (59 in the case of seconds), move up to the parent vertex and repeat.
4. When you get a match in a parent vertex, you have to move down again to the children and iterate through those.

I just have to keep track of the current vertex for each part of the time (i.e. year, month, day, ...) so I can move up to coarser time units when needed.

Probably not very clear but it's a fairly complicated algorithm. However, it works very quickly.

Timo Pulkkinen

unread,
Sep 17, 2015, 2:33:59 PM9/17/15
to orient-...@googlegroups.com
Thanks Curtis,

we were thinking something similar, but then thought that we could simplify the search algorithm by following additions (using documents because we a currently using Document API):

"find all documents between timestamp_1 and timestamp_2"

1. Add a link between adjacent documents ( i.e. add a property 'next' to a document => points to the next document available in time series) 
(2. Keep track of the timestamp list tail somewhere, so we can always quickly determine the latest data point)
3. Create an index of the timestamps in the documents
4. Find the closest document to timestamp_1 using the index (that is >= timestamp_1)
5. Traverse the links until we reach the closest document matching timestamp_2 (that is =< timestamp_2)

Here the obvious problem is the potential size of the index, but in the other hand it is only accessed to find the starting node. 
And of course in this case the actual time series tree would only be used for possible aggregate calculations.

The other idea was to create an additional hierarchy, where the documents are linked to an "epoch day" vertex, that represents days from Unix epoch starting date (one instance per a day after 01.01.1970). Then we could partition the search by first calculating the "epoch day" of timestamp_1 and search the best matching timestamp linked to it. This way we would have continuous "epoch day" variable instead of repetitive and inconstant time units and for example queries spanning multiple years would be easier to do. Of course for aggregates the usual "time tree" would be available besides this. 

What do you think?

Timo


Curtis Stanford

unread,
Sep 17, 2015, 6:41:36 PM9/17/15
to OrientDB
For your first solution, I don't see the point in using the time series structure at all. You're just using an indexed timestamp field and next pointers to traverse the range. The solution I outlined doesn't need any index or next links. Finding the first date is very fast and, no matter how big the range, I get data coming back immediately. I don't know for sure, but I imagine those next links would be a bit of a maintenance hassle.

Your "epoch day" idea seems to be a coarser version of the original time series structure. The time series tree (with year, month, day, hour, etc.) is basically the ultimate in partitioning and sub-paritiioning. Once you figure out the general querying algorithm, the data structures themselves are simple and fairly easy to maintain and you can do any kind of aggregation or query for any time range at any granularity.

Of course, this day or month partitioning is mandatory for someone using a key-value store or something like Cassandra with a partition key but we can do better, no?
Reply all
Reply to author
Forward
0 new messages