Pre-filtering by range on sorted field

17 views
Skip to first unread message

Ken Seehart

unread,
May 15, 2019, 7:45:18 PM5/15/19
to bcolz
I've got on the order of 1GB of data, that I want to filter by timestamp and an arbitrary expression. I need latency of less than 100ms. I can preprocess the data (e.g build indexes) to improve latency.

The timestamp field is known to be increasing.

Currently I'm simply combining the timestamp range condition with the expression and calling table.where(expr). That works but it's order(N) in the size of the data even if the timestamp range is small. I need a solution where the O(N) component is negligible, so the latency should be roughly proportional to the timestamp range, rather than the size of the data.

I need a way to tell bcolz to skip entire blocks or chunks for which timestamp is outside the specified timestamp range.

To facilitate this, I can preprocess the data by building a block index on the timestamp. If I go this route, I would need to know how to get bcolz to only search a specified range of blocks. 

Is there any way to get at the block processing loop? Using iterblocks or whereblocks doesn't seem to do the trick because data gets loaded into memory for every block, whether I want them or not. Unfortunately, the skip argument is applied to the output sequence rather than the data.

If there isn't a way to fast forward like this, I'm probably better off abandoning bcolz and just use an uncompressed memory mapped file.


Francesc Alted

unread,
May 16, 2019, 8:17:07 AM5/16/19
to Bcolz
While you are right about the fact that the condition needs to be parsed completely, even if it is O(N), the key point is that if you can preprocess the timestamps and get a boolean array on the timestamp range that you are interested in.  Such an array turns out to be highly compressible, and it typically fits in caches of modern CPUs; this fact, combined with the extreme speed of Blosc, makes the selections based on boolean arrays very fast, making this O(N) almost negligible for many cases.

But, talking is cheap, so better have a look at this notebook for a better demonstration:


Missatge de Ken Seehart <kense...@gmail.com> del dia dj., 16 de maig 2019 a les 1:45:
--
You received this message because you are subscribed to the Google Groups "bcolz" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bcolz+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bcolz/79e260eb-a384-4754-bc0e-e498cd3b8957%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Francesc Alted
Message has been deleted

Ken Seehart

unread,
May 16, 2019, 1:38:35 PM5/16/19
to bcolz
Thanks Francesc, that notebook is super helpful!

My performance goal is to minimize latency of a response to a query that specifies an arbitrary timestamp range and an arbitrary boolean expression on an arbitrary subset of the data columns.

Therefore the preprocessing phase against the specific timestamp range requested would would add to the latency, (as such, I wouldn't call it preprocessing). However, I can preprocess to obtain a timestamp bin index of the entire data set. Maybe some ornate scheme involving overlapping timestamp bins of multiple sizes. So an arbitrary timestamp range converts to a simple boolean function of bins.That would be a pre-filter of timestamps, Also, the data tends to be clumped, so I could have clump membership vectors. Many possibilities.

You've given me some new ingredients to work with. In practice some expressions are more popular than others, so the most common could be preprocessed into boolean vectors.

Using numba looks great! I can't use numexpr because it has to many type constraints, so I've been using vm='dask'. But I think numba would be an awesome addition to my tool box.

To unsubscribe from this group and stop receiving emails from it, send an email to bc...@googlegroups.com.


--
Francesc Alted

Ken Seehart

unread,
May 20, 2019, 9:52:35 PM5/20/19
to bcolz
Suddenly I get it (a bit slowly). When you say "preprocess", you mean construct a boolean vector without bcolz, trivially, like ...0000011111000... from the given timestamp array, in log(N) time. Then use that column with bcolz as the filter. Very clever.
Reply all
Reply to author
Forward
0 new messages