I've got on the order of 1 GB of data that I want to filter by timestamp and an arbitrary expression. I need latency under 100 ms. I can preprocess the data (e.g., build indexes) to improve latency.
The timestamp field is known to be increasing.
Currently I'm simply combining the timestamp range condition with the expression and calling table.where(expr). That works, but it's O(N) in the size of the data even when the timestamp range is small. I need a solution where the O(N) component is negligible, so that latency is roughly proportional to the size of the timestamp range rather than the size of the data.
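For concreteness, here's the shape of what I'm doing now, mimicked in plain NumPy (table.where itself is bcolz; the column names and expression here are made up for illustration). The point is that the combined condition is evaluated over all N rows, even when the timestamp range selects only a handful of them:

```python
import numpy as np

ts = np.arange(0, 1_000_000, 7)   # timestamp column (increasing)
value = (ts * 3) % 100            # stand-in for some other column

# Equivalent of table.where("(t_lo <= ts) & (ts < t_hi) & (value > 90)"):
# the boolean mask is computed over ALL N rows, regardless of how
# narrow the [t_lo, t_hi) window is.
mask = (50_000 <= ts) & (ts < 51_000) & (value > 90)
hits = ts[mask]
```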
I need a way to tell bcolz to skip entire blocks or chunks for which timestamp is outside the specified timestamp range.
To facilitate this, I can preprocess the data by building a block index on the timestamp. If I go this route, I'd need to know how to get bcolz to search only a specified range of blocks.
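Here's a minimal sketch of the block index I have in mind, in pure Python/NumPy (the fixed chunklen is a stand-in for bcolz's per-carray chunk length; block_range is my hypothetical helper, not a bcolz API): record the first timestamp of each block at preprocessing time, then binary-search that index at query time to get the range of blocks worth decompressing.

```python
import bisect
import numpy as np

# Preprocessing: timestamps are increasing, split into fixed-size blocks
# (chunklen stands in for bcolz's chunk length).
chunklen = 1000
timestamps = np.arange(0, 1_000_000, 7)

# Block index: the first timestamp of every block.
block_starts = timestamps[::chunklen]

def block_range(t_lo, t_hi):
    """Return (first_block, last_block_exclusive) of the blocks that
    can contain timestamps in [t_lo, t_hi]."""
    first = bisect.bisect_right(block_starts, t_lo) - 1
    last = bisect.bisect_right(block_starts, t_hi)
    return max(first, 0), last

# Query time: O(log N) lookup instead of an O(N) scan.
lo_blk, hi_blk = block_range(50_000, 51_000)
# Only these blocks would need to be decompressed and scanned:
candidate = timestamps[lo_blk * chunklen : hi_blk * chunklen]
```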
Is there any way to get at the block processing loop? Using iterblocks or whereblocks doesn't seem to do the trick, because data gets decompressed into memory for every block whether I want it or not. Unfortunately, the skip argument applies to the output sequence rather than to the underlying blocks.
If there isn't a way to fast-forward like this, I'm probably better off abandoning bcolz and just using an uncompressed memory-mapped file.
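That fallback could be as simple as the following sketch (np.memmap over a raw timestamp column; the file layout and column here are invented for illustration). np.searchsorted does the fast-forwarding I can't get bcolz to do, and only the pages for the matching row range ever get faulted in:

```python
import os
import tempfile
import numpy as np

# Preprocessing: write the timestamp column uncompressed to disk
# (stand-in for my real data).
path = os.path.join(tempfile.mkdtemp(), "timestamps.bin")
np.arange(0, 1_000_000, 7, dtype=np.int64).tofile(path)

# Query time: memory-map the column; nothing is read until touched.
ts = np.memmap(path, dtype=np.int64, mode="r")

# Binary search gives the row range in O(log N); only the pages backing
# ts[lo:hi] are faulted in, so latency tracks the timestamp range,
# not the total data size.
lo, hi = np.searchsorted(ts, [50_000, 51_000])
rows = ts[lo:hi]  # then apply the arbitrary expression to just these rows
```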