Partition table by hours?

Skip to first unread message

Gary Chow

Mar 21, 2018, 5:44:59 AM3/21/18
to Kdb+ Personal Developers

I understand that for partitioned table, the partition domain can be day, month etc, or an integer value. Now, I have a tick data database and it is huge. I would like to partition the table by something smaller than a day. For example, by hour. Is that possible to do that in KDB?

The only thing that I can think of is convert the datetime to nanosecond, create a column that is (nanosecond / 1e9*3600), and whenever I need a query, I will match this column first. Is that the "correct" way to achieve what I wanna do, or is there some other better method?



Sean O'Hagan

Mar 21, 2018, 11:07:12 AM3/21/18
to Kdb+ Personal Developers
Yes that's possible, just make sure you don't partition by 'hour' otherwise you'll end up with 24 partitions with multiple days in each
You want to partition by date+hour (i.e. another level of granularity on top of date) which you have correctly pointed out in your post.

update dth:`long$t-t mod 60*60*1e9 from tab // would create a long column from the date and hour which you could use as the partition domain, but will have redundant 0s in the domain's value. It does however have certain advantages
update dth:`long$t div 60*60*1e9 from tab // would be a nicer domain, in that each subsequent hour is an increment of 1, and it just looks smaller/easier to work with

This 'hourly' method is entirely feasible for storing the data, but there are some caveats when you come to query the data - like you said the partition column must be matched first for efficient querying and complexity ... 
 where date=date is easy to understand
 where int=randomNumber is slightly more complex
If everything is placed behind an api, then I guess you should have no problems - you have control of how your clients will query, so you can always ensure the partition column is filtered on first and matched/used correctly

Below is a working example of how it can work

// test tab, 24 rows across 2 days, between 9 and 11am, every 10 mins
n:24; tab:([]s:n?`a`b`c;t:raze (.z.d-til 2)+\:raze 09:00 10:00+\:10*til 6;v:n?100);

// auto partition
\d .Q
k)dpfgnt:{[d;p;f;g;n;t]if[~&/qm'r:+en[d]t;'`unmappable];{[d;g;t;i;x]@[d;x;g;t[x]i]}[d:par[d;p;n];g;r;<r f]'!r;@[;f;`p#]@[d;`.d;:;f,r@&~f=r:!r];n};
k)dcfgnt:{[d;c;f;g;n;t]*p dpfgnt[d;;f;g;n]'?[t;;0b;()]',:'(=;c;)'p:?[;();();c]?[t;();1b;(,c)!,c]};
\d .

// write it out and add datetime-hour(dth) column (note - this column will also be written to disk, but its not needed as it will be the partition domain)
// choosing this method to show how you can use '=' directly
.Q.dcfgnt[`:db;`dth;`s;,;`tab] update dth:`long$t-t mod `long$60*60*1e9 from tab;
\l db

// eyeball
// whats nice is you can use '=' here with int=hour as '=' doesn't compare types
q)select from tab where int=2018.03.20D09
int                s t                             v  dth
574851600000000000 a 2018.03.20D09:00:00.000000000 38 574851600000000000
574851600000000000 b 2018.03.20D09:40:00.000000000 68 574851600000000000
574851600000000000 b 2018.03.20D09:50:00.000000000 45 574851600000000000
574851600000000000 c 2018.03.20D09:10:00.000000000 97 574851600000000000
574851600000000000 c 2018.03.20D09:20:00.000000000 88 574851600000000000
574851600000000000 c 2018.03.20D09:30:00.000000000 58 574851600000000000

// check partition counts (should be 6 in each partition)
(`timestamp$value! tab
2018.03.20D09:00:00.000000000| 6
2018.03.20D10:00:00.000000000| 6
2018.03.21D09:00:00.000000000| 6
2018.03.21D10:00:00.000000000| 6

// check upserts work as expected
// get partition for 9-10am today

// upsert row for random time between 9 and 10am
// total count should go up by 1
`:./574938000000000000/tab/ upsert (`:sym?`a;first .z.d+09:00+1?60;100;574938000000000000)
\l .

// now 25 rows
count t

// eyeball
select from tab where int=.z.d+09:00
int                s t                             v   dth
574938000000000000 a 2018.03.21D09:00:00.000000000 12  574938000000000000
574938000000000000 a 2018.03.21D09:50:00.000000000 90  574938000000000000
574938000000000000 b 2018.03.21D09:10:00.000000000 10  574938000000000000
574938000000000000 b 2018.03.21D09:30:00.000000000 90  574938000000000000
574938000000000000 c 2018.03.21D09:20:00.000000000 1   574938000000000000
574938000000000000 c 2018.03.21D09:40:00.000000000 73  574938000000000000
574938000000000000 a 2018.03.21D09:33:00.000000000 100 574938000000000000

// unfortunately 'in' doesn't work the same though as it checks types of the RHS, so you'd have to do something slightly different in your queries/apis
int in .z.d+09:00
int in .z.d+09:00 10:00
select from tab where int in `long$.z.d+09:00 10:00

// as for the second approach (i.e. remove the minutes, seconds and  nanos from the domain), it will work fine too, as long as you know how to get to and fro domain -> timestamp
`long$2018.03.21D09 div `long$60*60*1e9
159705 // this would become the domain for 2018.03.21D09 and 159706 would be the domain for 2018.03.21D10, and so forth

// with this method, queries would become
f:{`long$x div 60*60*1e9}
select from tab where int=f 2018.03.21D09
select from tab where int in f 2018.03.21D09 2018.03.21D10
Reply all
Reply to author
0 new messages