Partition table by hours?

369 views

Skip to first unread message

Gary Chow

unread,

Mar 21, 2018, 5:44:59 AM3/21/18

to Kdb+ Personal Developers

Hi,

I understand that for partitioned table, the partition domain can be day, month etc, or an integer value. Now, I have a tick data database and it is huge. I would like to partition the table by something smaller than a day. For example, by hour. Is that possible to do that in KDB?

The only thing that I can think of is convert the datetime to nanosecond, create a column that is (nanosecond / 1e9*3600), and whenever I need a query, I will match this column first. Is that the "correct" way to achieve what I wanna do, or is there some other better method?

Thanks

Gary

Sean O'Hagan

unread,

Mar 21, 2018, 11:07:12 AM3/21/18

to Kdb+ Personal Developers

Yes that's possible, just make sure you don't partition by 'hour' otherwise you'll end up with 24 partitions with multiple days in each

You want to partition by date+hour (i.e. another level of granularity on top of date) which you have correctly pointed out in your post.

update dth:`long$t-t mod 60*60*1e9 from tab // would create a long column from the date and hour which you could use as the partition domain, but will have redundant 0s in the domain's value. It does however have certain advantages

update dth:`long$t div 60*60*1e9 from tab // would be a nicer domain, in that each subsequent hour is an increment of 1, and it just looks smaller/easier to work with

This 'hourly' method is entirely feasible for storing the data, but there are some caveats when you come to query the data - like you said the partition column must be matched first for efficient querying and complexity ...

where date=date is easy to understand

where int=randomNumber is slightly more complex

See here for some more info http://code.kx.com/wiki/JB:KdbplusForMortals/partitioned_tables#1.3.7_Examples_of_Partition_Domains

If everything is placed behind an api, then I guess you should have no problems - you have control of how your clients will query, so you can always ensure the partition column is filtered on first and matched/used correctly

Below is a working example of how it can work

// test tab, 24 rows across 2 days, between 9 and 11am, every 10 mins

n:24; tab:([]s:n?`a`b`c;t:raze (.z.d-til 2)+\:raze 09:00 10:00+\:10*til 6;v:n?100);

// auto partition

\d .Q

k)dpfgnt:{[d;p;f;g;n;t]if[~&/qm'r:+en[d]t;'`unmappable];{[d;g;t;i;x]@[d;x;g;t[x]i]}[d:par[d;p;n];g;r;<r f]'!r;@[;f;`p#]@[d;`.d;:;f,r@&~f=r:!r];n};

k)dcfgnt:{[d;c;f;g;n;t]*p dpfgnt[d;;f;g;n]'?[t;;0b;()]',:'(=;c;)'p:?[;();();c]?[t;();1b;(,c)!,c]};

\d .

// write it out and add datetime-hour(dth) column (note - this column will also be written to disk, but its not needed as it will be the partition domain)

// choosing this method to show how you can use '=' directly

.Q.dcfgnt[`:db;`dth;`s;,;`tab] update dth:`long$t-t mod `long$60*60*1e9 from tab;

\l db

// eyeball

// whats nice is you can use '=' here with int=hour as '=' doesn't compare types

q)select from tab where int=2018.03.20D09

int s t v dth

------------------------------------------------------------------------

574851600000000000 a 2018.03.20D09:00:00.000000000 38 574851600000000000

574851600000000000 b 2018.03.20D09:40:00.000000000 68 574851600000000000

574851600000000000 b 2018.03.20D09:50:00.000000000 45 574851600000000000

574851600000000000 c 2018.03.20D09:10:00.000000000 97 574851600000000000

574851600000000000 c 2018.03.20D09:20:00.000000000 88 574851600000000000

574851600000000000 c 2018.03.20D09:30:00.000000000 58 574851600000000000

// check partition counts (should be 6 in each partition)

(`timestamp$value .Q.pf)!.Q.cn tab

2018.03.20D09:00:00.000000000| 6

2018.03.20D10:00:00.000000000| 6

2018.03.21D09:00:00.000000000| 6

2018.03.21D10:00:00.000000000| 6

// check upserts work as expected

// get partition for 9-10am today

`long$.z.d+09:00

574938000000000000

// upsert row for random time between 9 and 10am

// total count should go up by 1

`:./574938000000000000/tab/ upsert (`:sym?`a;first .z.d+09:00+1?60;100;574938000000000000)

`:./574938000000000000/tab/

\l .

// now 25 rows

count t

// eyeball

select from tab where int=.z.d+09:00

int s t v dth

-------------------------------------------------------------------------

574938000000000000 a 2018.03.21D09:00:00.000000000 12 574938000000000000

574938000000000000 a 2018.03.21D09:50:00.000000000 90 574938000000000000

574938000000000000 b 2018.03.21D09:10:00.000000000 10 574938000000000000

574938000000000000 b 2018.03.21D09:30:00.000000000 90 574938000000000000

574938000000000000 c 2018.03.21D09:20:00.000000000 1 574938000000000000

574938000000000000 c 2018.03.21D09:40:00.000000000 73 574938000000000000

574938000000000000 a 2018.03.21D09:33:00.000000000 100 574938000000000000

// unfortunately 'in' doesn't work the same though as it checks types of the RHS, so you'd have to do something slightly different in your queries/apis

int=.z.d+09:00

0010b

int in .z.d+09:00

0010b

int in .z.d+09:00 10:00

'type

select from tab where int in `long$.z.d+09:00 10:00

...data...

// as for the second approach (i.e. remove the minutes, seconds and nanos from the domain), it will work fine too, as long as you know how to get to and fro domain -> timestamp

`long$2018.03.21D09

574938000000000000

`long$2018.03.21D09 div `long$60*60*1e9

159705 // this would become the domain for 2018.03.21D09 and 159706 would be the domain for 2018.03.21D10, and so forth

// with this method, queries would become

f:{`long$x div 60*60*1e9}

select from tab where int=f 2018.03.21D09

select from tab where int in f 2018.03.21D09 2018.03.21D10

Reply all

Reply to author

Forward

0 new messages