Loading and querying DBs with custom partitioning - partbyattr


JerLucid

Oct 31, 2024, 7:35:44 AM
to Data Intellect kdb+/TorQ
In order to seamlessly load and query intraday DBs with custom partitioning, like
those created using the partbyattr configuration, we really need Kx to make some fundamental changes at the core level, right?

Is that something which has ever been discussed with Kx? Is it possible?

I think many users like having the ability to sort data by multiple columns,
and the simple upsert at EOD is very efficient and fast. But it always seems
like an unfinished option without having the ability to easily query the 
partitioned data intraday. Having the native ability to load and query
could open up a world of possibilities. Just wondering what the general
thoughts are on this? 

Joe Baines

Oct 31, 2024, 11:19:25 AM
to Data Intellect kdb+/TorQ
Hi JerLucid, 

In the most recent release of TorQ (v5.2.0), a new intraday database (IDB) process has been added which facilitates querying intraday data. This update also includes a new WDB writedown method, "partbyenum". The new method has the same EOD sorting benefits as partbyattr, but is structured in a manner that kdb+ can natively load and query the intraday saves with ease. A blog post with more information on both the IDB and the partbyenum method can be found below:


If you have any further questions on either of the above, please feel free to reach out and we can provide further info.
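For anyone wanting to see what "natively load and query" buys you in practice, the sketch below assumes a hypothetical IDB root that has been written as a standard int-partitioned database (the path and table name are illustrative only, not the actual TorQ layout):

```q
/ mount the IDB root; kdb+ discovers the int partitions automatically
system"l /data/idb/db"        / hypothetical IDB root

/ .Q.pf and .Q.pv show the partition field and the values found on disk
.Q.pf                         / `int
.Q.pv                         / e.g. 0 1 2 3

/ a filter on the virtual int column prunes whole partitions
/ before any column data is read from disk
select from trade where int=0 / hypothetical table name
```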

Kind Regards, 

Joe

Jeremy Lucid

Nov 1, 2024, 4:09:52 AM
to Joe Baines, Data Intellect kdb+/TorQ
Thanks for getting back to me, Joe. Yes, I've read the blog you linked to, and I'm very happy to see those new additions!

Actually, I specifically mentioned partbyattr because that is the one designed to give users the most flexibility, in that you can partition/sort by multiple columns. The issue is that kdb+ cannot load that folder structure natively, so we can't query it easily. It's a pity, because the WDB has done the hard work of partitioning the data, and it could greatly enhance query speed if kdb+ had a way to intelligently access the correct table based on the qSQL and folder structure.

In the past, I wrote gateway functions that allowed you to query the custom-partitioned DB, but in a limited way. I had a process which loaded one table in a given directory path, and switched paths depending on the query arguments. Not a very scalable approach, but it worked OK for ad hoc queries. But really what we need is a way for kdb+ to load these kinds of custom partition folder structures.
Skipping data we don't need to read is key to performance.
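For anyone curious, that gateway workaround looked roughly like the following; every name and the directory layout here are hypothetical, not the actual structure:

```q
/ map query arguments to the directory the WDB wrote, then load the
/ single splayed table found there - one table per directory path
root:`:/data/customdb                          / hypothetical root

/ build the path for a (date;sym) pair,
/ e.g. /data/customdb/2024.10.31/AAPL/trade
path:{[d;s] ` sv root,(`$string d),s,`trade}

/ load and filter; each call maps only the one relevant directory,
/ skipping everything else on disk
getTrades:{[d;s;c] ?[get path[d;s];c;0b;()]}

/ usage: trades for one sym/date with a price filter
/ getTrades[2024.10.31;`AAPL;enlist(>;`price;100f)]
```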

I was wondering if there were ever any discussions with Kx on adding this ability. 



Jeremy Lucid

Nov 1, 2024, 6:51:33 AM
to Joe Baines, Data Intellect kdb+/TorQ
Often there are cases where partitioning by int alone (the int representing a symbol) is not sufficient, and you need at least one more level of partitioning to access the data efficiently.

I think two levels of int partitioning alone would be amazing to have: intx and inty.
It's easy for the writers to do, but it would need work on the Kx side to load and use.
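To illustrate how little is needed on the write side, here is a sketch that persists splayed tables under a two-level intx/inty tree (paths and names are hypothetical; the missing piece is purely that kdb+ cannot mount and query the resulting layout natively):

```q
/ write tab to root/intx/inty/t/ - plain splayed writes suffice,
/ but kdb+ has no native way to load and query this tree as one DB
root:`:/data/custom
t:([]sym:`a`a`b;price:1 2 3f)

writepart:{[ix;iy;tab]
  dir:hsym`$"/data/custom/",string[ix],"/",string[iy],"/t/";
  dir set .Q.en[root] tab}

writepart[7;2024;t]   / -> /data/custom/7/2024/t/
```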

Edward Richards

Feb 5, 2025, 7:32:05 AM
to Data Intellect kdb+/TorQ
Bumping an old thread, but I agree it would be great for kdb+ to natively support this. Partitioned Parquet datasets support multiple levels of partitioning, e.g. .../sym=x/year=2025/month=1/<files.parquet>

As well as improved query performance, this would help with storing data on object storage (e.g. S3). Given the lack of append functionality, you need to re-write entire partitions should you wish to backfill historic data, which can be tricky with a sizeable dataset. With a more granular partition structure, you're less likely to need to re-persist so much data.

Although I realise raising this on a TorQ forum may have limited utility...

JerLucid

Feb 10, 2025, 5:59:17 AM
to Data Intellect kdb+/TorQ
Thanks for the response Edward. You are correct to highlight that there would also be benefits on the S3 storage side.

It does feel like kdb+ is falling behind other DBs in terms of this flexibility to utilise multiple levels of partitioning.
The issue is made more prominent when users of TorQ use partbyattr, where they see immediately that q can easily
create multiple partitions using the WDB, but the data then cannot be queried effectively.

I raised the question here of whether Data Intellect had put this to Kx, as a prompt to encourage them to do so if not.
Such a feature request would carry more weight coming from Data Intellect.

Peter Murphy

Feb 20, 2025, 9:14:39 AM
to Data Intellect kdb+/TorQ
Hi Jeremy, Edward,

Thanks for outlining the use cases for multiple partitioning levels. DI will flag the request with KX.

In the meantime, regarding the partbyattr/partbyenum disparity: partbyenum can theoretically work with multiple p# columns, like partbyattr can - the int partitioning would just have to be done on the most granular p# column. This is not the way it is currently implemented, though, and TorQ will raise an error if it detects multiple p# sort columns in sort.csv.

We could consider lifting this restriction, but the p# columns that are not used for int partitioning may be slow to filter/group, e.g. if tab has p# on columns par1 and par2, and is int partitioned by par2, then these queries

select from tab where par1=`foo
select max val by par1 from tab

will be slow, and would need to be restructured like

select from tab where int in exec int from (select first par1 by int from tab) where par1=`foo
select max val by par1 from select max val, first par1 by int from tab

which obviously isn't seamless (i.e. in the grouping example we have to implement our own map-reduce operation rather than relying on the built-in version).

Instead, we could consider int partitioning by the first p# column and applying p# to the next (and potentially g# to any others). This would make filtering or grouping fast on all attribute columns, but then the disk partitions would be larger and we would need to re-sort and re-apply p# after every WDB write. Ultimately, re-sorting is not practical, and even using all g# is not practical, since all attributes are lost when upserting to a splayed table on disk.
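To make the tradeoff concrete: g# can be applied to an unsorted column, but p# requires the column to already be grouped on disk, so re-applying it after every WDB append implies a re-sort. A sketch of the on-disk amend idiom (paths hypothetical):

```q
/ standard on-disk attribute application to a splayed table:
/ @[dir;column;op] amends the column file in place
@[`:/data/idb/0/t;`par1;`p#]   / only valid if par1 is already grouped/sorted
@[`:/data/idb/0/t;`par2;`g#]   / no sort needed, but attributes do not
                               / survive a plain upsert to the splay
```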

For these reasons there are no plans to expand the current functionality in TorQ.

The "canonical" way to do arbitrary partitioning in kdb would be anymap, but even this approach would suffer the same problem in that it doesn’t fit naturally with the map-reduce constructs in the usual select statement, e.g.

// simple table example
select sum size by lp, sym from quotes where date=2019.01.01, sym in `c`d

// anymap table example
raze exec {[tab;s] select sum size by lp, sym:s from tab}'[prices;sym] from select from quotesnested where date=2019.01.01, sym in `c`d

Any response from KX will go in this thread.

Kind regards
Peter