Torq and data deduplication


Gavrila P

Mar 21, 2025, 9:48:34 AM
to Data Intellect kdb+/TorQ
Good day to all,

I’m using TorQ to store candle data received from a server. Occasionally, the server sends duplicate data, and I need a way to handle this. Does TorQ provide built-in support for deduplication, or do I need to implement it manually?

I'm looking for a solution that provides the same capabilities as QuestDB (link to the data deduplication topic in docs).

Thanks a lot.

Joshua Ballantine

Mar 24, 2025, 12:24:00 PM
to Data Intellect kdb+/TorQ

Hi,

No, there is nothing built into TorQ for this.

However, you should be able to achieve the same de-duplication by other methods:
1) Keying the tables in the rdb and upserting instead of inserting in upd. The key must be unique for this to work, e.g. seqNum, or time and sym.
2) Inserting only the rows that are not already in the table, i.e. changing upd to something like {[t;x] t insert x except value t}
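The two options above can be sketched roughly as follows. This is only an illustration under assumptions: the candle schema, the table names and the function names updKeyed and updExcept are hypothetical, time and sym are assumed to form the unique key, and upd is assumed to receive the table name and a table of new rows.

```q
/ Hypothetical candle schema; time and sym are assumed to form the unique key
candleKeyed:([time:`timestamp$();sym:`symbol$()] close:`float$())
candlePlain:([] time:`timestamp$();sym:`symbol$();close:`float$())

/ Method 1: upsert into a keyed table, so a repeated key overwrites the earlier row
updKeyed:{[t;x] t upsert x}

/ Method 2: insert only rows not already present in the unkeyed table;
/ distinct also drops exact duplicates inside the incoming batch itself
updExcept:{[t;x] t insert distinct x except value t}

/ Sample batch whose third row repeats the first
batch:([] time:2025.03.21D09:00 2025.03.21D09:01 2025.03.21D09:00;sym:`abc`abc`abc;close:1.5 1.6 1.5)

updKeyed[`candleKeyed;batch]
updExcept[`candlePlain;batch]
updExcept[`candlePlain;batch]          / replaying the same batch adds nothing
count each (candleKeyed;candlePlain)   / 2 rows in each table
```

Note that method 2 compares whole rows, so a candle resent with the same time and sym but a different value is treated as new data, whereas the keyed upsert in method 1 overwrites the earlier row.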

Be advised that changing upd in this way will be more computationally expensive, particularly as the size of the table in memory grows, as illustrated by the example below (\t timings are in milliseconds):

q)tabUnkeyed:([]time:("p"$.z.d)+0D00:00:00.05*til 15000000;sym:?[15000000;`3];price:15000000?100)
q)tabKeyed:([time:("p"$.z.d)+0D00:00:00.05*til 15000000;sym:?[15000000;`3]]price:15000000?100)
q)new:enlist each([]time:("p"$.z.d)+0D00:00:00.05*til 150;sym:?[150;`3];price:150?100)
q)\t {`tabUnkeyed insert x}each new
0
q)\t {`tabKeyed upsert x}each new
1827

If the de-duplication is not required to be real-time, I would suggest de-duplicating at EOD (or periodically throughout the day on a timer job) rather than on every update in upd, i.e. run something like {0!select by seqNum from tableName} or {0!select by time,sym from tableName}, depending on which columns define a unique key.
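As a small illustration of the periodic approach, the sketch below de-duplicates a table in place. The table candle, its columns and the function name dedup are hypothetical, and time and sym are assumed to define the unique key; scheduling the call at EOD or on a timer is left out.

```q
/ Hypothetical table with one duplicated (time;sym) key
candle:([] time:2025.03.21D09:00 2025.03.21D09:01 2025.03.21D09:00;sym:`abc`abc`abc;close:1.5 1.6 1.7)

/ select by keeps the last row for each key; 0! removes the key again
dedup:{[t] t set 0!select by time,sym from value t}

dedup`candle
candle   / 2 rows; close for 09:00 is 1.7, the most recently received value
```

Because select by groups on the key columns, the result comes back ordered by time and sym rather than in arrival order.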

If the de-duplication is required to be real-time, consider batching with either of the above two methods, so that the upd change doesn't cause the process to fall behind when processing updates. Whether this is needed will depend on the volume and frequency of the incoming data.
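One possible shape for such batching is sketched below, purely as an illustration. The names candle, buffer and flush are hypothetical, a single table is assumed (so the table name passed to upd is ignored), and flush would need to be called from whatever timer mechanism the process uses.

```q
/ Hypothetical target table, keyed on the assumed unique key (time;sym)
candle:([time:`timestamp$();sym:`symbol$()] close:`float$())
/ Unkeyed buffer with the same columns, so appends stay cheap
buffer:0#0!candle

/ upd only appends to the buffer
upd:{[t;x] `buffer insert x}

/ flush applies the whole buffer in a single upsert, then empties it
flush:{[] `candle upsert buffer; delete from `buffer;}

upd[`candle;([] time:2025.03.21D09:00 2025.03.21D09:01;sym:`abc`abc;close:1.5 1.6)]
upd[`candle;([] time:enlist 2025.03.21D09:00;sym:enlist`abc;close:enlist 1.5)]   / duplicate
flush[]
count candle   / 2
```

The trade-off is that data only becomes visible in candle after each flush, so the flush interval bounds how stale queries can be.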

Gavrila P

Apr 2, 2025, 2:51:55 AM
to Data Intellect kdb+/TorQ
Joshua,
Thank you very much for such a detailed answer. Unfortunately, my current knowledge of q and TorQ wasn't enough to implement your suggestions, so I ended up doing it at a higher level in C++. I plan to return to this issue in q later.
Gavrila
