Versioned data - using latest record

saurabh singh

unread,

Feb 22, 2021, 6:29:03 PM2/22/21

to druid...@googlegroups.com

Hi Druid Users,

I am new to Druid. Our team is building a dashboarding solution and we chose Druid as our data warehouse. We stream data through Kafka into Druid.

I am trying to understand something basic here around updating records. Say I have a Druid data source called Order. Order has a status field, and the status keeps changing based on the life cycle of the order (New -> OrderSent -> Accepted).

Questions:

1) Can I update the same Order record, every time its status changes as per Kafka event ( New -> OrderSent -> Accepted )?

2) If the above status field mutation is not a suggested approach, then will Druid create 3 records for same order when its status transitions, as shown below? If yes, how can my BI tool use the very last record and ignore the first two records from the selection? Currently my BI tool uses JDBC to call Druid for OLAP.

OrderId, Status

001, New

001, OrderSent

001, Accepted

Apologies if this is too basic, but it's not straight forward for someone like me coming from a SQL background.

Thanks,

Saurabh

Ben Krug

unread,

Feb 22, 2021, 8:04:08 PM2/22/21

to druid...@googlegroups.com

Hi Saurabh -

Not too basic, and you're right, it's different from RDBMS SQL. You can't do updates on an existing record. You can overwrite them, but then everything is grouped by time intervals, and you'd have to overwrite everything for that time interval. (The storage is in immutable segments, which are arranged by time intervals, and can be entirely - or not at all - overwritten.)

With native druid queries, you can use the stringLast aggregator to get the latest value. With druid SQL, I guess you could probably do some subquery logic, but it might get a little messy or performance might suffer. Eg,

select * from myDataSource
where someKey = someVal
and "__time" =
(select max ("__time")
from myDataSource where someKey = someVal)

(I'm guessing there's a better way - that just comes to the top of my head.)

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/CAJxo-gk0kdHu2kd65JERwQi8NOyxnf6mnVE-BW01ccVL-mKYRA%40mail.gmail.com.

saurabh singh

unread,

Feb 23, 2021, 12:40:43 PM2/23/21

to druid...@googlegroups.com

Thanks a lot Ben. Good to know.

Regards,

Saurabh

To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/CAP%2BetTvzPPA6EYcRQPKVFkbXbUyKLmF_pPu5XMyUFzubiT_t2g%40mail.gmail.com.

--

Saurabh Kumar Singh
+44 (0) 754 525 5396

Peter Marshall

unread,

Mar 10, 2021, 3:07:08 AM3/10/21

to Druid User

Hey! Also note that you have the LATEST function (https://druid.apache.org/docs/latest/querying/sql.html#aggregation-functions).

Please do also be aware that as Druid is partly a time-series database, the sense of "row-level updates" is not the same as you would be familiar with in a traditional database: updates to Druid data are performed for all data over an interval of time versus your traditional UPDATE command.

danho...@gmail.com

unread,

Mar 12, 2021, 10:36:45 AM3/12/21

to Druid User

This is the query I came up with for this use case, where reprocessed_ts indicates when the record was 'last loaded into druid', so if its newer version of a previous message the inner query will select those.

select count(*) from dupe_test

where

company_name = 'AcmeInc'

and __time >= '2021-03-11 01:00:00' and __time < '2021-03-11 02:00:00'

and reprocessed_ts in

(

select latest(reprocessed_ts)

from dupe_test

where company_name = 'AcmeInc'

and __time >= '2021-03-11 01:00:00' and __time < '2021-03-11 02:00:00'

group by message_id

)

however it feels like I may need to have the inner query concat message_id+reprocessed_ts, to ensure it matches properly on the outer query.

This feels like it should perform better than using a temp table..

Peter Marshall

unread,

Mar 23, 2021, 8:49:27 AM3/23/21

to Druid User

FYI I've seen IN take more than one parameter - like where (x1, y1) IN (SELECT x2 y2 FROM bananas) - maybe you could use that for your compound?

Reply all

Reply to author

Forward