This is the query I came up with for this use case, where reprocessed_ts indicates when the record was 'last loaded into druid', so if its newer version of a previous message the inner query will select those.
select count(*) from dupe_test
where
company_name = 'AcmeInc'
and __time >= '2021-03-11 01:00:00' and __time < '2021-03-11 02:00:00'
and reprocessed_ts in
(
select latest(reprocessed_ts)
from dupe_test
where company_name = 'AcmeInc'
and __time >= '2021-03-11 01:00:00' and __time < '2021-03-11 02:00:00'
group by message_id
)
however it feels like I may need to have the inner query concat message_id+reprocessed_ts, to ensure it matches properly on the outer query.
This feels like it should perform better than using a temp table..