Thank you for your suggestions. I was about to post an update.
Regarding points 1 and 2, I tried something similar earlier today: I defined both sides as CTEs and joined them.
with data_curr as (
  select
    CONCAT(TIME_FORMAT(__time, 'yyyyMM'), '~', entity) as entity,
    "user"
  from my_datasource  -- placeholder datasource name
  where __time between time_parse('20221101', 'yyyyMMdd') and time_parse('20230401', 'yyyyMMdd')
),
data_prev as (
  select
    CONCAT(TIME_FORMAT(TIMESTAMPADD(month, 1, __time), 'yyyyMM'), '~', entity) as entity,
    "user"
  from my_datasource
  where __time between time_parse('20221001', 'yyyyMMdd') and time_parse('20230301', 'yyyyMMdd')
)
select
  data_curr.entity,
  (count(distinct data_prev."user") * 1.0 / count(distinct data_curr."user")) as STICKINESS
from data_curr left join data_prev
  on data_curr."user" = data_prev."user"
  and data_curr.entity = data_prev.entity
GROUP BY 1
With the above query, the execution time came down to ~1 min 15 sec - a significant improvement, but we need more. :)
The other issue with the above query is that I am unable to run it over a longer duration (e.g., 1Y), as it errors out with a "ResourceLimitExceeded" exception:
druid.server.http.maxSubqueryRows=1500000 --> the subquery is generating more than 1.5M rows.
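If raising the limit is an option, the broker cap can be bumped in runtime.properties (the property name is taken from the error above; the value here is just an example, and a higher cap means more subquery rows materialized on the broker, so broker heap needs headroom):

# broker runtime.properties - example value only
druid.server.http.maxSubqueryRows=5000000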
I next tried running a series of UNION ALL queries, each covering a 2-month interval. While this works, it doesn't scale well: the execution time grows linearly, with each branch taking around 9-10 sec (~60 sec for 6 months). It looks like Druid is not executing the branches of the UNION ALL in parallel. Is it possible to tell Druid to run them in parallel?
Coming back to your suggestions:
1. What you are suggesting is to have just one CTE and do a self join on it. I will try this as well and report back.
3. There are about 190K distinct users over the last year. Can Druid handle that many values in an IN clause?
4. This doesn't help, as we need to find stickiness at the entity level.
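On point 1, if I'm reading the suggestion right, the single-CTE self join would look roughly like the sketch below (my_datasource is a placeholder; TIME_FLOOR keeps the month as a timestamp so the month offset can live in the join condition instead of the string key):

with d as (
  select TIME_FLOOR(__time, 'P1M') as month_start, entity, "user"
  from my_datasource
  where __time between time_parse('20221001', 'yyyyMMdd') and time_parse('20230401', 'yyyyMMdd')
)
select
  curr.month_start, curr.entity,
  count(distinct prev."user") * 1.0 / count(distinct curr."user") as STICKINESS
from d curr left join d prev
  on curr."user" = prev."user"
  and curr.entity = prev.entity
  and prev.month_start = TIMESTAMPADD(month, -1, curr.month_start)
group by 1, 2

This scans the datasource once instead of twice, though the join itself still has to fit within the broker's subquery limits.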