Hi all,
I wanted to put this question directly to the developers, as I'm sure you'll be the most knowledgeable on the subject.
The basic request is that I would like to randomly sample the data before the rollups and aggregations are applied.
For instance, I have hundreds of thousands of users of a web app, each with a unique numerical ID, and each session sends events which make their way into Druid. When building an analysis or model on the data, I want to be able to do either of the following:
- Sample the users, e.g. use only a random 10% of the users, along with all of their events
- Sample the events, e.g. use only 10% of all events across all users
So far I've considered generating a random four-digit "index" and appending it to each event as it is ingested into Druid, and also coming up with a filtering scheme based on the unique numerical ID.
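To make that idea concrete, here is a rough sketch in Python of the tagging step as it might run before ingestion. This is only an illustration, not anything Druid provides: the column names `user_bucket` and `sample_idx`, the helper names, and the choice of SHA-256 for hashing the ID are all assumptions made up for the example.

```python
import hashlib
import random

BUCKETS = 10_000  # four-digit index, values 0..9999


def user_bucket(user_id: int) -> int:
    """Deterministic bucket per user, so all of a user's events share it."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def tag_event(event: dict, rng: random.Random) -> dict:
    """Return a copy of the event with two hypothetical sampling columns."""
    tagged = dict(event)
    # Same value for every event of a user: supports sampling by user.
    tagged["user_bucket"] = user_bucket(event["user_id"])
    # Independent random value per event: supports sampling by event.
    tagged["sample_idx"] = rng.randrange(BUCKETS)
    return tagged


def in_sample(bucket: int, fraction: float) -> bool:
    """True if the bucket falls in the lowest `fraction` of the bucket range."""
    return bucket < round(fraction * BUCKETS)


if __name__ == "__main__":
    rng = random.Random(42)
    events = [{"user_id": uid, "action": "click"} for uid in (101, 101, 202)]
    for e in (tag_event(e, rng) for e in events):
        print(e, in_sample(e["user_bucket"], 0.10), in_sample(e["sample_idx"], 0.10))
```

The query side would then filter on `user_bucket < 1000` to get roughly 10% of users with all of their events, or on `sample_idx < 1000` to get roughly 10% of events across all users.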
But I wanted to ask: is there a better way to do this in Druid?