Random Sampling with Druid

tvan...@cardinalpeak.com

Oct 16, 2013, 1:08:49 PM

to druid-de...@googlegroups.com

Hi all,

I posted this question on Stack Overflow as well: http://stackoverflow.com/questions/19391400/random-sampling-from-druid-databases (if you want to answer it there and get some points).

And I wanted to put it directly to the developers as I'm sure that you'll be the most knowledgeable on the subject.

The basic request is that I would like to randomly sample the data before the rollups and aggregations are applied.

For instance, I have hundreds of thousands of users of a web app, each with a unique numerical ID, and each session sends events that make their way into Druid. When building an analysis or model on the data, I want to:

1) Sample the users, e.g. only use a random 10% of the users and all of their events.
2) Sample the events, e.g. only use 10% of all events across all users.

So far I've considered generating a random 4-digit "index" to append to the event data as it is ingested into Druid, and also coming up with a filtering scheme for the unique numerical ID.

But I wanted to see if there's a better way to do this in Druid?
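A minimal sketch of the two ideas above, in plain Python and independent of Druid itself. The field names `event_sample_index` and `user_sample_bucket`, the helper names, and the choice of MD5 with 100 buckets are all assumptions made for illustration; nothing here is part of Druid.

```python
import hashlib
import random

NUM_BUCKETS = 100  # hypothetical: one bucket per 1% of users


def user_bucket(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a user ID to a stable bucket so all of a user's events land together."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def tag_event(event: dict, rng: random.Random) -> dict:
    """Return a copy of the event with two extra sampling fields (hypothetical names)."""
    tagged = dict(event)
    tagged["event_sample_index"] = rng.randrange(10000)  # random 4-digit index, 0..9999
    tagged["user_sample_bucket"] = user_bucket(event["user_id"])
    return tagged


# Simulate ingestion: 1000 users with 5 events each.
rng = random.Random(42)
events = [{"user_id": uid, "action": "click"} for uid in range(1000) for _ in range(5)]
tagged = [tag_event(e, rng) for e in events]

# Roughly 10% of users, with all of their events.
user_sample = [e for e in tagged if e["user_sample_bucket"] < 10]
# Roughly 10% of events across all users.
event_sample = [e for e in tagged if e["event_sample_index"] < 1000]
```

Because the user bucket is derived from a hash of the ID rather than a fresh random draw, the same user always falls in the same bucket, which is what keeps a sampled user's events together.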

Eric Tschetter

Oct 17, 2013, 10:23:46 AM

to druid-de...@googlegroups.com

tvanrooy,

Currently, there is no built-in mechanism for sampling. I think it should be possible to implement the sampling as a sort of filtering firehose or something and have it be configuration-driven.

In general, when we run into dimensions that have a crazy high cardinality, we do one of two things: (1) run a sketch on top of it and store the sketch instead of the dimension, or (2) decide that it's not actually important and don't include it. Depending on what kind of analysis you are trying to do, the sketch approach might or might not work for you. Also, the sketches that Metamarkets has implemented are currently not available as open source plugins, so it would require some implementing to make them happen (multiple people have asked about this functionality recently though, so hopefully something will become available sooner rather than later).

So, the short answer is that whatever sampling algorithm you want, you will have to implement it somewhere along the ingestion path. If you implement it in a way that you think other people might be able to take advantage of, I think we'd be very happy to take the contribution.
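As a rough illustration of a configuration-driven filter sitting on the ingestion path, here is a sketch in plain Python. It does not use or imitate Druid's firehose interface; the function name `sampling_filter` and the config keys `key_field` and `sample_rate` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def sampling_filter(events: Iterable[Dict], config: Dict) -> Iterator[Dict]:
    """Yield only the events whose hashed key falls below the configured rate.

    Hashing the key (rather than drawing a random number per event) keeps the
    decision deterministic, so every event with the same key is kept or dropped.
    """
    key_field = config["key_field"]                 # hypothetical config key
    threshold = int(config["sample_rate"] * 2**32)  # hypothetical config key
    for event in events:
        digest = hashlib.sha1(str(event[key_field]).encode("utf-8")).digest()
        if int.from_bytes(digest[:4], "big") < threshold:
            yield event


# Example: keep roughly 10% of users before anything is rolled up.
config = {"key_field": "user_id", "sample_rate": 0.1}
incoming = [{"user_id": i, "page": "/home"} for i in range(10000)]
kept = list(sampling_filter(incoming, config))
```

Events that are dropped here never reach rollup or aggregation, so the trade-off compared with tagging events is that the unsampled data is no longer available for other queries.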

--Eric

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/94598bc3-551c-44bb-af8b-ffd9de5c7dc8%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Theo Van Rooy

Oct 17, 2013, 1:16:41 PM

to druid-de...@googlegroups.com

Thanks for the response.

Do you have any resources, documentation, or other threads that you can refer me to in regard to sketches?

For a first crack, we'll try ingesting a random index with the records, along with a hash of our unique user IDs; this should give us most of what we need. If it turns out that we develop something as a core part of Druid, I'll be happy to share it with the community.

Theo

Eric Tschetter

Oct 21, 2013, 9:13:55 AM

to druid-de...@googlegroups.com

Theo,

> Do you have any resources, documentation, or other threads that you can refer me to in regard to sketches?

There are some random notes, but generally I've been telling people who want to implement this to hop in the IRC channel, and I can help explain things if they have problems. The basic idea of what needs to be implemented to do sketches is:

1) An AggregatorFactory for some aggregators that understand the new type.

2) A ComplexMetricSerde that defines how to serialize and deserialize the sketch.

3) You must register the ComplexMetricSerde using the ComplexMetrics class.
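To make the three roles above concrete, here is a toy analogue in plain Python. It is not Druid's API: `KmvSketch`, `serialize`, `deserialize`, `SERDES`, and `register_serde` are hypothetical stand-ins for the aggregator, the serde, and the registration step. The sketch itself is a simple K-minimum-values estimator, swapped in because it is short; it is not the sketch Metamarkets implemented.

```python
import hashlib
import struct

K = 256  # hypothetical sketch size; larger K gives a tighter estimate


class KmvSketch:
    """Role 1 (aggregation): keeps the K smallest distinct 64-bit hashes seen."""

    def __init__(self, hashes=()):
        self.hashes = sorted(set(hashes))[:K]

    def add(self, value: str) -> None:
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        if len(self.hashes) == K and h >= self.hashes[-1]:
            return  # too large to be among the K smallest
        self.hashes = sorted(set(self.hashes) | {h})[:K]

    def merge(self, other: "KmvSketch") -> "KmvSketch":
        """Combine two partial sketches, as an aggregator would across rows or segments."""
        return KmvSketch(self.hashes + other.hashes)

    def estimate(self) -> float:
        if len(self.hashes) < K:
            return float(len(self.hashes))  # fewer than K distinct values: count is exact
        return (K - 1) / (self.hashes[-1] / 2**64)


def serialize(sketch: KmvSketch) -> bytes:
    """Role 2 (serde): a count followed by that many big-endian 64-bit hashes."""
    return struct.pack(f">I{len(sketch.hashes)}Q", len(sketch.hashes), *sketch.hashes)


def deserialize(data: bytes) -> KmvSketch:
    (n,) = struct.unpack_from(">I", data, 0)
    return KmvSketch(struct.unpack_from(f">{n}Q", data, 4))


SERDES = {}  # Role 3 (registration): type name -> (serializer, deserializer)


def register_serde(type_name: str, ser, de) -> None:
    SERDES[type_name] = (ser, de)


register_serde("kmvSketch", serialize, deserialize)

# Two partial sketches over overlapping user sets (20000 distinct users in total).
a, b = KmvSketch(), KmvSketch()
for i in range(12000):
    a.add(f"user-{i}")
for i in range(8000, 20000):
    b.add(f"user-{i}")

ser, de = SERDES["kmvSketch"]
merged = de(ser(a)).merge(de(ser(b)))
approx_users = merged.estimate()
```

The point of the split is that the stored value is the serialized sketch rather than the high-cardinality dimension, and sketches from different rows can still be merged at query time.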

And you can read the blog post we wrote about the sketch that we implemented internally.

http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/

> For a first crack, we'll try ingesting a random index with the records, along with a hash of our unique user IDs; this should give us most of what we need. If it turns out that we develop something as a core part of Druid, I'll be happy to share it with the community.

Keep us informed of how it goes :).

--Eric

Theodore Van Rooy

Oct 29, 2013, 5:27:16 PM

to druid-de...@googlegroups.com

Well, to report back...

We went ahead and implemented a random index. Essentially, at ingest we generate a random number and pop it into a field. While the index remains "static", it is more than adequate for random sampling, as there are 16-17 digits of precision in the index.

To complete a random sample, I simply run a regex on the random index field and select all values that start with, say, 0.1 ("^0.1"), which gives me 10% of the cases.

Likewise, I could get 1% of the data with "^0.01", and so on.
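A small simulation of the prefix-regex idea, in plain Python rather than as a Druid query. The list `random_index` and the helper `sample_fraction` are hypothetical names. The dot is escaped here (`^0\.1`); the unescaped patterns above select the same rows when every value starts with "0.". Values are formatted in fixed-point notation on the assumption that the stored strings never use scientific notation.

```python
import random
import re

rng = random.Random(7)
# Simulated contents of the random index field, stored as fixed-point strings.
random_index = [f"{rng.random():.16f}" for _ in range(100_000)]


def sample_fraction(pattern: str) -> float:
    """Fraction of index values selected by a prefix regex."""
    regex = re.compile(pattern)
    hits = sum(1 for value in random_index if regex.search(value))
    return hits / len(random_index)


ten_percent = sample_fraction(r"^0\.1")   # strings starting with "0.1"
one_percent = sample_fraction(r"^0\.01")  # strings starting with "0.01"
other_ten = sample_fraction(r"^0\.7")     # a different, non-overlapping ~10% slice
```

Changing the leading digit, as in the last line, selects a slice that shares no rows with the first one, which gives ten non-overlapping 10% samples from the same static index.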

I suppose if you are doing some model building which requires repeated sampling you could quickly overuse your 10% samples... but other than that this approach is fast and works well for getting a small random subset of your data.

Theo
