Thousands of datasources vs one big

Julien Lavigne du Cadet

May 13, 2014, 9:03:59 AM

to druid-de...@googlegroups.com

Hi,

As part of evaluating Druid, we’re wondering if there is a best practice regarding the number of datasources. Let me explain.

We have a datasource with:

- a partner id dimension,

- some dimensions that are independent of the partner (e.g. device type => desktop, mobile, …),

- some dimensions that depend on the partner (e.g. category id, which is a finite set per partner with a cardinality that can range from a few elements to a few thousand)

And we also have the following characteristics:

- We *always* specify the partner id in our queries (we have other systems for cross-partner queries).

- We have a few thousand partners (let’s say 15000 for the conversation).

- For each partner we can have from a few hundred MB to a few GB. The whole dataset is several TB.

We see value in “pre-sharding” by partner id, that is, putting each partner in its own datasource. Why? Mainly because that would allow us to have custom rules for loading/unloading segments into our tiers for each partner. We’re also wondering if the performance would be better (because we would have to read far fewer segments than if everything were in the same datasource). On the other hand, we would have more segments than if we stored everything in the same datasource, but by how much is difficult to quantify (many more of the segments would not be full).

Would Druid support thousands of datasources? Or will we hit some constraints?

Do you have any recommendations towards one pattern or the other?

Thanks in advance,

Julien

Eric Tschetter

May 13, 2014, 10:32:37 AM

to druid-de...@googlegroups.com

The general recommendation is to put them all in the same data source as long as they have the same set of dimensions (the values can differ; what matters is that the dimension names are the same).

As far as scanning less data is concerned, that's what the indexes are for. Druid will automatically scan less data when you specify a partner ID to filter on.

This does mean that you have less control over individual tiers, though.
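To illustrate the filtering described above, here is a minimal sketch of a native topN query against a single shared datasource, restricted to one partner with a selector filter. The datasource name `events`, the dimension names `partner_id` and `category_id`, and the metric `impressions` are assumptions made for illustration; none of them come from this thread.

```python
import json


def partner_topn_query(partner_id, dimension, metric, interval, threshold=10):
    """Build a Druid native topN query limited to a single partner.

    All names passed in (and the "events" datasource) are hypothetical.
    """
    return {
        "queryType": "topN",
        "dataSource": "events",  # assumed name of the shared datasource
        "intervals": [interval],
        "granularity": "all",
        # The selector filter is what lets the indexes skip other partners' rows.
        "filter": {
            "type": "selector",
            "dimension": "partner_id",  # assumed dimension name
            "value": str(partner_id),
        },
        "dimension": dimension,
        "metric": metric,
        "threshold": threshold,
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric}
        ],
    }


query = partner_topn_query(
    partner_id=42,
    dimension="category_id",
    metric="impressions",
    interval="2014-05-01/2014-05-14",
)
print(json.dumps(query, indent=2))
```

With one datasource per partner, the same query would drop the filter and vary `dataSource` instead.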

--Eric

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/7412bdfb-b0e3-48a4-9047-032098f9e06b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Julien Lavigne du Cadet

May 14, 2014, 8:57:35 AM

to druid-de...@googlegroups.com

I understand that the recommended way is to have one datasource, but could you elaborate on why it would be better than one datasource per partner?


Fangjin Yang

May 15, 2014, 12:41:42 PM

to druid-de...@googlegroups.com

Hi Julien, see inline.

On Tuesday, May 13, 2014 6:03:59 AM UTC-7, Julien Lavigne du Cadet wrote:

We see value in “pre-sharding” by partner id, that is, putting each partner in its own datasource. Why? Mainly because that would allow us to have custom rules for loading/unloading segments into our tiers for each partner.

This is what we do.
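As a rough sketch of what per-partner load rules could look like with one datasource per partner: the rule types `loadByPeriod`, `loadForever` and `dropForever` and the `tieredReplicants` field follow the retention-rule format documented in recent Druid releases, which may differ from the version discussed in this thread. The tier names `hot` and `cold`, the periods, and the datasource names are hypothetical.

```python
def retention_rules(hot_period, keep_older_data):
    """Return an ordered rule list for one datasource.

    Rules are evaluated top to bottom and the first match applies.
    Tier names "hot" and "cold" are assumptions, not from the thread.
    """
    rules = [
        # Keep the most recent data on the fast tier.
        {"type": "loadByPeriod", "period": hot_period,
         "tieredReplicants": {"hot": 2}},
    ]
    if keep_older_data:
        # Everything older stays queryable on a cheaper tier.
        rules.append({"type": "loadForever", "tieredReplicants": {"cold": 1}})
    else:
        # Everything older is unloaded from the cluster.
        rules.append({"type": "dropForever"})
    return rules


# Hypothetical per-partner datasources with different policies.
rules_by_datasource = {
    "partner_00042": retention_rules("P3M", keep_older_data=True),
    "partner_00917": retention_rules("P1M", keep_older_data=False),
}
```

With a single shared datasource, only one such rule list would apply to every partner, which is the loss of per-tier control mentioned earlier in the thread.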

We’re also wondering if the performance would be better (because we would have to read far fewer segments than if everything were in the same datasource). On the other hand, we would have more segments than if we stored everything in the same datasource, but by how much is difficult to quantify (many more of the segments would not be full).

One thing to consider is that Druid can automatically merge segments of a datasource together to get some ideal segment size.
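The segment-count trade-off raised in the question can be roughed out with a back-of-envelope model. This sketch assumes that every datasource needs at least one segment per time chunk and that segments are otherwise filled to a target size; it ignores the automatic merging mentioned above. The 500 MB target and the partner sizes are made-up numbers.

```python
import math


def segments_shared(partner_sizes_mb, target_mb):
    """Segments per time chunk when all partners share one datasource."""
    return max(1, math.ceil(sum(partner_sizes_mb) / target_mb))


def segments_sharded(partner_sizes_mb, target_mb):
    """Segments per time chunk with one datasource per partner.

    Each partner needs at least one segment, however small its data is.
    """
    return sum(max(1, math.ceil(size / target_mb)) for size in partner_sizes_mb)


# Hypothetical data volume per partner for one time chunk, in MB.
sizes = [5, 20, 700, 1200]
target = 500  # assumed target segment size in MB

print(segments_shared(sizes, target))   # 4 segments, mostly full
print(segments_sharded(sizes, target))  # 7 segments, several underfilled
```

Under these assumptions the extra segments come almost entirely from small partners, each of which still costs one segment per time chunk.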

Would Druid support thousands of datasources? Or will we hit some constraints?

We haven't seen thousands of datasources in production (but we have seen on the order of hundreds), although the code has been designed to accommodate this scale. If there are any constraints, they will likely be inefficient implementations that should be relatively easy to fix, rather than fundamental architectural problems.

Do you have any recommendations towards one pattern or the other?

If the schemas of the different partners can be made common (same dimensions and metrics), I would recommend a single datasource that is filtered. As Eric mentioned, Druid is smart enough to scan only what it needs when filters are applied. Do you know what the total dimension and metric size would be for all partners? How much is in common versus unique?

Julien Lavigne du Cadet

May 20, 2014, 9:40:25 AM

to druid-de...@googlegroups.com

"Do you know what the total dimension and metric size would be for all partners? How much is in common versus unique?"

The dimensions with a small cardinality are common to all partners. However, each partner has one dimension with an average cardinality of 100 and a second with an average cardinality of 10. The cardinality across the whole dataset is therefore ~1,500,000 for the first and ~150,000 for the second. We need to run topN queries on these two dimensions (per partner only!).
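The figures above follow from the 15000 partners assumed earlier in the thread, treating each partner's values as distinct from every other partner's. A quick check, with variable names chosen only for this illustration:

```python
partners = 15_000  # partner count assumed earlier in the thread

# Average per-partner cardinality of the two partner-specific dimensions.
avg_cardinality = {"first": 100, "second": 10}

# If no values are shared between partners, cardinalities simply add up.
global_cardinality = {
    name: partners * avg for name, avg in avg_cardinality.items()
}

print(global_cardinality)  # {'first': 1500000, 'second': 150000}
```

A per-partner topN only has to rank that partner's roughly 100 (or 10) values, even though the shared dimension holds far more.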

As for the metric size, I'm not sure what you are referring to. We're looking at several TBs of data with significant differences between partners (some would have a few MBs and others would be at least two orders of magnitude bigger, at several GBs).

Thanks!

Julien

Fangjin Yang

May 20, 2014, 7:58:48 PM

to druid-de...@googlegroups.com

Hi Julien, I would recommend having multiple datasources for partners. I think you'll be getting a lot more flexibility as you scale.

-- FJ
