Hi,
As part of evaluating Druid, we’re wondering whether there is a best practice regarding the number of datasources. Let me explain.
We have a datasource with:
- a partner id dimension,
- some dimensions that are independent of the partner (e.g. device type => desktop, mobile, …),
- some dimensions that are dependent on the partner (e.g. category id, which is a finite set per partner with a cardinality ranging from a few elements to a few thousand)
And we also have the following characteristics:
- We *always* specify the partner id in our queries (we have other systems for cross-partner queries).
- We have a few thousand partners (let’s say 15,000 for the conversation).
- Each partner can have from a few hundred MB to a few GB of data. The whole dataset is several TB.
We see value in “pre-sharding” by partner id, that is, putting each partner id in its own datasource. Why? Mainly because that would allow us to have custom rules for loading/unloading each partner’s segments into our tiers. We’re also wondering whether performance would be better (because we would have to read far fewer segments than if everything were in the same datasource). On the other hand, we would have more segments than if we were storing everything in one datasource, but that’s difficult to quantify (we would have many more segments that are not full).
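For context, the per-tier control described above corresponds to Druid’s coordinator load/drop rules, which are configured per datasource and evaluated in order. A sketch of what a rule chain for one partner’s datasource might look like (the tier names `hot` and `_default_tier`, the period, and the replicant counts are illustrative assumptions, not values from this thread):

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": { "hot": 2 }
  },
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 1 }
  }
]
```

With one datasource per partner, each partner could carry its own chain like this; with a single shared datasource, one chain applies to all partners’ segments at once.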
Would Druid support thousands of datasources? Or would we hit some constraints?
Do you have any recommendations for one pattern or the other?
Thanks in advance,
Julien
--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/7412bdfb-b0e3-48a4-9047-032098f9e06b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
The general recommendation is to put them all in the same datasource as long as they have the same set of dimensions (the values can be different; what matters is that the dimension names are the same).

As far as scanning less data is concerned, that’s what the indexes are for. Druid will automatically scan less data when you specify a partner id to filter on. This does mean that you have less control over individual tiers, though.
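As a sketch of what that looks like in practice (the datasource name, dimension name, interval, and metric here are assumptions, not from the thread), a native timeseries query with a selector filter on the partner id lets Druid use the dimension’s bitmap index to skip rows belonging to all other partners:

```json
{
  "queryType": "timeseries",
  "dataSource": "events",
  "intervals": ["2015-01-01/2015-02-01"],
  "granularity": "day",
  "filter": {
    "type": "selector",
    "dimension": "partner_id",
    "value": "12345"
  },
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ]
}
```

The filter prunes within segments via the index, but every segment covering the queried interval is still assigned to the query, which is the trade-off being weighed against per-partner datasources.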