BigCouch documentation recommends to overshard because you can't (currently) add shards later, and the number of shards limits your scalability. You can only have up to Q x N nodes in a cluster, and they will have 1 shard each. Adam@cloudant's talk on BigCounch says it makes sense to overprovision as long as there are file descriptors to handle it and uses 64 as an example.
So I want to pick a really big value for Q, right? Our system is expected to grow by TB daily, and exceed PB over 25 years. I figure thousands of shards would be overkill for now, but let us continue to scale as scientists around the world try to read huge data sets from our multi-PB store. But…
Our puny test system was created with only 256 shards (Q=256, N=3, W=R=2).
It crashed and burned on fairly small batch updates (<500KB) on a 3-node cluster (each with 3 cores, 6gb RAM). I didn't expect Q to have such an impact on performance, but it really does. As Q goes up, the time to do the bulk insert goes up dramatically:
Q sec
4 6.4
8 7.7
16 10.8
32 16.9
64 37.0
When I started with BigCouch a couple months ago, I thought this would be able to scale from a couple of regular PCs to a sizable server farm as long as Q was big enough. Now it looks like you either have to pick Q low enough to run on those regular PCs (but then you can't scale past a couple dozen nodes) or you need to start with a big HW investment to support Q high enough to scale into petabytes.
Am I missing something? Has anyone else run into this? Why would it take so much more CPU? Has anyone worked around this?
And sorry adam@cloudant, this isn't limited by file descriptors. Tried upping that limit too...
On Friday, October 12, 2012 12:16:14 AM UTC+2, ooi wrote:
> BigCouch documentation recommends to overshard because you can't > (currently) add shards later, and the number of shards limits your > scalability. You can only have up to Q x N nodes in a cluster, and they > will have 1 shard each. Adam@cloudant's talk on BigCounch says it makes > sense to overprovision as long as there are file descriptors to handle it > and uses 64 as an example.
> So I want to pick a really big value for Q, right? Our system is expected > to grow by TB daily, and exceed PB over 25 years. I figure thousands of > shards would be overkill for now, but let us continue to scale as > scientists around the world try to read huge data sets from our multi-PB > store. But…
> Our puny test system was created with only 256 shards (Q=256, N=3, > W=R=2). It crashed and burned on fairly small batch updates (<500KB) on a > 3-node cluster (each with 3 cores, 6gb RAM). I didn't expect Q to have > such an impact on performance, but it really does. As Q goes up, the time > to do the bulk insert goes up dramatically:
> When I started with BigCouch a couple months ago, I thought this would be > able to scale from a couple of regular PCs to a sizable server farm as long > as Q was big enough. Now it looks like you either have to pick Q low > enough to run on those regular PCs (but then you can't scale past a couple > dozen nodes) or you need to start with a big HW investment to support Q > high enough to scale into petabytes.
> Am I missing something? Has anyone else run into this? Why would it take > so much more CPU? Has anyone worked around this?
> And sorry adam@cloudant, this isn't limited by file descriptors. Tried > upping that limit too...
Hi, this is a fair question and worthy of a longer response than I can hack out on a phone. I'll try to turn around something more helpful later today, but in the short term:
1) you are correct. The optimal value for q depends on both DB scale and the resources available.
2) while bigcouch doesn't support automatic resharding, it is standard practice to manually reshard as you grow. Manual here means creating a new database and replicating, so it's not that brutal, nor that manual.
3) bigcouch is built to support both a single massive database or millions of small databases. In the latter case it is routine to have a node count far exceeding n*q.
I'll try to add more specifics later.
Thanks, Mike Miller (Cloudant)
On Nov 10, 2012, at 11:34 PM, Idan Shechter <lesbitc...@gmail.com> wrote:
> I am interested to hear any opinion about it too.
> On Friday, October 12, 2012 12:16:14 AM UTC+2, ooi wrote:
> BigCouch documentation recommends to overshard because you can't (currently) add shards later, and the number of shards limits your scalability. You can only have up to Q x N nodes in a cluster, and they will have 1 shard each. Adam@cloudant's talk on BigCounch says it makes sense to overprovision as long as there are file descriptors to handle it and uses 64 as an example.
> So I want to pick a really big value for Q, right? Our system is expected to grow by TB daily, and exceed PB over 25 years. I figure thousands of shards would be overkill for now, but let us continue to scale as scientists around the world try to read huge data sets from our multi-PB store. But…
> Our puny test system was created with only 256 shards (Q=256, N=3, W=R=2). It crashed and burned on fairly small batch updates (<500KB) on a 3-node cluster (each with 3 cores, 6gb RAM). I didn't expect Q to have such an impact on performance, but it really does. As Q goes up, the time to do the bulk insert goes up dramatically:
> When I started with BigCouch a couple months ago, I thought this would be able to scale from a couple of regular PCs to a sizable server farm as long as Q was big enough. Now it looks like you either have to pick Q low enough to run on those regular PCs (but then you can't scale past a couple dozen nodes) or you need to start with a big HW investment to support Q high enough to scale into petabytes.
> Am I missing something? Has anyone else run into this? Why would it take so much more CPU? Has anyone worked around this?
> And sorry adam@cloudant, this isn't limited by file descriptors. Tried upping that limit too...