256GB Sharding Limit


Ben McCann

Mar 21, 2014, 12:58:21 PM
to mongod...@googlegroups.com
The docs say the following:

Sharding Existing Collection Data Size
For existing collections that hold documents, MongoDB supports enabling sharding on any collection that contains less than 256 gigabytes of data. MongoDB may be able to shard collections with as much as 400 gigabytes, depending on the distribution of document sizes. The precise size of the limitation is a function of the chunk size and the data size.

What happens if you try to shard after having passed this limit? Will Mongo refuse and throw an error? Or is it just strongly suggested to shard before then, to avoid heavy load during rebalancing? Could we set it up as just a single shard before we get to 256GB, and then add an additional shard and rebalance a while after we've passed this limit?

Thanks,
Ben

Glenn Maynard

Mar 21, 2014, 6:36:43 PM
to mongod...@googlegroups.com
I haven't seen that in the docs.  256 GB of data is very small.  Many archival workloads (lots of data, where only recent data is accessed frequently) would want to put tens of terabytes on each shard, so people may not bother with sharding until well into the TBs.

Why is there any limitation at all?

--
Glenn Maynard

Ben McCann

Mar 21, 2014, 6:43:19 PM
to mongod...@googlegroups.com
Here's the page I was referring to:


--
--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb

---
You received this message because you are subscribed to a topic in the Google Groups "mongodb-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mongodb-user/6W1VhKouaTo/unsubscribe.
To unsubscribe from this group and all its topics, send an email to mongodb-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
about.me/benmccann

Asya Kamsky

Mar 21, 2014, 8:50:26 PM
to mongodb-user
Here is what's behind the limitation (this is described in a related Jira ticket but I'll summarize).

When you initially shard a collection, mongos asks mongod to figure out the split points needed to split the existing single "chunk" (which spans minKey to maxKey, i.e. all documents) into as many chunks as necessary given the chunk size.

It happens that the maximum number of split points that mongod can return is 8,192. With the default chunk size of 64MB and that maximum number of split points, you can shard roughly 400-500GB of data.
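The arithmetic behind those numbers can be checked directly (a sketch; the 8,192 figure is the split-point cap mentioned above, and the real-world 400-500GB range is lower than the theoretical bound because chunks are not packed completely full):

```javascript
// Maximum number of split points mongod returns for the initial split.
const maxSplitPoints = 8192;

// Default chunk size in MB.
const defaultChunkSizeMB = 64;

// Theoretical upper bound on shardable data at the default chunk size:
// 8,192 splits * 64MB per chunk = 524,288MB = 512GB. In practice chunks
// aren't full, which is why the docs quote 256GB as the safe figure and
// 400GB or so as a best case.
const maxShardableGB = (maxSplitPoints * defaultChunkSizeMB) / 1024;

console.log(maxShardableGB); // 512
```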

So what happens if you want to shard a collection that has 800GB or 2TB of data? You need to use a workaround. A simple one is to temporarily raise the chunk size - for example, at 128MB you'll be able to shard a collection with up to 800-900GB of data. Once the initial split is done, you lower the chunk size back to the default, and mongos will split the now-oversized chunks the next time it checks whether they need splitting.
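The temporary chunk size needed for a given collection follows from the same cap. As a rough sketch (assuming the 8,192 split-point limit; `minChunkSizeMB` is a hypothetical helper, not a MongoDB API):

```javascript
// Smallest chunk size (in MB) that lets the initial split of `dataGB`
// of data fit under the 8,192 split-point cap. Hypothetical helper for
// illustration, not part of any MongoDB API.
function minChunkSizeMB(dataGB, maxSplitPoints = 8192) {
  return Math.ceil((dataGB * 1024) / maxSplitPoints);
}

console.log(minChunkSizeMB(800));  // 100 -> in practice round up, e.g. 128MB
console.log(minChunkSizeMB(2048)); // 256
```

At the time, the chunk size itself was changed by updating the `chunksize` document in the `config.settings` collection from a mongos.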

In general, though, I would encourage folks not to wait until their collection is really huge, because, as an example, if you have a 500GB collection that you want to shard across two shards, then 250GB of data will need to be moved between shards to balance the collection. That will take some time and will add load to your cluster. So doing it earlier is a good idea regardless.
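The 250GB figure generalizes: balancing a collection that starts entirely on one shard means migrating all but one shard's portion of it. A minimal sketch:

```javascript
// Data that must migrate off the original shard when a collection living
// entirely on one shard is balanced evenly across `numShards` shards.
function dataMovedGB(totalGB, numShards) {
  return (totalGB * (numShards - 1)) / numShards;
}

console.log(dataMovedGB(500, 2)); // 250
console.log(dataMovedGB(500, 4)); // 375 - more shards means more data moved
```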

Asya




s.molinari

Mar 22, 2014, 10:38:13 AM
to mongod...@googlegroups.com
Although this isn't my thread, I still learned something new and very interesting once more. :-)

Thanks Asya!

Scott

Ben McCann

Mar 22, 2014, 2:46:50 PM
to mongod...@googlegroups.com
Thanks Asya. That's an incredibly helpful description!



Glenn Maynard

Mar 24, 2014, 10:38:01 AM
to mongod...@googlegroups.com
On Fri, Mar 21, 2014 at 7:50 PM, Asya Kamsky <as...@mongodb.com> wrote:
> Here is what's behind the limitation (this is described in a related Jira ticket but I'll summarize).
>
> When you initially shard a collection, mongos asks mongod to figure out the split points to split existing "chunk" (which is minKey to maxKey, or all documents) into as many chunks as necessary given the chunk size.
>
> It happens that the maximum number of split points that mongod can return is 8,192. With default chunk size of 64MB and that maximum split points, you can shard about 400-500GB of data.
>
> So what happens if you want to shard a collection that has 800GB or 2TB of data? Well, you need to use a work-around. A simple one would be to temporarily raise the chunk size - for example, if you make it 128MB then you'll be able to shard a collection with up to 800-900GB of data. Once initial split points are done you will lower the chunk size back to default and the mongos will end up splitting the "larger" chunks when it next checks them for whether they need to be split.
>
> In general though I would encourage folks not to wait till their collection is really huge because, as an example, if you have 500GB collection that you want to shard and you have two shards, then 250GB of data will need to be moved between shards to balance the collection. That will take some time and will add to the load on your cluster. So doing it earlier is a good idea regardless.

250 GB of data on each of two shards is a lot more expensive than 500 GB of data on a single shard. With access patterns where you have terabytes of data but frequently touch only a small part of it, you're not I/O- or CPU-bound against the size of the whole data set, so sharding at that small a scale is just an extra expense.

That said, is this specifically only a problem when initially switching a collection over to be sharded?  If so, another workaround could be to enable sharding on the collection, without actually giving it more than one shard to work with.  (That also helps ensure that sharding issues are taken care of earlier--if you inadvertently try to modify the shard key, use JS queries, or other things that aren't supported with sharding, it'll let you know.)
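The single-shard approach needs nothing beyond the usual commands, run from the mongo shell against a mongos while the cluster still contains one shard (a sketch; `mydb`, `events`, and the `userId` shard key are placeholders for your own names):

```javascript
// Run in the mongo shell, connected to a mongos.
// Both commands work even when the cluster has only a single shard;
// all chunks simply stay on that shard until another one is added.
sh.enableSharding("mydb")
sh.shardCollection("mydb.events", { userId: 1 })
```

From then on, the shard-key restrictions apply immediately, so unsupported operations surface early rather than after the data has grown.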

--
Glenn Maynard

Asya Kamsky

Mar 24, 2014, 1:41:49 PM
to mongodb-user
> That said, is this specifically only a problem when initially switching a
> collection over to be sharded? If so, another workaround could be to enable
> sharding on the collection, without actually giving it more than one shard
> to work with. (That also helps ensure that sharding issues are taken care
> of earlier--if you inadvertently try to modify the shard key, use JS
> queries, or other things that aren't supported with sharding, it'll let you
> know.)

100% correct - this is only an issue with the initial split when you enable sharding
on a collection. Absolutely this can be done when there is only one shard, though
that would normally only be done when you *know* you are going to shard
soon or eventually.

Asya

s.molinari

Mar 25, 2014, 4:26:14 AM
to mongod...@googlegroups.com
Again, this is great stuff.

Are there any basic indicators or methods one could use to tell "now it is time to shard this collection", or even "now it is time to start thinking about sharding"? I mean, setting up sharding on a whim that things might grow enough to need it would be a bit of overkill, right?

Sorry for going somewhat off-topic, but obviously a source of many sharding issues is making the decision to shard at the wrong time (too late). I've read this has happened to several fast-growing organisations and caused them real pain.

So, are there any smart, more or less standard methods or indicators for avoiding missing the proper time to make sharding decisions and take sharding action?

Scott  

Glenn Maynard

Mar 25, 2014, 11:03:03 AM
to mongod...@googlegroups.com
It seems best to set up sharding early, even from the beginning. It'll save headaches later, because you configure these things while your application is lightly loaded and doesn't have much data. You need to figure out your shard key at the start, but you should do that anyway. Down the road, when your system is heavily loaded and you need to add another shard, you're just scaling up something you configured earlier instead of making a big configuration change.

--
Glenn Maynard
