distributing storage?

project2501

unread,

Dec 1, 2011, 10:06:45 AM12/1/11

to mongodb-user

Hi,
I know about mongodb's auto-scaling/sharding and that it is
triggered by storage limits in size.
But can I distribute my document writes evenly across a cluster? In
other words, I want mongos to do something like a round-robin or some
other algorithm to distribute writes across the cluster as they
arrive.

Is it possible?

thanks.

Sergei Tulentsev

unread,

Dec 1, 2011, 10:22:08 AM12/1/11

to mongod...@googlegroups.com

If you're updating a document, this write will go to its shard. If you're inserting a new document, it will go to a shard where its chunk lives. If you want random distribution of new documents, use random shard key, like md5 of something.
But there are some serious drawbacks in this approach. For example, a simple retrieval by _id will go to all shard (unless you specify shard key, which you will have to calculate again).

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

--
Best regards,
Sergei Tulentsev

project2501

unread,

Dec 1, 2011, 10:26:08 AM12/1/11

to mongodb-user

Hi,
Thanks for the tip. If they are randomly inserted into shards, and
evenly distributed.
Then at query time, the id's are returned with no additional penalty,
yes? Because
all the shards got the query.

Is it then possible to know what shard the result id came from and
retrieve it from only that shard?

thanks.

On Dec 1, 10:22 am, Sergei Tulentsev <sergei.tulent...@gmail.com>
wrote:

Sergei Tulentsev

unread,

Dec 1, 2011, 10:31:10 AM12/1/11

to mongod...@googlegroups.com

> Then at query time, the id's are returned with no additional penalty, yes? Because all the shards got the query.

Yeah, this is the penalty. All shards get spammed on each request.
If you supply correct shard key (along with _id), then query will be efficient.

Look at your data, maybe there is another good shard key, that has good distribution and is not artificial.

Marc

unread,

Dec 1, 2011, 2:51:09 PM12/1/11

to mongodb-user

As a general rule, it is best to let the Mongo Balancer handle
managing your chunks for you. This involves choosing a shard key that
has enough granularity that is can be split into many chunks, and is
non-incrementing to prevent creating a "hot-spot" on a single server
and ensure that new data is written evenly among the shards. This is
the "official" 10gen policy.

The Mongo Document "Choosing a Shard Key" provides some good tips and
best practices on the subject.
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key

A big pitfall of choosing a shard key that has the same cardinality as
the number of servers is what to do when those servers fill up.
Consider this example:

"I have four servers, so I am going to create a key, "goto_shard",
with the possible value of 1-4 and shard on that. Each new document
will be assigned a "goto_shard" value of 1-4 in a round-robin
fashion."

This may be okay in the beginning, as each server is filled at the
same rate. However, what will happen when these four servers get
full? A fifth member can be added to the collection, and the range of
"goto_shard" can be expanded to 1-5. However, the chunks on shards
1-4 can not be moved. In order to rebalance the documents, the value
of "goto_shard" will have to be manually changed on all of the
documents that you wish to move to the new shard. To quote "Scaling
MongoDB" by Kristina Chodorow, "If you want to manually distribute
your data, do not use MongoDB's built-in sharding. You'll be fighting
it all the way." This book is a terrific resource for learning about
how sharding works in MongoDB and provides good examples of the "best
practices" to use when creating sharded collections.
http://shop.oreilly.com/product/0636920018308.do

All of that being said, there is a feature request in the pipeline to
allow the user to specify some affinity for which shard the balancer
sends each chunk to. (This is mostly for location awareness, but
could probably be adapted for round-robin style writes) The Jira case
is here:
https://jira.mongodb.org/browse/SERVER-2545 It is tentatively
scheduled for version 2.1.

Other users have asked similar questions regarding round-robin style
writes. You may find the discussion titled "Forcing a Round Robin on
Sharded Collections" beneficial:
http://groups.google.com/group/mongodb-user/browse_thread/thread/b943d60caf3d6347

On the topic of choosing a shard key, for read efficiency, the best
practice is to choose a shard key that your application will be
querying on. This is what Sergei Tulentsev was talking about in
regard to the drawbacks of using a random shard key. If you shard on
the key "name", and run a query on that key, such as:
> db.collection.find({name:"Marc"})
the mongos process will only talk to the shard containing the
document(s) which match the query. However, if you shard on the key
"name" and do a query on the "_id" key all shards will be hit, because
the mongos will not know which shard each specific _id resides on.

In conclusion, it is important to choose a shard key wisely such that
writes will be evenly distributed, and it is strongly discouraged to
try to force which shard each new document is written to yourself.
Hopefully the above has provided you with some insight about how
sharding works. If you have any questions regarding choosing a shard
key or anything else, the MongoDB community is here to help!

Good Luck!

Reply all

Reply to author

Forward