shard key analysis


DBTesterOuter

Mar 2, 2012, 11:58:34 AM
to mongodb-user
Can anyone point me to good, in-depth (as nitty gritty as possible)
documentation on choosing a shard key?

I have read some documentation with regards to choosing a shard key
and the results of the choices.

What I do not understand is why choosing a truly random shard key
(such as MD5 or Java UUID) could be
bad. I would think that randomness would be the goal of a good shard
key.

Also, if a random shard key like MD5 is good, then is it good practice
to transform it to a readable string form representing the MD5's hex
value?

In addition, I have read some things regarding how sharding can go
bad. For example, you can keep hitting the same
shard (or possibly the same chunk) if you have a random key like MD5
but prepend a constant string.

Say for example, if I prepended the letter F to an MD5 before
inserting in Mongo (and this would also be my shard key), how would
this affect sharding? Especially in the case where I am doing many,
many inserts in a very short period of time using this scheme.

Thanks.

Marc

Mar 5, 2012, 3:42:00 PM
to mongodb-user
Hello. In a nutshell, it is important to choose a shard key whose
values do not steadily increase or decrease from one inserted
document to the next, and which has enough granularity to be divided
among many shards as your application grows.
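
To make the first point concrete, here is a rough sketch in Python. It is a toy model, not how mongos actually works: the split points, the chunk numbering and the helper names are all made up for illustration, and a real cluster splits and migrates chunks over time. It just counts which key range each insert falls into.

```python
import bisect
import hashlib
from collections import Counter


def chunk_for(key, split_points):
    """Return the index of the key range (chunk) that contains key."""
    return bisect.bisect_right(split_points, key)


def insert_distribution(keys, split_points):
    """Count how many inserts land in each chunk."""
    return Counter(chunk_for(k, split_points) for k in keys)


# Case 1: an ever-increasing key (a counter or timestamp).
# Hypothetical split points; every new key is larger than the last one.
increasing = insert_distribution(range(3000, 4000), [1000, 2000, 3000])

# Case 2: the same counters, hashed with MD5 and stored as hex strings.
# Hypothetical split points on the first hex character.
hashed_keys = [hashlib.md5(str(i).encode()).hexdigest() for i in range(3000, 4000)]
hashed = insert_distribution(hashed_keys, ["4", "8", "c"])

print("increasing key:", dict(increasing))
print("hashed key:    ", dict(sorted(hashed.items())))
```

In this model all 1000 inserts with the increasing key land in the last chunk, while the hashed keys are spread over all four ranges.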

Here are some good resources for getting started with sharding.
Forgive me if you have already read some of them:

The Mongo Document "Choosing a Shard Key" provides good details on
what to consider when choosing a shard key:
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key

I also recommend reading "Scaling MongoDB" by my colleague Kristina
Chodorow.
http://shop.oreilly.com/product/0636920018308.do This provides an
excellent introduction to sharding as well as more detailed
explanations on the dos and don'ts of choosing shard keys.

Kristina also has a blog post which gives a real-world example of
choosing shard keys. The title is: "How to Choose a Shard Key: The
Card Game"
http://www.snailinaturtleneck.com/blog/2011/01/04/how-to-choose-a-shard-key-the-card-game/

For further reading, please see the Mongo Document on Sharding:
http://www.mongodb.org/display/DOCS/Sharding
This page contains links to many useful documents on the subject,
including Frequently Asked Questions http://www.mongodb.org/display/DOCS/Sharding+FAQ,
more advanced reading on sharding internals:
http://www.mongodb.org/display/DOCS/Sharding+Internals, and links to
presentations and other materials on the subject.

Finally, there have been several similar questions asked on this forum
on the subject: http://groups.google.com/group/mongodb-user/search?hl=en_US&group=mongodb-user&q=choosing+shard+key

Now on to your questions:

Choosing a random shard key is appealing at face value because inserts
will go to different shards, just about guaranteeing that a hot-spot
can never form. However, querying documents would become less
efficient. It is best to choose a shard key that your application
will query on. This is probably best explained by example: Imagine
you are storing user names, with documents like the following (I have
excluded _id for brevity):
{random_number:12345, username:marc}
If I shard on username, and search for "marc", mongos will know
exactly which shard the document containing {username:marc} is stored
on, and query only that shard. If I shard on random_number, and query
for {username:marc}, mongos will have to send that query to all
shards. There is more on this in Kristina's "Card Game" link above.
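
The routing difference can be sketched in a few lines of Python. Again, this is only a toy model under assumptions of my own: the shard names, the split points and the function shards_for_query are hypothetical, and it only handles simple equality queries.

```python
import bisect

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def shards_for_query(query, shard_key, split_points, shards):
    """Return the shards a router would have to contact for an equality query."""
    if shard_key in query:
        # The query contains the shard key, so exactly one key range matches.
        chunk = bisect.bisect_right(split_points, query[shard_key])
        return [shards[chunk]]
    # The query says nothing about the shard key: every shard must be asked.
    return list(shards)


query = {"username": "marc"}

# Sharded on username (hypothetical split points "h" and "p").
by_username = shards_for_query(query, "username", ["h", "p"], SHARDS)

# Sharded on random_number (hypothetical split points 10000 and 20000).
by_random = shards_for_query(query, "random_number", [10000, 20000], SHARDS)

print(by_username)
print(by_random)
```

With the collection sharded on username the query goes to a single shard; sharded on random_number, the same query has to be sent to all three.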

If you have to shard on an MD5 hash (and the recommendation is that
you do not), then the short version is that it doesn't really matter
whether the hashes are stored as strings or not. There are a few
things to take into consideration:

An MD5 hash stored as a string will take up 37 bytes: 32 bytes for
the hexadecimal value, 4 bytes for its length, and 1 null byte at the
end.

Stored as binary data, it will take up 21 bytes: 16 bytes for the
value, plus an extra 5 bytes (4 for the length and 1 for the
subtype). (see http://bsonspec.org for more information)
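
If you want to check the two sizes yourself, the value encodings can be built by hand following the layout described at bsonspec.org. This is just a sketch: the helper names are mine, the input b"example" is arbitrary, and the sizes cover only the value, not the element's type byte or field name.

```python
import hashlib
import struct

digest = hashlib.md5(b"example").digest()  # 16 raw bytes
hex_text = digest.hex()                    # 32 hexadecimal characters


def bson_string_value(text):
    """BSON string value: int32 length, UTF-8 bytes, trailing null byte."""
    data = text.encode("utf-8") + b"\x00"
    return struct.pack("<i", len(data)) + data


def bson_binary_value(raw, subtype=5):
    """BSON binary value: int32 length, 1-byte subtype (5 is MD5), raw bytes."""
    return struct.pack("<i", len(raw)) + bytes([subtype]) + raw


string_size = len(bson_string_value(hex_text))  # 4 + 32 + 1
binary_size = len(bson_binary_value(digest))    # 4 + 1 + 16

print(string_size, binary_size)
```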

In the grand scheme of things you probably won't see much difference
in your application if the shard key requires an extra 16 bytes.
However, if you want to print out the MD5 tags in your application,
there will be overhead converting them into strings if they are not
already stored that way. This may tip the scales in favor of storing
the MD5 values as strings.

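Finally, if the MD5 hashes are stored as strings, prepending a letter
to them would probably not make much of a difference. Every key would
share the same first character, so the keys would still be ordered by
the hash characters that follow it, and chunks would still be able to
be split.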
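
A quick way to convince yourself is to compare the sort order with and without the prefix. This is only an illustration using plain Python string comparison on made-up input values; it does not touch MongoDB at all.

```python
import hashlib

# MD5 hex digests of some arbitrary inputs, with and without a constant prefix.
hashes = [hashlib.md5(str(i).encode()).hexdigest() for i in range(1000)]
prefixed = ["F" + h for h in hashes]

# Sort the document positions by each form of the key.
order_plain = sorted(range(len(hashes)), key=lambda i: hashes[i])
order_prefixed = sorted(range(len(prefixed)), key=lambda i: prefixed[i])
same_order = order_plain == order_prefixed

# A split point in the middle still divides the prefixed keys in half.
median_split = sorted(prefixed)[len(prefixed) // 2]
below = sum(1 for p in prefixed if p < median_split)

print(same_order, below, len(prefixed) - below)
```

The relative order is unchanged by the prefix, and a split point taken from the middle of the sorted keys still leaves half of them on each side.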

Hopefully this will provide you with a better understanding of
sharding and factors to take into consideration when choosing a shard
key. Good luck!