Hello. In a nutshell, it is important to choose a shard key whose
value will not always be greater than or less than that of the
document that preceded it, and has enough granularity to be divided
among many shards as your application grows.
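To make the first point concrete, here is a minimal sketch in Python. It is not MongoDB code: the boundaries, the helper names, and the four key ranges are all made up for illustration, and each range simply stands in for a chunk living on a different shard. It shows that ever-increasing keys all land in the last range, while keys that jump around spread out.

```python
import bisect
from collections import Counter

# Hypothetical upper bounds of three key ranges; a fourth range is open-ended.
BOUNDARIES = [250, 500, 750]

def range_for(key):
    """Return the index of the key range that a key falls into."""
    return bisect.bisect_right(BOUNDARIES, key)

def distribution(keys):
    """Count how many keys land in each range."""
    return Counter(range_for(k) for k in keys)

# Ever-increasing keys (a counter or timestamp, say) that have already
# passed the last boundary: every insert hits the same range.
ascending = list(range(1000, 1100))

# A deterministic stand-in for keys that are not always greater than
# the previous one: they wander across the whole key space.
scattered = [(i * 37) % 1000 for i in range(100)]

print(distribution(ascending))
print(distribution(scattered))
```

In a real cluster the last chunk would eventually split and migrate, but the newest inserts would still always go to whichever shard holds the top of the range.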
Here are some good resources for getting started with sharding.
Forgive me if you have already read some of them:
The Mongo Document "Choosing a Shard Key" provides good details on
what to consider when choosing a shard key:
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
I also recommend reading "Scaling MongoDB" by my colleague Kristina
Chodorow:
http://shop.oreilly.com/product/0636920018308.do
It provides an excellent introduction to sharding, as well as more
detailed explanations of the dos and don'ts of choosing shard keys.
Kristina also has a blog post which gives a real-world example of
choosing shard keys. The title is: "How to Choose a Shard Key: The
Card Game"
http://www.snailinaturtleneck.com/blog/2011/01/04/how-to-choose-a-shard-key-the-card-game/
For further reading, please see the Mongo Document on Sharding:
http://www.mongodb.org/display/DOCS/Sharding
This page contains links to many useful documents on the subject,
including Frequently Asked Questions
http://www.mongodb.org/display/DOCS/Sharding+FAQ,
more advanced reading on sharding internals:
http://www.mongodb.org/display/DOCS/Sharding+Internals, and links to
presentations and other materials on the subject.
Finally, there have been several similar questions asked on this forum
on the subject:
http://groups.google.com/group/mongodb-user/search?hl=en_US&group=mongodb-user&q=choosing+shard+key
Now on to your questions:
Choosing a random shard key is appealing at face value because inserts
will go to different shards, just about guaranteeing that a hot-spot
can never form. However, querying documents would become less
efficient. It is best to choose a shard key that your application
will query on. This is probably best explained by example: Imagine
you are storing user names, with documents like the following (I have
excluded _id for brevity):
{random_number:12345, username:marc}
If I shard on username, and search for "marc", mongos will know
exactly which shard the document containing {username:marc} is stored
on, and query only that shard. If I shard on random_number, and query
for {username:marc}, mongos will have to send that query to all
shards. There is more on this in Kristina's "Card Game" link above.
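The routing difference described above can be sketched in a few lines of Python. This is only a toy model of what mongos does, not its actual implementation: the shard names, the split points, and the shards_to_query helper are hypothetical, and only equality queries are handled.

```python
import bisect

SHARDS = ["shard0", "shard1", "shard2"]

def shards_to_query(query, shard_key, split_points):
    """Return the shards a router would have to contact for an
    equality query, given the shard key and its split points."""
    if shard_key in query:
        # The query includes the shard key, so the key range (and
        # therefore the shard) holding the document can be looked up.
        idx = bisect.bisect_right(split_points, query[shard_key])
        return [SHARDS[idx]]
    # The query says nothing about the shard key: ask every shard.
    return list(SHARDS)

query = {"username": "marc"}

# Sharded on username, with made-up split points "h" and "q".
targeted = shards_to_query(query, "username", ["h", "q"])

# Sharded on random_number, with made-up split points 30000 and 60000.
broadcast = shards_to_query(query, "random_number", [30000, 60000])

print(targeted)
print(broadcast)
```

With username as the shard key the query touches a single shard. With random_number as the shard key the same query has to be sent to all three.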
If you have to shard on an MD5 hash (and the recommendation is that
you do not), then the short version is that it doesn't really matter
whether the hashes are stored as strings or as binary data. There are
a few things to take into consideration:
An MD5 hash stored as a string will take up 37 bytes: 32 bytes for
the hexadecimal value, 4 bytes for its length, and 1 null byte at the
end.
Stored as binary data, it will take up 21 bytes: 16 bytes for the
value, 4 bytes for its length, and 1 byte for the subtype. (See
http://bsonspec.org for more information.)
In the grand scheme of things you probably won't see much difference
in your application if the shard key requires an extra 16 bytes.
However, if you want to print out the MD5 tags in your application,
there will be overhead converting them into strings if they are not
already stored that way. This may tip the scales in favor of storing
the MD5 values as strings.
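If you want to check the sizes yourself, here is a rough sketch that builds the two value encodings by hand, following the layouts given at bsonspec.org. It uses only the Python standard library rather than a MongoDB driver, the helper names are my own, and the input b"marc" is just sample data. The element type byte and the field name are left out because they are the same size in both cases.

```python
import hashlib
import struct

def bson_string_value(text):
    """Encode a BSON string value: int32 length, UTF-8 bytes, null byte."""
    data = text.encode("utf-8") + b"\x00"
    return struct.pack("<i", len(data)) + data

def bson_binary_value(raw, subtype=0x05):
    """Encode a BSON binary value: int32 length, subtype byte, raw bytes.
    Subtype 0x05 is the one the BSON spec assigns to MD5."""
    return struct.pack("<i", len(raw)) + bytes([subtype]) + raw

digest = hashlib.md5(b"marc").digest()  # 16 raw bytes
hex_text = digest.hex()                 # 32 hexadecimal characters

string_size = len(bson_string_value(hex_text))
binary_size = len(bson_binary_value(digest))

print("as string:", string_size, "bytes")
print("as binary:", binary_size, "bytes")
```

Converting between the two forms is a single call in each direction here (digest.hex() and bytes.fromhex(hex_text)), so the conversion overhead is a matter of how often your application has to do it.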
Finally, if the MD5 hashes are stored as strings, prepending a letter
to them would probably not make much of a difference. The chunks
could still be split.
Hopefully this will provide you with a better understanding of
sharding and factors to take into consideration when choosing a shard
key. Good luck!