Your data structure is indeed tricky to shard.
Ideally, the shard key is a field with many distinct values, so the
data can be split into many chunks, and one that does not increase
monotonically.
"host", "facility", and "level" all sound like they take a small,
fixed set of values. Therefore, they would not make good shard keys
on their own. The pitfall is, as you pointed out, that if one shard
key value gets used disproportionately to the others, a giant,
immovable chunk will be created. Additionally, if the size of your
application ever requires more servers than there are distinct shard
key values, you are left with very limited horizontal scalability.
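To make the cardinality problem concrete, here is a minimal sketch in plain JavaScript (no MongoDB needed). The sample entries and the helper names countByKey and maxChunks are made up for illustration. It only counts distinct values, on the assumption that documents sharing one shard key value cannot be split into separate chunks.

```javascript
// Hypothetical log entries; field names follow the question, values are made up.
const entries = [
  { host: "web1", level: "info" },
  { host: "web1", level: "info" },
  { host: "web2", level: "info" },
  { host: "web1", level: "error" },
  { host: "web3", level: "info" },
  { host: "web1", level: "info" },
];

// Count how many documents share each value of a candidate shard key.
function countByKey(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  return counts;
}

// Documents with the same shard key value cannot be split apart, so the
// number of distinct values is an upper bound on the number of chunks.
function maxChunks(docs, field) {
  return countByKey(docs, field).size;
}

console.log(maxChunks(entries, "level")); // 2
console.log(maxChunks(entries, "host"));  // 3
console.log(countByKey(entries, "level").get("info")); // 5
```

With "level" as the key, this toy data set could never occupy more than two chunks, and five of the six documents would sit in the same one.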
In a situation like this, some people might be tempted to create a
separate field containing a random value and use that as the shard
key. This should be avoided. Firstly, this would be wasting an index
and adding extra overhead to each read operation. Secondly, the shard
key should be a value that is commonly queried on. If the shard key
is random, then every shard will have to be read on every query,
because mongos will have no way of knowing where any of the
information is.
Fortunately, MongoDB supports compound shard keys. Generally, a good
solution is to combine a coarsely ascending key with a search key.
This provides the benefit of having all of the most recent data in
memory as well as an even distribution.
One possibility would be to couple _id, or created_at (which I also
believe is an ascending key?), with whichever of your other keys has
the most distinct values.
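As a rough illustration of how such a compound key orders data, the sketch below sorts a few made-up entries by a two-part key. The monthBucket helper and the idea of bucketing created_at by month are my own assumptions to make the "coarsely ascending" part visible; they are not something your schema already has. In the mongo shell the key itself would be declared with sh.shardCollection, which this standalone snippet does not call.

```javascript
// Hypothetical log entries; values are made up for illustration.
const entries = [
  { created_at: new Date("2012-03-05T10:00:00Z"), host: "web2" },
  { created_at: new Date("2012-02-20T08:00:00Z"), host: "web1" },
  { created_at: new Date("2012-03-01T09:00:00Z"), host: "web1" },
  { created_at: new Date("2012-02-11T12:00:00Z"), host: "web3" },
];

// Assumed coarse bucket: year and month taken from created_at, e.g. "2012-03".
function monthBucket(date) {
  return date.toISOString().slice(0, 7);
}

// Compound key value: coarsely ascending part first, search key second.
function shardKey(entry) {
  return [monthBucket(entry.created_at), entry.host];
}

// Compare compound keys field by field, left to right.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

const ordered = entries.map(shardKey).sort(compareKeys);
console.log(ordered);
```

Entries from the same month end up next to each other, and within a month they are spread out by host, which is what lets recent data stay together while still being divisible.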
A colleague of mine suggested creating a hash of the _id that does
not increase along with it (call it hash_id for the purposes of this
example) and sharding on that. This hash_id would be an optimal shard
key for writes, as it will distribute them to all shards effectively
equally, with no balancing initially required. The possible drawback
of this approach is that if the cluster size is changed, balancing
will need to take place, and the resulting distribution may be
somewhat uneven. Also, every query that does not include hash_id
will hit every shard, so for data mining the performance will not be
ideal.
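A minimal sketch of what computing such a hash_id might look like follows. The hash function here (32-bit FNV-1a) is only a stand-in I picked because it fits in a few lines; my colleague did not name a specific hash, and any stable hash should serve the same purpose. The _id strings are made up and stand in for whatever your real _id values look like.

```javascript
// Stand-in hash (32-bit FNV-1a); which hash to use is an assumption.
function hashId(id) {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value an unsigned 32-bit int
  }
  // Zero-padded hex so that string order matches numeric order.
  return h.toString(16).padStart(8, "0");
}

// Hypothetical, steadily increasing _id values.
const ids = ["id-0001", "id-0002", "id-0003", "id-0004"];

// Store the hash next to the _id at insert time so it can be part of the shard key.
const docs = ids.map((id) => ({ _id: id, hash_id: hashId(id) }));

for (const doc of docs) {
  console.log(doc._id, "->", doc.hash_id);
}
```

The point is only that consecutive _id values should no longer land next to each other in key order, so new writes are not all aimed at the same chunk.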
My colleague went on to recommend choosing a middle ground between
write performance and query speed: a compound shard key on your
search key (which I believe to be "host"?) and the hash_id mentioned
above, something like { host : 1, hash_id : 1 }.
This has the advantage that writes will generally go to a number of
shards for a number of hosts (though if only one host is up, for
example, all writes will hit a single shard). If a host begins to
have too much data, the host chunk can be split and migrated via _id
(which increases), so you could have N chunks being written on N
shards. If you then query with something like
{ host : A, hash_id : #, _id : X }, the query itself will be fast, as
it goes to only a single shard and then uses the _id index on that
shard.
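To show why including the shard key fields matters, here is a toy model of how a router could pick shards for a { host : 1, hash_id : 1 } key. Everything in it is hypothetical: the chunk bounds, the shard names, and the targetShards helper are made up, and plain strings stand in for the real minimum and maximum key markers. It is a sketch of the idea, not of how mongos is actually implemented.

```javascript
// Hypothetical chunk table for shard key { host: 1, hash_id: 1 }.
// Each chunk covers [min, max) in compound-key order.
const LOW = "";        // stand-in that sorts before every value used here
const HIGH = "\uffff"; // stand-in that sorts after every value used here
const chunks = [
  { min: [LOW, LOW],     max: ["hostA", "8"], shard: "shard0" },
  { min: ["hostA", "8"], max: ["hostB", LOW], shard: "shard1" },
  { min: ["hostB", LOW], max: [HIGH, HIGH],   shard: "shard2" },
];

// Compare compound keys field by field, left to right.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Return the shards a query must visit. Without the leading "host" field
// the key range is unknown, so every shard has to be asked.
function targetShards(query) {
  let hit = chunks;
  if (query.host !== undefined) {
    const hasHash = query.hash_id !== undefined;
    const lo = [query.host, hasHash ? query.hash_id : LOW];
    const hi = [query.host, hasHash ? query.hash_id : HIGH];
    // Keep chunks whose [min, max) range overlaps the closed range [lo, hi].
    hit = chunks.filter(
      (c) => compareKeys(c.min, hi) <= 0 && compareKeys(lo, c.max) < 0
    );
  }
  return [...new Set(hit.map((c) => c.shard))];
}

console.log(targetShards({ host: "hostA", hash_id: "3" })); // [ 'shard0' ]
console.log(targetShards({ host: "hostA" }));               // [ 'shard0', 'shard1' ]
console.log(targetShards({ hash_id: "3" }));                // all three shards
```

In this model a query with both fields goes to one shard, a query on host alone goes only to the shards holding that host's chunks, and a query without host has to ask everyone.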
Hopefully this answer has provided you with some additional insight
into how sharding works in MongoDB and has given you a good frame of
reference for choosing your new sharding arrangement. If possible, I
recommend reading Scaling MongoDB by Kristina Chodorow. The book
provides an excellent introduction to sharding, as well as more
detailed explanations of the dos and don'ts of choosing shard keys
that I mentioned above.
http://shop.oreilly.com/product/0636920018308.do