sharding imbalance

78 views
Skip to first unread message

Simeon Zaharici

unread,
Oct 24, 2011, 11:41:31 AM10/24/11
to mongodb-user
Hello

I have a mongodb sharded cluster that is used as a backend for
Graylog2, a log management solution. All the components are 2.0.0
The log data is sent into a big messages collection which is sharded
using the "_id" field. At this point we are using two shards and data
is distributed evenly across the two, they have the same number of
data chunks.. However the primary shard node seems to be receiving all
the writes before they are distributed to the other shard. The cacti
graphs on the primary node show big spikes in Mongo inserts on shard 1
whereas there is little activity on shard two. Consequently the load
is a lot higher on the first shard primary node and almost all the
queries are slow Is this happening because we only have a mongos
process on the log server that is doing the writes ?

Thanks,

Marc

unread,
Oct 24, 2011, 3:07:51 PM10/24/11
to mongodb-user
_id should never be chosen as the shard key, because it is an
ascending value. As new documents are created, each will have a
higher value shard key, and all new documents will be written to the
same shard, creating a hotspot, just like you are observing. When the
shard that is constantly being written to grows significantly larger
than the other shard, the Balancer will kick in and redistribute the
chunks, causing extra read and write operations.

It is important to choose a shard key whose value will not always be
greater than or less than that of the document that preceded it, and
has enough granularity to be divided among many shards as your
application grows.

The Mongo "Sharding Introduction" document has more details on how
documents are stored in a sharded collection.
http://www.mongodb.org/display/DOCS/Sharding+Introduction

Additionally, the Mongo Document "Choosing a Shard Key" provides good
details on what to consider when choosing a shard key.
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key

Unfortunately, there is no quick and easy way to change a shard key
once a collection has been sharded. The Mongo document on the subject
recommends creating a new pre-sharded collection and import your
documents into it.
http://www.mongodb.org/display/DOCS/Changing+a+Shard+Key

If you can give some more details about the information that you are
storing in your documents, and what types of reads and writes your
application will be doing the most, the MongoDB community can provide
some additional advice on choosing a new shard key.

Simeon Zaharici

unread,
Oct 24, 2011, 4:12:13 PM10/24/11
to mongodb-user
Thanks a lot for the reply, I was starting to suspect this might be
the cause.
The reason we chose the _id key is that it was the only field that
would provide an even distribution of the log messages. Here is an
example document from the messages collection

{ "_id" : ObjectId("4ea26dede4b0bebba6fc5c7a"), "message" :"log text
message", "full_message" : null, "file" : null, "line" : 0, "host" :
"host1", "facility" : "local7", "level" : 6, "created_at" :
1319267819, "deleted" : false, "streams" :
[ ObjectId("4e84c69459305b16a300000e") ] }

Placing the sharding key on the host field will create some imbalance
because some hosts send more logs than others and facility, level,
stream or created_at fields will also never be balanced.
What would you guys suggest as a sharding key in this situation ?
The application is constantly inserting logs in the collection and
there is a web component that does ad-hoc find operations based mostly
on indexed fields such as created_at, host and "streams".
Thanks,


On Oct 24, 3:07 pm, Marc <m...@10gen.com> wrote:
> _id should never be chosen as the shard key, because it is an
> ascending value.  As new documents are created, each will have a
> higher value shard key, and all new documents will be written to the
> same shard, creating a hotspot, just like you are observing.  When the
> shard that is constantly being written to grows significantly larger
> than the other shard, the Balancer will kick in and redistribute the
> chunks, causing extra read and write operations.
>
> It is important to choose a shard key whose value will not always be
> greater than or less than that of the document that preceded it, and
> has enough granularity to be divided among many shards as your
> application grows.
>
> The Mongo "Sharding Introduction" document has more details on how
> documents are stored in a sharded collection.http://www.mongodb.org/display/DOCS/Sharding+Introduction
>
> Additionally, the Mongo Document "Choosing a Shard Key" provides good
> details on what to consider when choosing a shard key.http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
>
> Unfortunately, there is no quick and easy way to change a shard key
> once a collection has been sharded.  The Mongo document on the subject
> recommends creating a new pre-sharded collection and import your
> documents into it.http://www.mongodb.org/display/DOCS/Changing+a+Shard+Key

Marc

unread,
Oct 25, 2011, 12:11:14 PM10/25/11
to mongodb-user
Your data structure does present a tricky situation to shard.
Ideally, the shard key will be a value that can be divisible into many
pieces and also does not increment.

"host", "facility", and "level" all sound like they have finite
values. Therefore, they would not make good shard keys. The pitfall
is, as you pointed out, if one of the shard keys gets used
disproportionally to the others, a giant, immovable chunk will be
created. Additionally, if the size of your application ever requires
more servers than shard keys, you are left with very limited
horizontal scalability.

In a situation like this, some people might be temped to create a
separate field containing a random value and using that for a shard
key. This should be avoided. Firstly, this would be wasting an index
and adding extra overhead to each read operation. Secondly, the shard
key should be a value that is commonly queried on. If the shard key
is random, then each shard will have to be read from on each query,
because Mongos will not be able to know where any of the information
is.

Fortunately, Mongo supports compound shard keys. Generically, a good
solution is to combine a coarsely ascending key plus a search key.
This provides the benefit of having all of the most recent data in
memory as well as an even distribution.

One possibility would be to couple _id, or created_at (which I also
believe is an ascending key?), with whichever of your other keys has
the most divisibility in values.

A colleague of mine suggested creating a non-increasing hash of the
_id (call it hash_id for the purposes of this example) and sharding on
that. This hash_id would be an optimal shard key as it will
distribute writes to all shards effectively equally, no balancing
initially required. The possible drawback of this approach is if the
cluster size is changed, balancing will need to take place, and will
create somewhat weird distributions. Also, every non-hash_id query
will hit every shard, so for data mining the performance will not be
ideal.

My colleague went on to recommend choosing the middle ground between
write performance and querying speed, and creating a compound shard
key on your search key (which I believe to be "host"?) and the
hash_id, mentioned above; Something like : { host : 1, hash_id : 1 }.
This has the advantage that writes will generally go to a number of
shards for a number of hosts (though if only one host is up, for
example, all writes will hit a single shard). If a host begins to
have too much data, the host chunk can be split and migrated via _id
(which increases), so you could have N chunks being written on N
shards. If you then query like : { host : A, hash_id : #, _id : X },
the query itself will be fast as it will go to only a single shard,
and then use the _id index on that shard.

Hopefully this answer has provided you with some additional insight on
how sharding works in MongoDB and has given you a good frame of
reference for choosing your new sharding arrangement. If possible, I
recommend reading Scaling MongoDB by Kristina Chodorow.
http://shop.oreilly.com/product/0636920018308.do This provides an
excellent introduction to sharding as well as more detailed
explanations on the Dos and Donts of choosing shard keys that I
mentioned above.
Reply all
Reply to author
Forward
0 new messages