Distribution of natively hashed keys across shards


Jameson Lopp

Oct 14, 2015, 10:47:12 AM
to mongodb-user
I'm running an import job on a sharded mongo 3.0 cluster and am running into a few issues:

1) I'm sharding several collections on the _id field, which is already a hash. Thus I'm not telling mongo to hash the _id. However, my inserts end up all being sent to one shard and then have to be moved by the balancer, which slows down the import significantly. Any ideas on why this is the case? 

How does mongo determine the initial key ranges for each shard? I could see this behavior occurring if the initial key ranges cover the entire set of unicode characters and since my _id hashes are all hex, they would only fall into the first bucket... do I need to manually set the key ranges somehow?

2) As a side effect of the first problem, one portion of my import job adds data so quickly that the balancer can't keep up and the very first chunk split fails, resulting in the entire collection being stored into a single chunk.


Based upon my reading of the documentation and posts on this forum, it sounds like I may be able to fix all the problems by simply changing the sharding to use a hashed index; I guess this automatically sets the key ranges appropriately. But from a performance standpoint, wouldn't it be better for me to not require mongo to perform additional hashing and maintain the additional index?

Wan Bachtiar

Nov 11, 2015, 12:19:47 AM
to mongodb-user

Hi Jameson,

MongoDB uses ObjectIds as the default value for the _id field if one is not specified. The most significant bits of an ObjectId encode a timestamp, which means the values increment in a regular and predictable pattern; they are not hashed values.
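To illustrate (a plain JavaScript sketch, not the driver's actual ObjectId class; the hex string below is a made-up example), the timestamp can be read straight out of an ObjectId's hex representation, since its first four bytes are seconds since the Unix epoch:

```javascript
// Sketch: extract the embedded timestamp from an ObjectId hex string.
// The first 4 bytes (8 hex characters) are seconds since the Unix epoch,
// which is why consecutively generated ObjectIds sort in insertion order
// and pile up on one shard when used as a range-based shard key.
function objectIdTimestamp(hex) {
  return new Date(parseInt(hex.substring(0, 8), 16) * 1000);
}

console.log(objectIdTimestamp("561e4c80a1b2c3d4e5f60718").toISOString());
```

Two ObjectIds generated moments apart share nearly identical leading bytes, so a range-partitioned shard key built on them always targets the same "hot" chunk.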

If you supply your own _id values that are already hashed (or any other pre-hashed field you want to shard on), you can use that field as your shard key directly instead of asking the server to compute the hash for you.

That said, letting MongoDB create the hashed index for you has an advantage. If you shard an empty collection using a hashed shard key, MongoDB automatically creates and migrates chunks so that each shard ends up with two chunks. You can control how many chunks MongoDB creates by passing the numInitialChunks parameter to shardCollection.
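From the mongo shell that looks roughly like the following (the database and collection names here are placeholders; this is a sketch that assumes a running sharded cluster):

```javascript
// Enable sharding on the database, then shard the still-empty collection
// on a hashed _id, asking for 8 initial chunks spread across the shards.
sh.enableSharding("mydb")
db.adminCommand({
    shardCollection: "mydb.mycoll",
    key: { _id: "hashed" },
    numInitialChunks: 8
})
```

Note that numInitialChunks only applies when the shard key is hashed and the collection is empty.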

For example, even before there are any records in the collection, sharding by a hashed index gives you:

      shard key: { "_id": "hashed" }
      chunks:
        shard01: 2
        shard02: 2
        { "_id": { "$minKey" : 1 } } -> { "_id": NumberLong("-4611686018427387902") } on: shard01 Timestamp(2, 2) 
        { "_id": NumberLong("-4611686018427387902") } -> { "_id": NumberLong("0") } on: shard01 Timestamp(2, 3) 
        { "_id": NumberLong("0") } -> { "_id": NumberLong("4611686018427387902") } on: shard02 Timestamp(2, 4) 
        { "_id": NumberLong("4611686018427387902") } -> { "_id": { "$maxKey" : 1 } } on: shard02 Timestamp(2, 5)

Alternatively, if you are using your own hashed values, you can manually pre-split the chunks of an empty collection before importing any data. You can pre-split on both hashed and non-hashed values. This way, the impact of the balancer would only be felt when you add a new shard (and the cluster must rebalance).
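As a rough sketch of pre-splitting on your own hex-string hashes (collection name, split points, and shard names are all made up for illustration):

```javascript
// Pre-split the empty collection at chosen shard-key values so the
// import writes into several chunks instead of one.
sh.splitAt("mydb.mycoll", { _id: "40000000000000000000000000000000" })
sh.splitAt("mydb.mycoll", { _id: "80000000000000000000000000000000" })
sh.splitAt("mydb.mycoll", { _id: "c0000000000000000000000000000000" })

// Optionally move the empty chunks across shards up front, so the
// balancer has nothing left to do once the import starts.
sh.moveChunk("mydb.mycoll", { _id: "80000000000000000000000000000000" }, "shard02")
```

Moving chunks while they are still empty is cheap; once the import is running, migrations have to copy real documents.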

Regards,

Wan.
