Size of a chunk in the case of a hashed shard key.

Xaltos

Apr 9, 2015, 8:08:57 AM4/9/15
to mongod...@googlegroups.com
Hello all,

I have a problem with the balancing of a big sharded MongoDB system.

My current setup:
    1 big collection with ~1.7G documents
    7 shards, shard key: { "_id" : "hashed" }
    ~700 chunks
    MongoDB version: 2.4

The balancing of the chunks worked well, so each shard has ~100 chunks.
But the number of documents on the 7 shards is very uneven.
The biggest has nearly 10 times more documents than the smallest one.

  shard_1   62M documents
  shard_2  581M documents    <-- too many
  shard_3  178M documents
  shard_4  636M documents    <-- too many
  shard_5   69M documents
  shard_6   99M documents
  shard_7   68M documents

 
I already found some hints about splitting chunks in combination with the bounds parameter,
but I have trouble finding the current size of the different chunks.

The command sh.status({verbose:true}) shows me the ranges of the chunks.

e.g.
  { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-9220724471370054775") } on : rs_shard-1 Timestamp(174, 10)
  { "_id" : NumberLong("-9220724471370054775") } -->> { "_id" : NumberLong("-9217729578432084058") } on : rs_shard-1 Timestamp(174, 11)
   …

From my understanding, the min/max values are the hashed _id numbers,
and I can't use these numbers inside a normal .find().count() command for a deeper analysis.

Can I use the hashed numbers somehow inside a query?
Or can I access the raw index data somehow?

Regards
    Xaltos

Ketul

Apr 9, 2015, 9:07:16 AM4/9/15
to mongod...@googlegroups.com

Hi,

Are all the shards on a single machine?

Stephen Steneker

Apr 9, 2015, 9:24:32 AM4/9/15
to mongod...@googlegroups.com
On Thursday, 9 April 2015 22:08:57 UTC+10, Xaltos wrote:
I have a problem with the balancing of a big sharded MongoDB system.

My current setup:
    1 big collection with ~1.7G documents
    7 shards, shard key: { "_id" : "hashed" }
    ~700 chunks
    MongoDB version: 2.4

The balancing of the chunks worked well, so each shard has ~100 chunks.

Hi Xaltos,

The balancing algorithm is based on chunks, so it sounds like things are working as expected at a high level:
  http://docs.mongodb.org/manual/core/sharding-balancing/

But the number of documents on the 7 shards is very uneven.
The biggest has nearly 10 times more documents than the smallest one.

The balancer uses chunks as a proxy for a range of documents; by default each chunk represents about 64 MB of data.
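In case it's useful, you can confirm the configured maximum chunk size from a mongos; a quick sketch against the config database (if no document comes back, the cluster is using the 64 MB default):

  // Check the configured max chunk size (in MB) from a mongos.
  // No result means the cluster is using the 64 MB default.
  db.getSiblingDB("config").settings.find({ _id: "chunksize" })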

There are a few reasons to have differing numbers of documents per shard, but given the significant differential I suspect this may be the result of deleting large ranges of data leaving chunk ranges containing few (or no) documents.

FYI, MongoDB 2.6+ includes a mergeChunks command which you can use to combine contiguous chunk ranges on a shard: http://docs.mongodb.org/manual/reference/command/mergeChunks/. This isn't helpful given you are still on 2.4, but might be a reason to upgrade if it turns out that you have a lot of sparsely populated or empty chunks.
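For reference once you are on 2.6+, a mergeChunks invocation for contiguous ranges on the same shard would look roughly like this; the namespace is a placeholder, and the bounds here just reuse the boundary values from your sh.status() output:

  // Hypothetical 2.6+ example: merge contiguous chunk ranges on one shard.
  // "mydb.mycoll" is a placeholder namespace; bounds must be existing chunk boundaries
  // covering the whole range to merge (here: the first two chunks shown above).
  db.adminCommand({
    mergeChunks: "mydb.mycoll",
    bounds: [ { "_id" : MinKey },
              { "_id" : NumberLong("-9217729578432084058") } ]
  })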

If you want to estimate (or actually calculate) the size of documents in each chunk range, there's a script you can try in Adam's response to a similar question on DBA StackExchange: http://dba.stackexchange.com/questions/52416/.
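If you just want to spot-check a single chunk range by hand rather than run the whole script, the approach boils down to the dataSize command; a minimal sketch, run against the database that holds the collection, with a placeholder namespace and the bounds copied from your sh.status() output:

  // Estimate size and document count for one chunk range.
  // "mydb.mycoll" is a placeholder; min/max are the chunk bounds from sh.status().
  db.runCommand({
    dataSize: "mydb.mycoll",
    keyPattern: { "_id" : "hashed" },
    min: { "_id" : NumberLong("-9220724471370054775") },
    max: { "_id" : NumberLong("-9217729578432084058") },
    estimate: true
  })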

Regards,
Stephen

Xaltos

Apr 10, 2015, 5:56:57 AM4/10/15
to mongod...@googlegroups.com
 
Many thanks for the helpful feedback.
The script from the link works out of the box without problems.

Our system is a fresh system, and over the last 14 days we imported a lot of data from our backup files.
The import finished some days ago, but it looks like the balancer still has a lot of work to do.

I found chunks with a lot more data than the expected 64 MB.
This is the reason for the uneven number of documents.

E.g. the biggest chunk has 31 GB, so it could be split into 484 new chunks of 64 MB each.
I fear this kind of work will take weeks until the system has split all the too-big chunks and balanced them to other shards.
 
Many thanks again.
   Xaltos

Stephen Steneker

Apr 10, 2015, 7:00:58 AM4/10/15
to mongod...@googlegroups.com
On Friday, 10 April 2015 19:56:57 UTC+10, Xaltos wrote: 
Many thanks for the helpful feedback.
The script from the link works out of the box without problems.

Our system is a fresh system, and over the last 14 days we imported a lot of data from our backup files.
The import finished some days ago, but it looks like the balancer still has a lot of work to do.

Hi Xaltos,

FYI, if you are loading data into an empty collection, a common strategy to minimize or avoid re-balancing is to pre-split the chunks, given you can calculate the data distribution.

For a hashed shard key you can specify "numInitialChunks" to control the number of chunks when the collection is sharded:
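A rough sketch, where the namespace and chunk count are placeholders rather than values from your cluster:

  // Shard an empty collection on a hashed key and pre-create chunks.
  // "mydb.mycoll" and 1024 are placeholder values; sharding must already be
  // enabled on the database, and the collection must be empty.
  db.adminCommand({
    shardCollection: "mydb.mycoll",
    key: { "_id" : "hashed" },
    numInitialChunks: 1024
  })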

For a non-hashed shard key you would pre-split based on the data:
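For example, something along these lines, where the namespace, field, and split points are all placeholders:

  // Manually pre-split an empty collection on a non-hashed shard key.
  // "mydb.mycoll" and the "customer_id" split points are placeholders.
  sh.splitAt("mydb.mycoll", { "customer_id" : 10000 })
  sh.splitAt("mydb.mycoll", { "customer_id" : 20000 })
  sh.splitAt("mydb.mycoll", { "customer_id" : 30000 })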


I found chunks with a lot more data than the expected 64 MB.

This is the reason for the uneven number of documents.

E.g. the biggest chunk has 31 GB, so it could be split into 484 new chunks of 64 MB each.

When you imported the data, were all of your config servers online?

Normally you should not see such massive chunks unless the cluster metadata was read-only (i.e. one or more config servers were down) or you have a shard key which does not have enough granularity. Given you mentioned using a hashed shard key on _id, that should eliminate granularity as a likely cause.

A few questions:

 - When you restored the data, how many mongos did you insert via?

 - What exact version of MongoDB 2.4.x is this?

 - Can you check if these chunks have been flagged as jumbo chunks (aka unsplittable):

  use config
  db.chunks.find({ jumbo: true })

It's possible to manually split chunks (http://docs.mongodb.org/v2.4/tutorial/split-chunks-in-sharded-cluster/), but it would be helpful to first understand why you have such large chunks to begin with.
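If you do go down that path, for a hashed shard key I believe you need the bounds form of the split command (splitting at a "middle" document isn't meaningful for hashed values); a rough sketch, with a placeholder namespace and the bounds of the oversized chunk taken from sh.status():

  // Hypothetical manual split of one oversized chunk on a hashed shard key.
  // "mydb.mycoll" is a placeholder; bounds are the chunk's current min/max.
  db.adminCommand({
    split: "mydb.mycoll",
    bounds: [ { "_id" : NumberLong("-9220724471370054775") },
              { "_id" : NumberLong("-9217729578432084058") } ]
  })

I believe this splits the chunk roughly in half, so you would repeat it on the resulting chunks until they are down to a reasonable size.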

Regards,
Stephen

Xaltos

Apr 13, 2015, 5:41:25 AM4/13/15
to mongod...@googlegroups.com
 

Hello again,

Here are the answers to your questions:



 - When you restored the data, how many mongos did you insert via?


We used just one mongorestore process to load the data.

But our application was already active during this time, so one mongos process also wrote some data into the db every 10 minutes.

So 99% of the data has been inserted by the same process.


 
 - What exact version of MongoDB 2.4.x is this?


We are using 2.4.13.

The upgrade to MongoDB 3 is already on the schedule, but not with the highest priority.


 
 - Can you check if these chunks have been flagged as jumbo chunks (aka unsplittable):


We have no jumbo chunks right now, but I saw several of them during the import process.



I have been tracing the number of chunks on every shard since last Friday, to get a better understanding of the activity.

     Friday 15:00   795 chunks  ==>   Monday 10:00    831 chunks
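For the record, I am counting the chunks per shard from the config database, roughly like this (the namespace here is just a placeholder for our collection):

  // Count chunks per shard for one sharded collection.
  // "mydb.mycoll" is a placeholder for the real namespace.
  db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "mydb.mycoll" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } },
    { $sort: { chunks: -1 } }
  ])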

 

So the balancer is working, but it is not very fast and might need a very long time to fix the problem.

 

Tracing the number of documents on each shard also shows an interesting effect.

The two big shards still grow faster than the other shards.

I think this works like gravity: the chunks with a too-big range also collect a big part of the new data, and the balancer process is currently too slow to compensate for it.

 

I will test a manual split as a first step, and if this isn't good or fast enough, start over with a fresh import.

Thanks for your support.
    Xaltos


 