Sharding - Unevenly distributed documents with perfectly evenly distributed chunks.


Andre

unread,
Apr 20, 2014, 1:23:16 PM4/20/14
to mongod...@googlegroups.com
I'm stuck and need help - I'm not sure if anyone has seen and resolved this problem. We have TBs of data in multiple collections that are sharded on DB v2.4. All collections are perfectly distributed across shards except for one, as you can see below: the chunks are evenly distributed, but the documents are way off. The problem is only on this single collection and I really need to figure out a way to fix it ASAP, as adding more shards does not help much - the first shard always keeps the larger share. Any help will be greatly appreciated. This collection was recently sharded using a compound index with about 4 fields.


Shard db1shard1 at db1shard1/db1shard1a:27017,db1shard1b:27017
 data : 336.38Gb docs : 140688342 chunks : 1682
 estimated data per chunk : 204.78Mb
 estimated docs per chunk : 83643

Shard db1shard2 at db1shard2/db1shard2a:27017,db1shard2b:27017
 data : 75.87Gb docs : 32062992 chunks : 1682
 estimated data per chunk : 46.18Mb

Totals
 data : 412.25Gb docs : 172751334 chunks : 3364
 Shard db1shard1 contains 81.59% data, 81.43% docs in cluster, avg obj size on shard : 2kb
 Shard db1shard2 contains 18.4% data, 18.56% docs in cluster, avg obj size on shard : 2kb
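
For anyone wanting to reproduce these stats: output in this shape should come from getShardDistribution() in the mongo shell (the database and collection names below are just placeholders):

mongos> use mydb
mongos> db.events.getShardDistribution()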

Asya Kamsky

unread,
Apr 20, 2014, 11:56:43 PM4/20/14
to mongod...@googlegroups.com
What are the shard key fields - is there high enough granularity there?

You might also check whether you have any jumbo chunks on the first shard; try:

mongos> use config
mongos> db.chunks.count({jumbo:true})

What's the count?   If it's high you might also try:

mongos> db.chunks.aggregate({$group:{"_id":"$shard",jumbos:{$sum:{$cond:["$jumbo",1,0]}}}})

The output of this might help figure out what's going on here.
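
If the count is non-zero, you could also peek at a few of the offending chunks to see which ranges they cover (the ns value below is a placeholder - substitute your own database.collection):

mongos> db.chunks.find( { ns: "yourdb.yourcollection", jumbo: true },
...                     { shard: 1, min: 1, max: 1 } ).limit(5).pretty()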

Asya

Andre

unread,
Apr 21, 2014, 9:14:46 AM4/21/14
to mongod...@googlegroups.com
Hi Asya,

First off, thanks so much for replying. I'm assuming the shard key (or the balancer) is the problem; it's currently (application id [objectid], date [YYYYMMDD], event [arbitrary string value]) - an index. Is it possible that larger applications are all going into a single shard? This is the only collection with this type of key; for every other collection I use _id (application id plus random characters) and they balanced perfectly.

I just realized that we have this same problem in our test environment and hadn't noticed it before we sharded the collection. We all assumed that, after trying multiple keys, the one with the most balanced chunks was what we needed to scale. It never occurred to us that the chunks could be balanced while the documents were the opposite, so we didn't check document counts - and even now I still don't understand how this is even possible.

I'm not sure what counts as high, but after running db.chunks.count I got 825. I don't know how to interpret that - is "high" relative to the total number of chunks?

This is what I got for the db.chunks aggregate. I don't know exactly what it means, but it seems db1shard1 has all the jumbos? Please let me know what I need to do - I'll be looking forward to your response.

{
        "result" : [
                {
                        "_id" : "db1shard1",
                        "jumbos" : 825
                },
                {
                        "_id" : "db1shard2",
                        "jumbos" : 0
                }
        ],
        "ok" : 1
}

Andre

Asya Kamsky

unread,
Apr 21, 2014, 7:13:23 PM4/21/14
to mongodb-user

For jumbo chunks, "high" is anything greater than 0.

Jumbo chunks are chunks that could not be split when they grew beyond maxChunkSize (64MB by default), and the reason a chunk might not be splittable is that the shard key is not granular enough.

Here's an example:

Your shard key is (applicationId [objectid], date [YYYYMMDD], event [arbitrary string value]).

Let's say that on a particular day (20130228) a particular application (with Id ObjectId("512fb5dca31a92aae2a214b9")) had a very large number of events for "eventFooBar98". In fact, imagine that all those documents on that day totaled way more than 64MB - say it was 450MB - and now the balancer went to move this chunk from shard1 to shard2.

Oh no, this chunk is too big to move!  

So the balancer tells the shard:  "Hey, you need to split this chunk, it's too big to migrate as it is now!"

But the shard says: "I cannot split this chunk, it only represents a single value of the shard key as it is!"

So then everyone agrees: "Okay, we can't split it, and it's too big to move, so let's mark it JUMBO in the config so that we will not keep trying to split and move it again".

Okay,  so that's how you get a jumbo chunk.

Now, what should the balancer do to continue balancing the cluster?   You probably guessed it - it moves on to the next chunk.

If it's not too big to move, then it gets migrated.

So what ends up happening is that eventually the number of chunks is even, but sadly the 825 chunks left on shard1 are jumbo and they are contributing WAY more than their average share of documents to the size of the shard...

Now you know why we say you should make sure your shard key is "granular" enough or has high "cardinality" - it's to make sure that chunks can be split into smaller chunks...
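
If you want to put a number on that for your data, one rough check (just a sketch - the collection and field names are placeholders, and on a collection your size you'd want to run it against a secondary or a sampled subset, since $group in 2.4 runs in memory) is to count documents per distinct shard-key value and look at the largest groups; any single value holding far more than 64MB worth of documents can never be split:

mongos> db.yourCollection.aggregate(
...     { $group: { _id: { app: "$appId", date: "$date", ev: "$event" }, docs: { $sum: 1 } } },
...     { $sort: { docs: -1 } },
...     { $limit: 5 }
... )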

So now that you know how that's possible, the question is "what should you do about it?" or rather what *can* you do about it?

I guess I would need to know more about your use case - you said you tried multiple shard keys - what were some of the other ideas you had and why did they get rejected?   There's a lot of reading on picking a good shard key, but I know it can be rather overwhelming to take in all at once.

Asya






Andre

unread,
Apr 21, 2014, 8:08:29 PM4/21/14
to mongod...@googlegroups.com
Asya

I couldn't have asked for a better explanation - you are awesome! It is definitely a mistake we made on our side: we wanted to use an existing index or a hashed key and simply stopped at the first key that made the chunks look equal. We failed to realize that equal chunks doesn't mean evenly distributed documents. We've been using Mongo since it was conceived and keep learning every day - sharding is just new to us. We had to shard quickly and there was not enough information available to clearly explain shard keys and how to identify a good one; we were definitely not aware of jumbo chunks. There is a lot of material but no definitive explanation - we got lost. I wish there were a shard key evaluator of some sort, but at least we know what to look for now. We were, however, aware of the consequences of choosing the wrong shard key, as that was clear in all the docs.

The problem we are faced with now is that we have TBs of data and any sort of migration takes hours or days. Since this shard was running out of space, we quickly added a new SSD which bought us 30% more space, so we have a few days to do something about it and no room to add more.

At this point we have the standard _id that we could hash, we have a create date field (UTC), and we have other fields that have worked well as shard keys in other collections (_id) since this has some relational data. However, out of 50+ fields, the following are the only ones guaranteed to be there: (_id, appid, account id (objectid), create date utc, yyyymmdd, event name (string)) - everything else is completely optional.

What would you suggest?

Andre

Asya Kamsky

unread,
Apr 23, 2014, 2:28:40 PM4/23/14
to mongodb-user
Andre,

Sounds like you already know the requirements for a shard key:

 Must be set on insert. 
 Cannot be changed (value cannot be updated).
 Should be present in the majority of your queries (to achieve targeted reads, where queries go to only one shard)
 Should distribute your documents relatively evenly across the entire range of its possible values (to distribute writes as evenly as possible among all your shards)
 Should be granular enough to allow chunk splits to happen as the data grows...

Ultimately, you are the expert in what your data is, what your requirements are and how you access the data - for this reason you are in the best position to figure out which fields will work best for the shard key.

But if you want to run your candidates by this list of requirements and "suggestions" and then follow up here about whether something might be a good shard key, we can try to help as much as we can (given we don't know nearly as much about your system as you do!).
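
As a rough starting point for the "granular enough" and "evenly distributed" points (a sketch only - collection and field names are placeholders, and distinct() results must fit in 16MB, so for very high cardinality fields you'd use an aggregation instead), you can compare candidate fields from the shell:

mongos> db.yourCollection.distinct("appid").length      // how many distinct values a candidate field has
mongos> db.yourCollection.distinct("event").length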

Asya



Andre

unread,
Apr 24, 2014, 3:25:56 AM4/24/14
to mongod...@googlegroups.com
Thanks Asya. I did a local test with about 30GB of data using a hashed _id (ObjectId) and that balanced it at roughly 49%/50% of documents and almost equal chunks. Using _id (hashed) seems to satisfy everything except targeted reads. This seems to be a common dilemma I've seen around: most people can only create shard keys optimized for heavy reads or heavy writes, not both. We have constant heavy writes, so we definitely want to keep those flowing. We also have high reads because of our scheduled ad-hoc reporting and map/reduce jobs - hundreds of them, and these only grow as our customer base grows. Our customers can wait a bit for reports to be created, but we definitely cannot afford to lose data or an imbalanced cluster that causes downtime.

It would be great to have a perfect key with balanced writes and reads, but as you mentioned, this key must (should) be in all queries for it to be beneficial, and the only fields that are required on all reads are (ApplicationID (ObjectId), Date Range (YYYYMMDD)) - everything else is optional. Given our current mistake, I'm assuming the combination of just these will be a bad key, since all events for a single app on a given day will go into a single chunk.

Would hashing those keys be better, or reversing their order, or should I just stick with a hashed _id, gain evenly distributed writes, and have MapReduce deal with finding documents for reporting?

Here is the schema of what the minimal document looks like:

{
    "_id" : ObjectId("52b382ba96da760da00df869"),
    "aid" : ObjectId("4dde483624023247aca3ace8"), -->account id
    "appid" : ObjectId("5258365365a7f22224de9f17"),
    "c" : {
        "ev" : {
            "n" : "<name of event>",
        },
    },
    "dto" : {
        "ymd" : 20131124
    },
    "_co" : ISODate("2013-11-24T22:29:27.000Z")
}

But as mentioned, only appid & dto.ymd are required for reads, using the following index (appid_1_dto.ymd_1_c.ev.n_1).
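
To check whether a given key keeps that read targeted, I assume I can run the report query with explain() through mongos on a test cluster sharded with the candidate key and see which shards it hits (collection name and values below are just examples based on the sample doc above):

mongos> db.events.find( { appid: ObjectId("5258365365a7f22224de9f17"),
...                       "dto.ymd": { $gte: 20131101, $lte: 20131130 } } ).explain()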

Asya Kamsky

unread,
Apr 26, 2014, 9:44:48 PM4/26/14
to mongodb-user
Andre,

Interesting - I wouldn't dismiss a non-hashed key quite so quickly... If reads specify app_id and a date range, those would be targeted if you sharded on (app_id, date), but the question would then be how well that would balance the writes.

I'll give you an analogous example:  Twitter.

When the writes come in, the most important thing is to not lose the writes and to ingest them as quickly as possible.  That means they should be distributed evenly across the cluster so that you can add machines as the load goes up and gain capacity.

Now, how do the reads happen?   You might read one person's timeline, you might read multiple people's timelines, but you're always reading the most recent data (almost exclusively).

So you want to shard by something like user, created_date:
   for writes (single user can't tweet thousands of messages in the same second and usernames or _ids can be evenly distributed).
   for single user reads - when you ask for my timeline it will be a targeted query (goes to my shard) and having created date adds data locality as recent tweets are guaranteed to be in one chunk even if I tweet so much that my data didn't fit in a single chunk (and it was split by date).

But what about the most common reads in Twitter - people wanting to read the timeline of all of their friends, or all of the tweets on a particular keyword or hashtag?

Well, a few things - yes it will be a scatter gather query.  But does the data have to come back strictly in order?   Nope.
Does it matter if you only return data from one shard or ten (out of hundreds) of shards?   Again, no.  The most important thing is to return some (correct) data *fast*.   But all the data?  I don't think anyone would ever want (or expect) that.

So for something like Twitter the trade-off is easy in a conflict like that - writes are important but individual reads are basically not. Or rather, they are somewhat interchangeable.

In your case it looks like the writes must be balanced well, but the reads - generally reports and map/reduce type jobs - are *not* sensitive to latency, and in fact you might want to offload them to designated secondaries (if you can) so that they don't slow down the writes (they might cause the "reporting" secondaries to lag sometimes, but if you are okay with stale data, as historical reports frequently are, then it's no big deal).
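
If you do route reporting reads to secondaries, a minimal sketch from the shell (collection name is a placeholder; drivers have equivalent read preference options) would be something like:

mongos> db.getMongo().setReadPref("secondaryPreferred")                 // for this connection
mongos> db.events.find({ /* report query */ }).readPref("secondary")    // or per-query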

You want to be careful with scatter-gather map/reduce jobs or long running unindexed queries over all your data - they will basically mess with your working set and RAM utilization on every single shard!

Luckily they are frequently not required to complete too fast.  By the way, if you are going to move to 2.6 soon, maybe you can replace some of the map/reduce jobs by aggregation framework - it will be a lot faster.
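
As a sketch of what that could look like (field names taken from your sample document; the grouping, date range, and output collection name are made up, and $out requires 2.6), a per-day event count that might be a map/reduce job today could become a single aggregation that writes its result to a collection:

mongos> db.events.aggregate([
...     { $match: { appid: ObjectId("5258365365a7f22224de9f17"),
...                 "dto.ymd": { $gte: 20131101, $lte: 20131130 } } },
...     { $group: { _id: { day: "$dto.ymd", event: "$c.ev.n" }, count: { $sum: 1 } } },
...     { $sort: { "_id.day": 1 } },
...     { $out: "eventCountsByDay" }
... ])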

Asya






Andre

unread,
May 9, 2014, 1:45:43 PM5/9/14
to mongod...@googlegroups.com
Asya,

Thanks, and sorry for the late response - I was out of the country. Yes, app_id & date were targeted but horribly balanced; I mean the other shards were barely utilized while the first was almost completely out of disk space. For now, our highest priority is to swallow as many writes as possible, balance them evenly across all shards, and preferably stay away from jumbo chunks. At this time MR runs really fast and I don't think customers will mind waiting a few more minutes for scheduled jobs during a scatter-gather.

Our UI is the only piece that reads recent data when a customer logs in - 2 weeks' worth, but it's barely used. Most customers schedule jobs that pull months of data and email it to themselves weekly. So recent data is not our priority when it comes to MR - all of it is. There are, however, reads that are sensitive to latency, but those are pulled a single record at a time by _id, so that is not a problem.

Our MR jobs pull indexed data and filter from within the map function, and yes, we use secondaries for that, which are usually at 20% RAM use, so I think we are good so far regarding the secondary working set and scatter/gather.

Yes, we do plan on replacing all MR jobs with aggregation; the only reason for using MR was that we couldn't dump the result into a collection, but with 2.6 we can. So we will upgrade as soon as we resolve the balancing issue and run some tests.

Using a hashed shard key, I get the following, with no jumbos:

Shard shard0000 at localhost:10001
 data : 18.44GiB docs : 8833037 chunks : 402
 estimated data per chunk : 46.98MiB
 estimated docs per chunk : 21972

Shard shard0001 at localhost:10002
 data : 18.83GiB docs : 9022737 chunks : 401
 estimated data per chunk : 48.11MiB
 estimated docs per chunk : 22500

Totals
 data : 37.28GiB docs : 17855774 chunks : 803
 Shard shard0000 contains 49.46% data, 49.46% docs in cluster, avg obj size on shard : 2KiB
 Shard shard0001 contains 50.53% data, 50.53% docs in cluster, avg obj size on shard : 2KiB

What do you think? From what we've discussed it looks like this is what I need - do you think this will work? I'm going to run this one more time to be sure, but I'd like to fix production soon and I'd hate to have to rebuild this collection again, as even with SSDs it takes 2 days or more.

Andre

Asya Kamsky

unread,
May 13, 2014, 5:19:17 AM5/13/14
to mongodb-user
Andre,

This will definitely work to give you an even distribution, and it
sounds like the small loss of efficiency won't be an issue given the
types of queries that are most prevalent.

Of course, the key is to test (as you are doing) and then to monitor
performance and keep up with your capacity planning as your load
increases.

Asya

Andre

unread,
May 13, 2014, 9:28:15 AM5/13/14
to mongod...@googlegroups.com
Thanks Asya. Yes, I mongodumped the collection, dropped it, and re-sharded the collection with hashed (_id), and it seems to be perfectly balanced. However, mongorestore is ridiculously slow: the mongodump took about 3 hours, but 1 day into the mongorestore I'm only at about 12% - that is very unexpected. Why so slow? We debated whether we should restore through the mongod's we dumped from or through mongos, and we settled on mongos to be safe, but gracious it is slow. I know there is the ability to pre-split, but that is another thing that is not very well documented, and to prevent another catastrophic mistake we decided not to, since we don't have a full understanding of it. Anyway, we now know for sure that we will do everything in our power to never use mongorestore again. It looks like it's only pushing 10MB at a time; this is going to take a week. Is there anything I can do to speed it up - will turning off the balancer help? It's on because we don't think there is enough space to fit it all into a single shard and balance later, so we wanted some balancing going on as we restore.

Secondly, what happens if you kill mongorestore and re-run it - will it skip/error on all the documents that were already inserted and continue? The documentation seems to say so, but I just wanted to double check, as we were considering running a local mongos in the hope it will gain some speed and eliminate the initial network call.

Yes, we've learned our lesson, we will also test shard keys a billion times before deploying.

Andre.

Asya Kamsky

unread,
May 13, 2014, 12:39:38 PM5/13/14
to mongodb-user
If you shard on hashed index, then pre-splitting and pre-balancing
happens automatically.

In other words, the balancer should not be doing anything during the
load, assuming that the sharding of the collection went without an
issue.
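
For reference (the namespace is a placeholder), sharding an empty collection on a hashed key creates and spreads the initial chunks for you, and you can ask for more of them up front with numInitialChunks:

mongos> sh.shardCollection("mydb.events", { _id: "hashed" })
mongos> // or, on an empty collection, pre-create more chunks up front:
mongos> db.adminCommand({ shardCollection: "mydb.events", key: { _id: "hashed" }, numInitialChunks: 1000 })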

What version is this exactly? Can you show what the sh.status()
output for this collection is?

Part of the problem, of course, is that mongorestore is basically a
single process; if you had the dump split into multiple "dumps" you
could be loading them in parallel through multiple mongos (or even
through one)...

Come to think of it, you can speed it up by starting a second mongorestore ...

You can figure out which _id value is about half way past where the
running mongorestore is (or a little past that, or maybe way towards
the bottom of the file, as a test) and start a second one with the
--filter option and the query {_id:{$gt:<that _id value>}}, which will
start that mongorestore from a later point in the bson file.

http://docs.mongodb.org/manual/tutorial/backup-with-mongodump/#backup-restore-filter

Since mongorestore does not overwrite existing documents, it's safe to
do it that way. If it's working, the second mongorestore will restore
from that point till the end of the file, and eventually the first one
will get to the point where the second one started; then you'll start
seeing "duplicate _id value" messages in the *MONGOD* logs (so you'd
have to keep an eye out for that and stop the original mongorestore at
that point).

You probably want to make sure that you give the second mongorestore
--noIndexRestore flag and that you *don't* specify the drop option
(which you should *not* have specified with the first/currently
running mongorestore either, of course).
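
Putting that together, the second run might look roughly like this (host, database, collection, dump path, and the _id value are all placeholders - and double-check the --filter syntax your tools version accepts):

mongorestore --host yourmongos:27017 --db yourdb --collection events \
    --noIndexRestore \
    --filter '{ "_id" : { "$gt" : { "$oid" : "52b382ba96da760da00df869" } } }' \
    /path/to/dump/yourdb/events.bson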

Asya

Andre

unread,
May 13, 2014, 2:05:07 PM5/13/14
to mongod...@googlegroups.com
We are currently running 2.4 (2.4.6, I believe); we will be migrating to 2.6 after the smoke clears.

sh.status() looks normal and there are no jumbos - we are excited about that. If only we could speed this up; we still need to test querying to make sure we are OK with isolation vs. scatter/gather, and then we will be at peace.

 shard key: { "_id" : "hashed" }
 chunks:
         db1shard1       767
         db1shard2       775
 too many chunks to print, use verbose if you want to force print

Yes, I did split the dump into 2 files on separate SSDs and I'm running them both now. I guess I should have split them into 50 or something larger than 2 - didn't think much about that.

Yes, you are right, I could technically run 2 per file, one from the top and another from the bottom, and cut the time in half - but are the documents stored in _id order in the file? I wasn't sure if it was natural order or not. I'm also not sure how I would figure out the _id, since these were sharded before and each file has random chunks.

Interesting - I didn't know that mongorestore will try to recreate an index if it already exists. One of the things we did was to create (or I should probably say define) all the indexes ahead of time, so we wouldn't have to wait at the end for the index build. I thought mongorestore would be smart enough to see that an index exists and ignore it, rather than drop and recreate it. If it's going to do that at the end, then I guess we will have to stop the processes - or can I delete or rename the index files? Can I safely delete the *.metadata.json file if it's going to drop the already-populated index and waste time rebuilding it?

Asya Kamsky

unread,
May 13, 2014, 3:11:07 PM5/13/14
to mongodb-user
Andre,

No worries about (re)creating indexes - an attempt to create an index
that already exists is a no-op. But it's always a trade-off: you
don't have to wait for index builds at the end, but your inserts are
somewhat slower because the indexes already exist.

However, I do strongly recommend upgrading to 2.4.10 as there were a
couple of bug fixes that were related to initial sharding of a
collection with a hashed shard key. It doesn't sound like you are in
one of the scenarios where it was more likely to trigger, but it's
always good to keep up with the latest bug fixes.

Asya

Andre

unread,
May 13, 2014, 11:21:41 PM5/13/14
to mongod...@googlegroups.com
Thanks Asya, you've been really helpful through these crazy times for us. Yes, we created the indexes ahead of time because we wanted to keep production alive - we run with notablescan enabled, so without the indexes our apps would break.

I'll keep this thread updated with progress; so far we are just sitting, watching, and waiting for completion. The cluster is perfectly balanced - so that's good news. Once this is done we will test query isolation, ensure our MR jobs stay fast enough, add a new shard, and plan the upgrade to 2.6.

Andre.