Hey Rob!
A few questions and a comment.
First question: what version of MongoDB are you using?
Second question: it sounds like you're running a 'mapReduce' command against the 'records' collection and using the 'out' parameter to direct the output to the 'patient_cache' collection. Please confirm that this is what you're doing.
Third question: would you please post the output of sh.status() from this cluster? (You can redact the host names if you like.)
The comment is that you're making an incorrect assumption if you think that sharding two collections by the same shard key will ensure that matching documents live on the same shard. MongoDB assigns chunk ranges to shards quasi-randomly, so even if you shard both collections with the same shard key, the chunk boundaries are likely to differ and the individual chunks will almost certainly land on different shards. (The only exception would be if you're using tagged shards.)
It's not clear what you're trying to accomplish here. If you want to later join the 'records' and 'patient_cache' collections, you'll need to do an application-level join and route both queries through 'mongos' anyway, so there is no need for the documents from the two collections to be co-located on a single shard.
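For illustration, here's a rough sketch of what I mean by an application-level join. The field name 'patient_id' and the query shapes in the comments are guesses on my part since I don't know your schema -- the point is just that you run two separate queries through 'mongos' and stitch the results together in your own code:

```javascript
// Join two result sets in application code, keyed on a shared field.
// In practice, 'records' and 'cacheDocs' would each come from a
// separate query routed through mongos, something like:
//   const records   = db.records.find({ /* your filter */ }).toArray();
//   const cacheDocs = db.patient_cache.find({ patient_id: { $in: ids } }).toArray();
function joinByKey(records, cacheDocs, key) {
  // Index the cache documents by the join key for O(1) lookups.
  const byKey = new Map(cacheDocs.map(doc => [doc[key], doc]));
  // Attach the matching cache document (or null) to each record.
  return records.map(rec => ({ ...rec, cached: byKey.get(rec[key]) ?? null }));
}

// In-memory stand-ins for the two query results (hypothetical fields):
const records = [
  { patient_id: 1, visit: "2014-01-02" },
  { patient_id: 2, visit: "2014-01-05" },
];
const cacheDocs = [{ patient_id: 1, summary: "stable" }];

const joined = joinByKey(records, cacheDocs, "patient_id");
// joined[0].cached.summary === "stable"; joined[1].cached === null
```

Because both queries go through 'mongos', it doesn't matter which shards actually hold the chunks for either collection.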
-William