MongoDB: choosing a proper shard key and chunk size


Leon Pajak

Apr 9, 2014, 4:49:41 PM
to mongod...@googlegroups.com

Hi,

I wonder whether I should choose a more precise shard key or allow chunks larger than 64MB.

Here is a detailed description of my problem:

My collection "Posts" contains posts which belong to "Projects" (each "Posts" document contains a projectId field). All my operations include projectId = ... in the query condition, so at first sight the best shard key is projectId. Very often I also use the "createdDate" field in the query condition, but not always.

So I created the Posts collection sharded by projectId. But I realized that some projects (5% of all projects) contain so many mentions that the total size of all documents with the same projectId will be larger than the default chunk size (64MB).

Should I choose a more specific shard key (e.g. a compound key: projectId, createdDate) to avoid chunks larger than 64MB, or should I let the chunks grow beyond 64MB? I want to focus on fast read queries (especially aggregations).


Asya Kamsky

Apr 10, 2014, 2:09:37 AM
to mongodb-user
You should absolutely NOT change the chunksize but you *do* need a
more granular shard key.
I recommend you consider projectId,_id as a compound key - using
projectId,createdDate might be problematic since it's unlikely you
always provide the date to queries (and updates!)

There is a good blog post on exactly this use case here:
https://bugsnag.com/blog/mongo-shard-key

(just disregard the advice to shard on _id:"hashed" - you want to read
about sharding on project, _id combination).

Asya
> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user"
> group.
>
> For other MongoDB technical support options, see:
> http://www.mongodb.org/about/support/.
> ---
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mongodb-user...@googlegroups.com.
> To post to this group, send email to mongod...@googlegroups.com.
> Visit this group at http://groups.google.com/group/mongodb-user.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/mongodb-user/f80142cd-7a38-4268-90ef-148956173e27%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Sam Millman

Apr 10, 2014, 3:22:19 AM
to mongod...@googlegroups.com
My opinion on this, since shard key can be an opinionated topic at times:

> Very often i also use "createdDate" field in query condition, but not always.

Just because you use it doesn't mean it is fit for the shard key. It is much like indexes in this sense: sometimes a field is not actually a good fit for an index.

What you have got to ask yourself is: Do I update by this field? Do I use this field as a top-level key?

In this sense I would answer no; projectId and _id are used for this stuff. The date is used once you have selected the shards you need.

Now you *could* use a hashed _id, but I think the reason @Asya said not to is your use of projectId (I think; she doesn't make her reasoning clear there). As such, a hashed _id might not be the best idea, since when you come to query by projectId you must do a scatter-gather operation.

I did see your question on SO but was busy at the time so sorry that I didn't answer it.


Sam Millman

Apr 10, 2014, 3:24:50 AM
to mongod...@googlegroups.com
Of course, now reading:

> But i realized, that some projects(5% of all projects) contains so many mentions, that total size of all documents with the same projectId will be larger than default chunk size (64MB).

Adding the _id will help MongoDB to split up those projects by their _id.
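
A minimal sketch of why the extra field helps, assuming that a chunk can only be split between two distinct shard key values. The helper name `countSplitPoints` and the sample documents are hypothetical; this is plain JavaScript that runs in Node, not MongoDB's own splitting logic.

```javascript
// Hypothetical helper: counts the places where a chunk could be split,
// i.e. the boundaries between distinct shard key values.
function countSplitPoints(docs, keyFields) {
  // Project each document onto the shard key fields and de-duplicate.
  const distinctKeys = new Set(
    docs.map((doc) => JSON.stringify(keyFields.map((f) => doc[f])))
  );
  // n distinct key values leave n - 1 boundaries to split on.
  return Math.max(distinctKeys.size - 1, 0);
}

// Sample data: one large project whose posts all share the same projectId.
const bigProject = [
  { _id: 1, projectId: 42 },
  { _id: 2, projectId: 42 },
  { _id: 3, projectId: 42 },
];

const splitsOnProjectId = countSplitPoints(bigProject, ["projectId"]);
const splitsOnCompound = countSplitPoints(bigProject, ["projectId", "_id"]);

console.log(splitsOnProjectId); // 0: the chunk cannot be split
console.log(splitsOnCompound);  // 2: it can be split between _id values
```

With projectId alone, every document of a big project maps to the same key value, so there is nowhere to cut; with (projectId, _id) each document has its own key value.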

Leon Pajak

Apr 10, 2014, 4:27:49 AM
to mongod...@googlegroups.com
Thank you very much for the answers. So I will change my shard key from {projectId} to {projectId, _id}. I have two more questions:

  1. In the old version of my application (on PgSQL, without sharding) I do all single-row updates with WHERE id = .... If I have the compound shard key {projectId, _id} in MongoDB, should I change the update query to use both projectId and _id (projectId = ... AND _id = ...)? Will it be faster and better than querying only by _id?
  2. Most of my read queries don't query by _id, but when I have a read query which fetches a document by _id, should I query only by _id or by projectId = ... and _id = ...?

Asya Kamsky

Apr 10, 2014, 4:09:02 PM
to mongodb-user
Hi Leon,

It is always going to be more performant and scalable to use the full
shard key when you have it.

So you should absolutely provide projectId and _id when you update or
query by _id (since you always have it).

Asya
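
A rough sketch of the routing difference, under the simplifying assumption that the router can target a query only when the filter includes a leading prefix of the shard key. `routeQuery` is a hypothetical name, and real mongos routing handles more cases than this model.

```javascript
// Hypothetical model of routing: returns "targeted" when the filter pins
// down a leading prefix of the shard key, otherwise "scatter-gather".
function routeQuery(shardKeyFields, filter) {
  let prefixLength = 0;
  for (const field of shardKeyFields) {
    if (!(field in filter)) break;
    prefixLength += 1;
  }
  return prefixLength > 0 ? "targeted" : "scatter-gather";
}

const shardKey = ["projectId", "_id"];

// Filter on _id alone: the leading field projectId is missing.
const byIdOnly = routeQuery(shardKey, { _id: "abc" });

// Filter on the full shard key.
const byFullKey = routeQuery(shardKey, { projectId: 42, _id: "abc" });

// Filter on the projectId prefix, as most read queries in this thread do.
const byPrefix = routeQuery(shardKey, { projectId: 42 });

console.log(byIdOnly);  // scatter-gather
console.log(byFullKey); // targeted
console.log(byPrefix);  // targeted
```

In this model a lookup by _id alone has to ask every shard, while adding projectId lets the router go straight to the shard that owns the document.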

Leon Pajak

Apr 13, 2014, 3:47:29 PM
to mongod...@googlegroups.com
Hi, thanks everyone for the advice. I was going to recreate the collection with the compound shard key (projectId, _id), but a new problem occurred:

"can't shard collection '...' with unique index on { projectId: 1, url: 1 } and proposed shard key { projectId: 1.0, _id: 1.0 }. Uniqueness can't be maintained unless shard key is a prefix"

I have to have a unique constraint on the pair (projectId, url). I thought about changing my compound shard key to (projectId, url), but I must be able to update the url field (sometimes I need to do it). I've read that fields which are part of the shard key can't be updated, so I can't choose (projectId, url) as the shard key.

Knowing all of the above, I think that I don't have any good options and I need to choose only projectId as the shard key. I know there will be a problem with chunks bigger than 64MB, but I don't see any other option. Maybe one of you has some advice on solving that problem? If not, please tell me at least whether having oversized chunks is a very big crime in MongoDB :)

Asya Kamsky

Apr 14, 2014, 12:47:01 AM
to mongodb-user
I would not be in a rush to make projectId your shard key - I believe you may run into issues if you do that as you will end up with jumbo unsplittable chunks.

The fact that the system can only *enforce* uniqueness in each shard does not mean you cannot have a unique index in a sharded cluster on a non-shard key - it just means you must be aware of a possible race condition which could let two identical values be inserted simultaneously into two different shards. If the risk of that in your application is negligible, you would be better off sharding on projectId,_id and creating a unique index on projectId,url (you can do it on each individual shard).

Asya
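
A sketch of how a duplicate can slip through when each shard checks uniqueness of (projectId, url) only against its own documents. The class name `Shard` and the placement of the two inserts are assumptions for illustration; this is an in-memory model, not MongoDB code.

```javascript
// Hypothetical in-memory shard that enforces a unique (projectId, url)
// index over its own documents only.
class Shard {
  constructor() {
    this.uniqueIndex = new Set();
  }
  insert(doc) {
    const key = JSON.stringify([doc.projectId, doc.url]);
    if (this.uniqueIndex.has(key)) {
      throw new Error("duplicate key: " + key);
    }
    this.uniqueIndex.add(key);
  }
}

const shardA = new Shard();
const shardB = new Shard();

// With shard key (projectId, _id), two posts of the same project can
// live on different shards, so neither shard sees the other's url.
shardA.insert({ _id: 1, projectId: 42, url: "http://example.com/post" });
shardB.insert({ _id: 900, projectId: 42, url: "http://example.com/post" });

// Within a single shard the duplicate is still rejected.
let rejectedLocally = false;
try {
  shardA.insert({ _id: 2, projectId: 42, url: "http://example.com/post" });
} catch (err) {
  rejectedLocally = true;
}

console.log(rejectedLocally); // true
```

Both cross-shard inserts succeed in this model, which is the case the application has to tolerate or prevent if it relies on per-shard unique indexes.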




Leon Pajak

Apr 14, 2014, 4:51:18 AM
to mongod...@googlegroups.com
I calculated the future chunk size once more, and I shouldn't shard on projectId alone. Some chunks may grow to 1GB in size...
If I have to shard with a compound shard key, the prefix of the shard key is certainly projectId. What should the second field be?

  1. If I choose (projectId, url), I will be unable to update the url field. But I do that very rarely, and if I wanted to do it with (projectId, url) as the shard key, I would delete the whole document and insert it again with the changed url field.
  2. If I choose (projectId, _id), I will be unable to have a unique key on (projectId, url). Asya, you wrote that in that case I can create a unique index on (projectId, url) on each individual shard. How do I do that? What if two shards hold data with the same projectId?

If I usually query by projectId, but not by url or _id, will queries run slower only when the documents with the same projectId don't fit in one chunk, or will queries always run slower (even when all documents with a particular projectId are in one chunk)?

Sam Millman

Apr 14, 2014, 7:09:08 AM
to mongod...@googlegroups.com
> If i choose (projectId, url) i will be unable to update url field. But i do it very rarely and if i wanted to do it having (projectId,url) as shard key, I would delete the whole document and insert again with changed url field.

That wouldn't be such a great way to do it. Granted, I haven't read why you need the url field, but that would cause two sets of I/O for every document whose url you update, even if rarely.

I actually question why you have a unique constraint on url. I mean, that is the URL of the post, right (my assumption)?



Leon Pajak

Apr 14, 2014, 8:00:11 AM
to mongod...@googlegroups.com
When I described my collection in this thread, I simplified a little to focus on the sharding issues. I know that having a unique constraint on URL in the Posts collection looks silly at first sight, but the logic of my application is much more complicated and I have to have a unique constraint on the (projectId, url) pair.

Leon Pajak

Apr 15, 2014, 8:47:55 AM
to mongod...@googlegroups.com
1. I tried to shard by (projectId, url), but I got an error:
Shard Key must be less than 512 bytes#0
Sometimes the url is longer, so it is impossible to use the URL in the shard key.
2. I tried to shard by a partly hashed index: db.mentions_old.ensureIndex( { sid:1,u: "hashed" } ), but it is impossible to create such an index ( "err" : "Currently only single field hashed index supported.")

So I can't shard by (projectId, url).

The last remaining solution is to shard by the (projectId, _id) pair, but I must resolve the problem I wrote about before:

"can't shard collection '...' with unique index on { projectId: 1, url: 1 } and proposed shard key { projectId: 1.0, _id: 1.0 }. Uniqueness can't be maintained unless shard key is a prefix"

Do you have any advice on how to shard by the (projectId, _id) pair and still have a unique constraint in MongoDB on the (projectId, url) pair?

Tim Hawkins

Apr 15, 2014, 8:51:05 AM
to mongod...@googlegroups.com
Store an extra string consisting of the MD5 hash of the url, append the creation timestamp to that string, and use that as the secondary shard key. DO NOT update the field after it has been created, even if the url changes.

On Apr 15, 2014, Leon Pajak <mojca...@gmail.com> wrote:



Is it possible to have a shard key with a hashed part?
Something like this: {projectId: 1, hashed(url): 1} ?



Asya Kamsky

Apr 15, 2014, 5:49:52 PM
to mongodb-user
That's an interesting idea.

The only negative performance implication is that you (a) are no
longer using full shard key in queries and (b) unless you can
guarantee no hash collisions, you would still not be able to guarantee
uniqueness...

Asya

Tim Hawkins

Apr 15, 2014, 10:51:25 PM
to mongod...@googlegroups.com
The only possible collision would occur if you generated two documents with the same hash at exactly the same time. The probability is extremely low. The only way I could see an issue would be a bulk import of records presorted by url hash that had adjacent runs of identical hashes; only then would you realistically get two identical hashes with the same timestamp.

Leon Pajak

Apr 22, 2014, 3:35:43 PM
to mongod...@googlegroups.com
Thank you. I will try the solution of creating a new field, url_hash, in which I store md5(url).
I will not use the timestamp, because I need a unique constraint on the (projectId, md5(url)) pair.
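
A minimal Node.js sketch of the url_hash idea, assuming the hash is stored as a hex string. The helper `urlHash` and the sample document are hypothetical. Whatever the length of the url, the hex MD5 digest is 32 characters, so a (projectId, url_hash) shard key should stay well under the 512-byte limit from the error above. As noted earlier in the thread, two different urls can in principle produce the same hash.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derives the fixed-length url_hash field from a url.
function urlHash(url) {
  return crypto.createHash("md5").update(url, "utf8").digest("hex");
}

// A sample url well over the 512-byte shard key limit quoted above.
const longUrl = "http://example.com/" + "a".repeat(600);

const post = {
  projectId: 42,
  url: longUrl,
  url_hash: urlHash(longUrl),
};

console.log(Buffer.byteLength(post.url, "utf8"));      // 619
console.log(Buffer.byteLength(post.url_hash, "utf8")); // 32
```

Because url_hash is derived from url, it would need to be recomputed whenever the url changes, which runs into the same restriction on updating shard key fields discussed above.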


