Mongodb: choosing proper shard key and chunk size

Mongodb: choosing proper shard key and chunk size Leon Pajak 4/9/14 1:49 PM

Hi,

I wonder whether I should choose a more precise shard key or allow chunks larger than 64 MB.

Here is a detailed description of my problem:

My collection "Posts" contains posts that belong to "Projects" (each "Posts" document has a projectId field). Every operation I run includes projectId = ... in the query condition, so at first sight the best shard key is projectId. I also use the "createdDate" field in query conditions very often, but not always.

So I created the Posts collection sharded on projectId. But I realized that some projects (5% of all projects) contain so many mentions that the total size of all documents with the same projectId will exceed the default chunk size (64 MB).

Should I choose a more specific shard key (e.g. the compound key projectId, createdDate) to avoid chunks larger than 64 MB, or should I let the chunks grow beyond 64 MB? I want to focus on fast read queries (especially aggregations).
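
As a rough sanity check on the sizing concern, the sketch below estimates how much data a single projectId would hold and compares it with the 64 MB default chunk size. The document counts and the average document size are made-up illustrative numbers, not measurements from this collection, and `projectDataBytes` and `exceedsChunk` are hypothetical helper names.

```javascript
// Default chunk size mentioned in the thread, in bytes.
const DEFAULT_CHUNK_BYTES = 64 * 1024 * 1024;

// Hypothetical helper: approximate total data size for one projectId.
function projectDataBytes(docCount, avgDocBytes) {
  return docCount * avgDocBytes;
}

// With projectId as the only shard key, all documents of a project share one
// shard key value, so they cannot be split into separate chunks.
function exceedsChunk(docCount, avgDocBytes, chunkBytes = DEFAULT_CHUNK_BYTES) {
  return projectDataBytes(docCount, avgDocBytes) > chunkBytes;
}

// Illustrative numbers only (assumptions, not measurements).
console.log(exceedsChunk(20000, 2048));   // 20,000 docs * 2 KB is about 39 MB
console.log(exceedsChunk(500000, 2048));  // 500,000 docs * 2 KB is about 977 MB
```

Any project for which this check is true would sit in a single chunk that cannot be split if projectId were the whole shard key.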


Re: [mongodb-user] Mongodb: choosing proper shard key and chunk size Asya Kamsky 4/9/14 11:09 PM
You should absolutely NOT change the chunksize but you *do* need a
more granular shard key.
I recommend you consider projectId,_id as a compound key - using
projectId,createdDate might be problematic since it's unlikely you
always provide the date to queries (and updates!)

There is a good blog post on exactly this use case here:
https://bugsnag.com/blog/mongo-shard-key

(just disregard the advice to shard on _id:"hashed" - you want to read
about sharding on the projectId, _id combination).

Asya
> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user"
> group.
>
> For other MongoDB technical support options, see:
> http://www.mongodb.org/about/support/.
> ---
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to mongodb-user...@googlegroups.com.
> To post to this group, send email to mongod...@googlegroups.com.
> Visit this group at http://groups.google.com/group/mongodb-user.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/mongodb-user/f80142cd-7a38-4268-90ef-148956173e27%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
Re: [mongodb-user] Mongodb: choosing proper shard key and chunk size Sammaye 4/10/14 12:22 AM
My opinion on this, since shard keys can be an opinionated topic at times:

> Very often I also use the "createdDate" field in query conditions, but not always.

Just because you use it doesn't mean it is fit for the shard key. It is much like indexes in this sense: some fields you query on might not actually be a good fit for an index.

What you have to ask yourself is: Do I update by this field? Do I use this field as a top-level key?

In this case I would answer no; projectId and _id are used for that. The date is used once you have selected the shards you need.

Now you *could* use a hashed _id, but I think the reason @Asya said not to is your use of projectId (I think; she doesn't make her reasoning clear there). A hashed _id might not be the best idea, since when you come to query by projectId you must do a scatter-gather operation.

I did see your question on SO but was busy at the time, so sorry that I didn't answer it.
Re: [mongodb-user] Mongodb: choosing proper shard key and chunk size Sammaye 4/10/14 12:24 AM
Of course, now reading:

> But I realized that some projects (5% of all projects) contain so many mentions that the total size of all documents with the same projectId will exceed the default chunk size (64 MB).

Adding the _id will help MongoDB to split up those projects by their _id.
Re: Mongodb: choosing proper shard key and chunk size Leon Pajak 4/10/14 1:27 AM
Thank you very much for the answers. So I will change my shard key from {projectId} to {projectId, _id}. I have two more questions:

  1. In the old version of my application (on PgSQL without sharding) I do all single-row updates by WHERE id = ... . With the compound shard key {projectId, _id} in MongoDB, should I change the update query to use both projectId and _id (projectId = ... AND _id = ...)? Will it be faster and better than querying only by _id?
  2. Most of my read queries don't query by _id, but if I have a read query that fetches a document by _id, should I query only by _id or by projectId = ... AND _id = ...?
Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Asya Kamsky 4/10/14 1:09 PM
Hi Leon,

It is always going to be more performant and scalable to use the full
shard key when you have it.

So you should absolutely provide projectId and _id when you update or
query by _id (since you always have it).

Asya
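
To make the advice above concrete, here is a minimal sketch of building a filter that always carries both shard key fields. The helper name `byShardKey`, the sample values, and the `title` field in the comment are hypothetical; the resulting object is the kind of filter one would pass as the first argument to an update or find on the Posts collection.

```javascript
// Hypothetical helper: build a filter containing the full shard key
// { projectId, _id } so the router can target a single shard.
function byShardKey(projectId, id) {
  if (projectId === undefined || id === undefined) {
    throw new Error("both projectId and _id are required");
  }
  return { projectId: projectId, _id: id };
}

// Sample values are made up for illustration.
const filter = byShardKey(42, "post-1001");
console.log(JSON.stringify(filter));
// In the mongo shell this filter would be used roughly as:
//   db.Posts.update(filter, { $set: { title: "new title" } })
```

Failing loudly when projectId is missing is a design choice for this sketch: it keeps a query that lacks part of the shard key from slipping through unnoticed.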
Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Leon Pajak 4/13/14 12:47 PM
Hi, thanks everyone for the advice. I was going to recreate the collection with the compound shard key (projectId, _id), but a new problem occurred:

"can't shard collection '...' with unique index on { projectId: 1, url: 1 } and proposed shard key { projectId: 1.0, _id: 1.0 }. Uniqueness can't be maintained unless shard key is a prefix"

I have to have a unique constraint on the pair (projectId, url). I thought about changing my compound shard key to (projectId, url), but I must be able to update the url field (sometimes I need to do it). I've read that fields which are part of the shard key can't be updated, so I can't choose (projectId, url) as the shard key.

Given all of the above, I think I don't have any good options and need to choose only projectId as the shard key. I know there will be a problem with chunks bigger than 64 MB, but I don't see any other option. Maybe one of you has advice on how to solve that problem? If not, please tell me at least whether having chunks that are too big isn't a very big crime in MongoDB :)

Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Asya Kamsky 4/13/14 9:47 PM
I would not be in a rush to make projectId your shard key - I believe you may run into issues if you do that as you will end up with jumbo unsplittable chunks.

The fact that the system can only *enforce* uniqueness in each shard does not mean you cannot have a unique index in a sharded cluster on a non-shard key - it just means you must be aware of a possible race condition which could let two identical values be inserted simultaneously into two different shards.  If the risk of that in your application is negligible you would be better off sharding on projectId,_id and creating a unique index on projectId,url (you can do it on each individual shard).

Asya
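
The per-shard behaviour described above can be illustrated with a toy model. This is not MongoDB code: `ToyShard`, the routing rule, and the sample documents are all assumptions made for illustration, and the model assumes the application does no duplicate check of its own before inserting.

```javascript
// Toy model, not MongoDB code: two shards, each enforcing a unique
// (projectId, url) constraint only over the documents it stores.
class ToyShard {
  constructor() {
    this.uniqueKeys = new Set();
    this.docs = [];
  }
  insert(doc) {
    const key = doc.projectId + "|" + doc.url;
    if (this.uniqueKeys.has(key)) {
      throw new Error("duplicate key: " + key);
    }
    this.uniqueKeys.add(key);
    this.docs.push(doc);
  }
}

const shards = [new ToyShard(), new ToyShard()];

// Assumed routing: a large project has been split on _id, so documents with
// _id < 1000 live on shard 0 and the rest on shard 1.
function route(doc) {
  return doc._id < 1000 ? shards[0] : shards[1];
}

function insert(doc) {
  route(doc).insert(doc);
}

insert({ projectId: 7, _id: 10, url: "http://example.com/a" });
// Same (projectId, url) but routed to the other shard: accepted, because
// shard 1 has never seen this key.
insert({ projectId: 7, _id: 2000, url: "http://example.com/a" });

let rejected = false;
try {
  // Same (projectId, url) on the same shard: rejected locally.
  insert({ projectId: 7, _id: 11, url: "http://example.com/a" });
} catch (e) {
  rejected = true;
}
console.log(shards[0].docs.length, shards[1].docs.length, rejected);
```

In this model a duplicate is caught only when both documents land on the same shard, which is why the application has to judge whether the remaining risk is acceptable.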




Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Leon Pajak 4/14/14 1:51 AM
I recalculated the future chunk size, and I shouldn't shard on projectId alone. Some chunks may grow to 1 GB...
If I have to shard with a compound shard key, its prefix is certainly projectId. What should the second field be?

  1. If I choose (projectId, url), I will be unable to update the url field. But I do that very rarely, and if I needed to with (projectId, url) as the shard key, I would delete the whole document and insert it again with the changed url field.
  2. If I choose (projectId, _id), I will be unable to have a unique key on (projectId, url). Asya, you wrote that in that case I can create a unique index on (projectId, url) on each individual shard. How do I do that? What if two shards hold data with the same projectId?

If I usually query by projectId but not by url or _id, will queries run slower only when documents with the same projectId don't fit in one chunk, or will they always run slower (even when all documents with a particular projectId are in one chunk)?








Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Sammaye 4/14/14 4:09 AM
> If I choose (projectId, url), I will be unable to update the url field. But I do that very rarely, and if I needed to with (projectId, url) as the shard key, I would delete the whole document and insert it again with the changed url field.

That wouldn't be such a great way to do it. Granted, I haven't read why you need the url field, but that would cause two sets of IO for every document whose url you update, even if rarely.

I actually question why you have a unique constraint on url. I mean, that is the URL of the post, right (my assumption)?



Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Leon Pajak 4/14/14 5:00 AM
When I described my collection in this thread, I simplified a little to focus on the sharding issues. I know that having a unique constraint on URL in the Posts collection looks silly at first sight, but the logic of my application is much more complicated and I have to have a unique constraint on the (projectId, url) pair.


Re: Mongodb: choosing proper shard key and chunk size Leon Pajak 4/15/14 5:47 AM
1. I tried to shard by (projectId, url) but I got the error:
Shard Key must be less than 512 bytes#0
Sometimes the url is longer, so it is impossible to use the URL in the shard key.
2. I tried to shard by a partly hashed index: db.mentions_old.ensureIndex( { sid:1,u: "hashed" } ), but it is impossible to create such an index ("err" : "Currently only single field hashed index supported.")

So I can't shard by (projectId, url).
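
The 512-byte limit from the error above can be checked ahead of time. The sketch below measures only the UTF-8 length of the url string, whereas the limit applies to the whole shard key, so it is an approximation: a url that already reaches the limit should be rejected, while a shorter one may still fail once the rest of the key is counted. The helper name and the sample URLs are hypothetical.

```javascript
// Limit taken from the error message quoted above.
const SHARD_KEY_LIMIT_BYTES = 512;

// Hypothetical helper: true when the url alone already reaches the limit.
function urlTooLongForShardKey(url) {
  return Buffer.byteLength(url, "utf8") >= SHARD_KEY_LIMIT_BYTES;
}

// Sample URLs are made up for illustration.
const shortUrl = "http://example.com/post/1";
const longUrl = "http://example.com/?q=" + "x".repeat(600);

console.log(urlTooLongForShardKey(shortUrl));
console.log(urlTooLongForShardKey(longUrl));
```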

The last remaining solution is to shard by the (projectId, _id) pair, but I must solve the problem I wrote about before:

"can't shard collection '...' with unique index on { projectId: 1, url: 1 } and proposed shard key { projectId: 1.0, _id: 1.0 }. Uniqueness can't be maintained unless shard key is a prefix"

Do you have any advice on how to shard by the (projectId, _id) pair and still have a unique constraint in MongoDB on the (projectId, url) pair?

Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Tim Hawkins 4/15/14 5:51 AM
Store an extra string consisting of the MD5 hash of the url, append the creation timestamp to that string, and use that as the secondary shard key. DO NOT update the field after it has been created, even if the url changes.

> Is it possible to have a shard key with a hashed part?
> Something like this: {projectId: 1, hashed(url): 1}?


Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Asya Kamsky 4/15/14 2:49 PM
That's an interesting idea.

The only negative performance implication is that you (a) are no
longer using full shard key in queries and (b) unless you can
guarantee no hash collisions, you would still not be able to guarantee
uniqueness...

Asya
Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Tim Hawkins 4/15/14 7:51 PM
The only possible collision would occur if you generated two documents with the same hash at exactly the same time. The probability is extremely low. The only way I could see an issue would be a bulk import of records presorted by url hash that had adjacent runs of identical hashes; only then would you realistically get two identical hashes with the same timestamp.
Re: [mongodb-user] Re: Mongodb: choosing proper shard key and chunk size Leon Pajak 4/22/14 12:35 PM
Thank you. I will try the solution of creating a new field, url_hash, where I put md5(url).
I will not use the timestamp, because I need uniqueness on the (projectId, md5(url)) pair.
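
A minimal sketch of computing such a url_hash field in Node.js follows. Only projectId, url and url_hash come from the thread; the helper names `urlHash` and `withUrlHash` and the sample document are assumptions for illustration.

```javascript
const crypto = require("crypto");

// Fixed-length (32 hex characters) MD5 digest of the url.
function urlHash(url) {
  return crypto.createHash("md5").update(url, "utf8").digest("hex");
}

// Hypothetical helper: return a copy of the document with url_hash added.
function withUrlHash(doc) {
  return Object.assign({}, doc, { url_hash: urlHash(doc.url) });
}

// Sample document is made up for illustration.
const post = withUrlHash({ projectId: 7, url: "http://example.com/a" });
console.log(post.url_hash.length);
```

Because the digest has a fixed length, a key on (projectId, url_hash) stays small no matter how long the url is. Two different URLs could in principle produce the same digest, which is the collision caveat raised earlier in the thread.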