ranged sharding vs hashed sharding

sjmi...@gmail.com

Oct 24, 2016, 7:51:34 AM
to mongodb-user
Hello All,
I am designing the sharding strategy for one of my collections.
Each object in that collection has a timestamp field, which I think is the best candidate for my shard key.

I would like to know which kind of sharding I should go for:
1. Ranged Sharding https://docs.mongodb.com/manual/core/ranged-sharding
or
2. Hashed Sharding https://docs.mongodb.com/manual/core/hashed-sharding

According to those pages, a key like a timestamp is ideal for hashed sharding.
However, almost all of my queries will involve a range on the timestamp field.

So I am slightly confused: would hashed sharding or ranged sharding be best for my case?

I will also be using the MongoDB Spark connector to pull data into Spark, with the timestamp field as the partition splitting key, using a class like
MongoTimePartioner extends MongoPartitioner
In the implementation I would create partition boundaries by dividing the key into several min-max ranges, so that the database can be queried in parallel.
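
The boundary arithmetic described here can be sketched on its own, independent of the Spark connector. The following is only an illustration in plain JavaScript: the function name `splitTimeRange` and the numeric timestamps are assumptions, and a real partitioner would be written against the connector's own partitioner interface.

```javascript
// Split the half-open interval [start, end) into `parts` contiguous ranges
// and return one MongoDB query filter per range.
// Hypothetical helper: the name and the numeric `ts` values are assumptions.
function splitTimeRange(start, end, parts) {
  if (!Number.isInteger(parts) || parts < 1) {
    throw new RangeError("parts must be a positive integer");
  }
  if (end <= start) {
    throw new RangeError("end must be greater than start");
  }
  const span = end - start;
  const filters = [];
  for (let i = 0; i < parts; i++) {
    const lo = start + Math.floor((span * i) / parts);
    const hi = start + Math.floor((span * (i + 1)) / parts);
    // Skip empty ranges, which can occur when parts > span.
    if (hi > lo) filters.push({ ts: { $gte: lo, $lt: hi } });
  }
  return filters;
}

// Four filters that together cover [1000, 2000) with no gaps or overlaps.
console.log(JSON.stringify(splitTimeRange(1000, 2000, 4)));
```

Each filter can then be handed to a separate worker. Because the ranges are half-open and contiguous, no document is read twice or skipped.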

Keeping this in mind as well, which sharding strategy would be the best choice?

Thanks
Sachin

Kevin Adistambha

Nov 2, 2016, 11:33:34 PM
to mongodb-user

Hi Sachin

> I am designing the sharding strategy for one of my collection.
> Each object in that collection has a timestamp field which I find best to use as my shard key.

Could you post an example document? Do you plan to use only the timestamp as the shard key, or are you using the timestamp as part of a compound shard key?

Shard key selection is an important step in your schema design, since once created, the shard key is immutable. That is, if later on you discover that the shard key is not the best, you would have to dump the collection, recreate the collection with a different shard key, and re-import all the data back in.

Using a monotonically increasing shard key by itself (i.e. not using a compound key where the timestamp is one element of the key) will artificially limit your insert rate. This is because every insert lands on the chunk whose upper bound is MaxKey. Since there can only be one such chunk, and a single chunk can only be located on one shard, your insert rate is practically limited to the capacity of a single machine, no matter how many shards you have. Also, you cannot change this in the future unless you dump/recreate/restore the collection.
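
To make the MaxKey effect concrete, here is a small simulation in plain JavaScript. It is only a sketch: the chunk layout, shard names, and integer timestamps are made up, and `Infinity` stands in for MaxKey. A real cluster splits and migrates chunks over time, but new inserts still go to whichever chunk currently has the MaxKey upper bound.

```javascript
// Hypothetical chunk layout for a collection range-sharded on { ts: 1 }.
// Each chunk covers [min, max); Infinity stands in for MaxKey here.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 2000, shard: "shardB" },
  { min: 2000, max: Infinity, shard: "shardC" }, // the "MaxKey" chunk
];

// Find the shard that owns the chunk containing the given timestamp.
function routeRanged(ts) {
  return chunks.find((c) => ts >= c.min && ts < c.max).shard;
}

// Simulate 500 inserts whose timestamps keep increasing past the last split point.
const insertsPerShard = {};
for (let ts = 2000; ts < 2500; ts++) {
  const shard = routeRanged(ts);
  insertsPerShard[shard] = (insertsPerShard[shard] || 0) + 1;
}

// All 500 inserts are counted against shardC; the other shards receive none.
console.log(insertsPerShard);
```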

For more information regarding shard key selection, please see:

> Also almost all of my queries will involve timestamp field range.
> So here I am slightly confused, would hashed sharding be best for my case or ranged sharding.

If the shard key only contains a single monotonically increasing value (e.g. ObjectId, timestamp), then using a hashed shard key is the recommended approach, since inserts will be more spread out across all the shards. The tradeoff is that a range query on the timestamp cannot be targeted to specific shards when the key is hashed: consecutive timestamps hash to unrelated values, so the query has to be sent to all shards. If you require a range query based on timestamp, then it is recommended to use a compound key, in which the timestamp forms part of the key.
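
The following sketch shows how a hashed key scatters consecutive timestamps, and why a query over a timestamp range then touches many shards. It rests on several assumptions: `toyHash` is a simple stand-in and is not the hash function MongoDB uses, the four shard names are invented, and the hash space is assumed to be cut into four equal chunks.

```javascript
const SHARDS = ["shardA", "shardB", "shardC", "shardD"];

// Stand-in 32-bit hash (multiplicative hashing). This is NOT MongoDB's hash
// function; it only mimics "nearby inputs, scattered outputs".
function toyHash(ts) {
  return Math.imul(ts, 0x9e3779b1) >>> 0;
}

// Assume the 32-bit hash space is cut into four equal chunks, one per shard.
function routeHashed(ts) {
  return SHARDS[Math.floor(toyHash(ts) / 2 ** 30)];
}

// Shards that a query for ts in [start, end) would have to contact.
function shardsForRange(start, end) {
  const touched = new Set();
  for (let ts = start; ts < end; ts++) touched.add(routeHashed(ts));
  return [...touched].sort();
}

// 100 consecutive (hypothetical) second-resolution timestamps.
console.log(shardsForRange(1477300000, 1477300100)); // all four shards are contacted
```

The same scattering that balances inserts is what prevents a timestamp range from being served by one or two shards.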

I would suggest you test the shard key in a test environment using the expected workload, and check whether using a monotonically increasing shard key is acceptable to your use case.

Best regards,
Kevin

Sachin Mittal

Nov 3, 2016, 10:16:48 AM
to mongod...@googlegroups.com
Hi,
An example document looks like this:
{
    ts: <timestamp>,
    un: "user name",
    pn: "page name",
    ss: "session",
    ......
}

ts is monotonically increasing and never modified.
ts would also be used in range queries, both on its own and in combination with other fields.

Some example queries:

db.collection.find({
    ts: {$gte: start, $lt: end}
})

or

db.collection.find({
    ts: {$gte: start, $lt: end},
    pn: {$in: [pn1, pn2, pn3 ....]}
})

or

db.collection.find({
    ts: {$gte: start, $lt: end},
    un: username
})

or

db.collection.find({
    ts: {$gte: start, $lt: end},
    ss: sessionId
})

My understanding is that simply using {ts: 1} as the shard key would help my queries but would not help inserts.
Since I am also inserting 1000+ documents per minute, I need the inserts to be well distributed too.
Hence hash(ts) is the better option for inserts, but then my range queries will not be fast, which is something I need as well.

So what I understand from your suggestion is that it is better to use a compound key as my shard key.

Would a key like {ts: 1, pn: 1, un: 1, ss: 1} be the way to go?
These attributes are the ones mostly used together with ts when querying the data.
If so, please also let me know whether the other attributes (pn, un, ss) need to be present in all documents.
They can be missing in some documents. Would that cause any issues if they are part of the compound shard key?


So yes, initially I was planning to use just ts as the shard key, but it looks like this is not an option here.

Please suggest the best way to select a shard key here.

Thanks
Sachin




Kevin Adistambha

Nov 3, 2016, 10:49:56 PM
to mongodb-user

Hi Sachin

> What I understand that simply using {ts: 1} as shard key would help in my queries but won’t help in insert.

This is correct.

> So what I understand from your suggestion it is better to use compound key has my shard key.

This is also correct, although the wording I prefer is: if you need a monotonically increasing field in the shard key, use a compound shard key, so that the key does not consist solely of a single monotonically increasing field.

> So would a key like {ts:1, pn: 1, un: 1, ss: 1} would be the way to go about it?

That is one possible shard key that could satisfy the queries you posted. However, I would recommend that you create a test deployment and extensively check the explain() output of more example queries that you will use in production before committing to any shard key selection, particularly if you need to sort the results.

What you want to avoid seeing in the explain() output is stages with "COLLSCAN" (which means a collection scan, i.e. MongoDB is forced to examine every document in the collection), and "SORT_KEY_GENERATOR" (which means an in-memory sorting stage, which is limited to 32 MB). See Use Indexes to Sort Query Results for more information.
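
Checking many explain() outputs by eye gets tedious, so a small helper that walks the plan tree can help. This is a sketch in plain JavaScript under stated assumptions: `findFlaggedStages` is a hypothetical helper, and `samplePlan` is a hand-written object shaped like a `queryPlanner.winningPlan`, not real server output.

```javascript
// Stage names to look for, taken from the advice above.
const FLAGGED_STAGES = ["COLLSCAN", "SORT_KEY_GENERATOR"];

// Recursively collect flagged stage names from a plan node.
function findFlaggedStages(node, found = []) {
  if (!node || typeof node !== "object") return found;
  if (FLAGGED_STAGES.includes(node.stage)) found.push(node.stage);
  if (node.inputStage) findFlaggedStages(node.inputStage, found);
  for (const child of node.inputStages || []) findFlaggedStages(child, found);
  // On a sharded cluster, per-shard plans are nested under `shards`.
  for (const shard of node.shards || []) findFlaggedStages(shard.winningPlan, found);
  return found;
}

// Hand-written sample for illustration only; not real explain() output.
const samplePlan = {
  stage: "SORT",
  inputStage: {
    stage: "SORT_KEY_GENERATOR",
    inputStage: { stage: "COLLSCAN" },
  },
};

console.log(findFlaggedStages(samplePlan)); // [ 'SORT_KEY_GENERATOR', 'COLLSCAN' ]
```

An empty result for a given query is a good sign, though it does not replace reading the full explain() output for that query.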

> What if they are missing (as they can be) in some documents. So when they are missing and still used as part of compounded shard key would they cause any issues?

If you include a field as part of the shard key, that field cannot be missing from any document. This is because the entire shard key is used to decide which chunk, and therefore which shard, a document belongs to. If any of the fields are missing, MongoDB cannot determine this.
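
Before committing to a compound shard key, it may be worth counting how many existing documents lack one of its fields. The sketch below builds such a filter in plain JavaScript; `missingShardKeyFilter` is a hypothetical helper, and the candidate key is the one proposed earlier in this thread.

```javascript
// Build a query filter that matches documents lacking at least one
// of the shard key's fields. Hypothetical helper, for illustration only.
function missingShardKeyFilter(shardKey) {
  return {
    $or: Object.keys(shardKey).map((field) => ({ [field]: { $exists: false } })),
  };
}

const candidateKey = { ts: 1, pn: 1, un: 1, ss: 1 };
console.log(JSON.stringify(missingShardKeyFilter(candidateKey)));
// {"$or":[{"ts":{"$exists":false}},{"pn":{"$exists":false}},{"un":{"$exists":false}},{"ss":{"$exists":false}}]}
```

Passing the resulting filter to db.collection.count() in the shell would give the number of documents that need a value filled in (or a different key chosen) before sharding.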

Best regards,
Kevin
