Choose my shard key


Rui Goncalves

Jul 18, 2016, 12:29:52 PM
to mongodb-user

I need to implement sharding on a collection that has a sequential identifier number called “TID” on each document. Every day, millions of these documents are inserted.

The “TID” is used by some systems that look at this collection and query for TIDs in a particular range.

There are other systems that look at the collection and query on a typical user identification field called “Username”. There are several million users.

I would also like to be able to remove documents based on the date or month they were inserted, for disk space maintenance purposes. Let's call that field “Insert date” or “Insert Month”.

Also, there’s a field on the collection called “fingerprint” that’s used to ensure that there are no duplicate documents, so “fingerprint” needs to be unique on the shard.

Which field should be my shard key?

Rui Goncalves

Jul 20, 2016, 5:21:30 AM
to mongodb-user
Correction: the “fingerprint” needs to be unique on the collection.

sjmi...@gmail.com

Jul 22, 2016, 2:25:21 AM
to mongodb-user
In simple terms, the shard key should be a field that has many distinct values.
It should also be a field that appears in your queries, because that way a query is routed to the right partition rather than to all nodes.

Try a combination of TID, Username, and Insert Date.

Again, this is just my opinion; I am also learning this.
https://docs.mongodb.com/manual/core/sharding-shard-key/#choosing-a-shard-key
https://www.mongodb.com/blog/post/on-selecting-a-shard-key-for-mongodb

Rui Goncalves

Jul 22, 2016, 12:44:25 PM
to mongodb-user
Thank you

Senthilkumar Kamaraj

Jul 23, 2016, 5:12:04 PM
to mongodb-user
Did that answer your question? If not, to decide on your shard key, check with the application team which WHERE conditions they use to fetch records. Please do that analysis and then configure the shard key.

Pooja Gupta

Jul 25, 2016, 10:38:30 AM
to mongodb-user

Hi Rui,

Choosing a shard key is very use-case specific and requires many factors to be considered.

Some of them are:

  • The shard key should cater to the most frequent queries performed on the collection, so that those queries can be routed efficiently to the single shard that holds the data instead of being broadcast to every shard in the cluster.
  • It should have good cardinality/granularity. If you pick a shard key that is not granular enough, you might find that the chunks it produces cannot be split.
    [A compound shard key can also be considered, if needed, for granularity.]
  • A shard key with monotonically increasing values is more likely to direct all inserts to a single shard within the cluster, which could limit insert performance. Ideally, your shard key should allow insert operations to be spread out among the shards.
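
To make the last point concrete, here is a minimal sketch of range-based placement on a monotonically increasing key. It is only an illustration: the split points, the starting TID, and the function names are made up, and it simplifies a real cluster to one chunk range per shard.

```javascript
// Hypothetical chunk boundaries on TID, built from data that already exists.
// Simplification: one chunk range per shard, so index 0..3 stands for a shard.
const splitPoints = [250000, 500000, 750000];

// Return the index of the range (shard) that a given TID falls into.
function rangeShard(tid) {
  let i = 0;
  while (i < splitPoints.length && tid >= splitPoints[i]) i++;
  return i;
}

// Count how many of n consecutive new TIDs land on each shard.
function countInserts(firstTid, n) {
  const counts = new Array(splitPoints.length + 1).fill(0);
  for (let tid = firstTid; tid < firstTid + n; tid++) counts[rangeShard(tid)]++;
  return counts;
}

// New TIDs are larger than every existing split point,
// so every insert goes to the last range.
const counts = countInserts(1000000, 10000);
console.log(counts); // [ 0, 0, 0, 10000 ]
```

In a real cluster the top chunk would eventually be split and migrated, but new inserts would keep arriving at whichever shard holds the highest range.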

> The “TID” is used by some systems that look at this collection and query for TIDs in a particular range.

This is a monotonically increasing field, which makes it a good candidate for a hashed shard key.
However, a hashed shard key requires that the shard key contain only that one field. E.g., you cannot do

{a:'hashed', b:1},

Or

{a:'hashed', b:'hashed'}

A hashed key also prevents efficient range queries, which matters here because TID is queried by range.
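
The trade-off can be sketched as follows. This is an illustration under stated assumptions, not MongoDB's actual hashing: the MD5-modulo function, the shard count of 4, and all function names are stand-ins chosen only to show the effect.

```javascript
const crypto = require('crypto');

const SHARDS = 4; // assumed cluster size for the illustration

// Stand-in hash placement (NOT MongoDB's real hashed-index function):
// take the first 4 bytes of the MD5 digest of the key, modulo the shard count.
function hashShard(tid) {
  const digest = crypto.createHash('md5').update(String(tid)).digest();
  return digest.readUInt32BE(0) % SHARDS;
}

// Count how many of n consecutive new TIDs land on each shard.
function insertCounts(firstTid, n) {
  const counts = new Array(SHARDS).fill(0);
  for (let tid = firstTid; tid < firstTid + n; tid++) counts[hashShard(tid)]++;
  return counts;
}

// List the shards that hold at least one TID in the range [lo, hi].
function shardsForRange(lo, hi) {
  const touched = new Set();
  for (let tid = lo; tid <= hi; tid++) touched.add(hashShard(tid));
  return [...touched].sort();
}

// Sequential inserts are spread roughly evenly across the shards...
console.log('inserts per shard:', insertCounts(1000000, 10000));
// ...but even a narrow TID range is scattered over (very likely) all of them.
console.log('shards touched by a 100-TID range:', shardsForRange(1000, 1099));
```

So hashing fixes the insert hot spot at the cost of turning each TID range query into a broadcast.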

> There are other systems that look at the collection and query on a field with a typical user identification called “Username”.

So an alternative could be a compound key of TID and Username. A compound shard key can mitigate the effect of a monotonically increasing field, most effectively when that field is not the leading one.
So there are two possibilities:

{Username:1, TID:1}, or
{TID:1, Username:1}

In the case of {Username:1, TID:1} you could query by:

  • Username
  • Username, TID

and still use the same index.

In the case of {TID:1, Username:1} you could query by:

  • TID
  • TID, Username

and still use the same index.
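
The prefix rule behind these two lists can be sketched as a small helper. The function name is hypothetical and the check is simplified: it only looks at which fields a query filters on, not at whether each condition is an equality or a range.

```javascript
// Return the leading part of a compound key that a query can use.
// keyFields:   fields of the compound key, in order, e.g. ['Username', 'TID']
// queryFields: fields the query filters on, in any order
// An empty result means the query cannot use the key at all
// (on a sharded collection, it would be sent to every shard).
function usablePrefix(keyFields, queryFields) {
  const prefix = [];
  for (const field of keyFields) {
    if (!queryFields.includes(field)) break; // stop at the first missing field
    prefix.push(field);
  }
  return prefix;
}

console.log(usablePrefix(['Username', 'TID'], ['Username']));        // [ 'Username' ]
console.log(usablePrefix(['Username', 'TID'], ['TID', 'Username'])); // [ 'Username', 'TID' ]
console.log(usablePrefix(['Username', 'TID'], ['TID']));             // []
console.log(usablePrefix(['TID', 'Username'], ['TID']));             // [ 'TID' ]
```

The third call shows the cost of each ordering: with {Username:1, TID:1}, the systems that query by TID range alone get no help from the key, and with {TID:1, Username:1} the same is true for queries by Username alone.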

> I would also like to be able to remove documents based on the date or month they were inserted, for disk space maintenance purposes. Let's call that field “Insert date” or “Insert Month”.

If by “eliminate” you mean “delete”, then this can be handled effectively with TTL indexes. TTL indexes are special indexes that remove documents from a collection after a certain amount of time. So you can simply create a TTL index on the date field of your choice. Each individual shard will take care of deleting its own expired documents.
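
As a rough sketch, a TTL index is an ordinary single-field index created with the expireAfterSeconds option. The collection name "events", the field name "insertDate", and the 90-day retention period below are all assumptions for illustration; the indexed field must hold BSON Date values for expiry to work.

```javascript
// Hypothetical names: collection "events", date field "insertDate".
// The 90-day retention period is an assumption for illustration.
const RETENTION_DAYS = 90;

// Index specification: ascending index on the insertion date.
const ttlIndexKeys = { insertDate: 1 };

// TTL option: documents expire this many seconds after their insertDate value.
const ttlIndexOptions = { expireAfterSeconds: RETENTION_DAYS * 24 * 60 * 60 };

// In the mongo shell the index would be created with:
//   db.events.createIndex(ttlIndexKeys, ttlIndexOptions)
console.log(ttlIndexKeys, ttlIndexOptions);
```

Expiry is measured from each document's own date value, so documents age out continuously rather than in whole-month batches.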

> Also, there’s a field on the collection called “fingerprint” that’s used to ensure that there are no duplicate documents, so “fingerprint” needs to be unique on the collection.

Could you please elaborate on your purpose for sharding and on the content of the fingerprint field?

You cannot create a unique index on a field in a sharded collection unless that field is part of the shard key.
Furthermore, uniqueness is enforced on the entire key combination and not on individual components of the shard key. Hence, with the shard keys discussed above, uniqueness of the fingerprint field can be enforced only at the application level.
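
The constraint can be expressed as a simple check: a unique index is accepted on a sharded collection only when the shard key fields form the leading prefix of the index fields. The helper below is a hypothetical illustration of that rule, not a MongoDB API, and it ignores the special case of the default _id index.

```javascript
// Return true when a unique index on indexFields would be allowed for a
// collection sharded on shardKeyFields, i.e. when the shard key is a
// leading prefix of the index. (Hypothetical helper; ignores the _id index.)
function uniqueIndexAllowed(shardKeyFields, indexFields) {
  return shardKeyFields.every((field, i) => indexFields[i] === field);
}

// Unique index on fingerprint alone: rejected for both compound keys.
console.log(uniqueIndexAllowed(['Username', 'TID'], ['fingerprint'])); // false
console.log(uniqueIndexAllowed(['TID', 'Username'], ['fingerprint'])); // false

// Allowed, but it only guarantees that the whole
// (Username, TID, fingerprint) combination is unique.
console.log(uniqueIndexAllowed(['Username', 'TID'], ['Username', 'TID', 'fingerprint'])); // true
```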

However, you may need to try different permutations of the discussed fields for your shard key. Since a good shard key is very use-case specific, only your own testing can determine which order is best for you. Please note that the shard key is immutable once created.

Regards,

Pooja
