Shard key and compound indexes

Franck Lefebure

unread,

Dec 22, 2016, 3:32:35 PM12/22/16

to mongodb-user, sylvai...@softbridge.fr

Hi,

I'm new with Mongo...

We work on time series datas and are thinking about sharding strategies.

Our first think is to shard our data with a compound index like this one :

{"user":1, "timestamp":1}

(user is a low-cardinality field)

In my understanding, with this shard key, we will have hot spots (heavy write on one shard) in write operations. as the writing target mainly one specific, and gliding, timestamp period.

So we should add a third field to our compound shard key, eg the event _id (let's presume that the _id is not monotonous)

{"user":1, "timestamp":1, "_id":1}

IMO, with this shard key, write hotspots should be avoided (?)

But whatabout if we need other indexes, it seems that shard index key must be a prefix of other indexes so we are able to create this index:

"test_idx" : {"user":1, "timestamp":1, "_id":1, "external_reference" : 1}

Am I ok with that ?

What about if we query the collection but with missing constraints for some shard key member eg :

db.getCollection('test').find({'user':1, 'external_reference' : 'abcd', 'timestamp' : {'$gte' : new ISODate('2016-12-22T21:00:00Z')}})

Will the query use the "test_idx" index in the best way ?

if no, what should be the alternatives ?

Thanks for your advices

Franck

Kevin Adistambha

unread,

Dec 29, 2016, 1:19:59 AM12/29/16

to mongodb-user, sylvai...@softbridge.fr

Hi Franck

We work on time series datas and are thinking about sharding strategies.

Are you storing each event in a single document? Typically, time series data would require aggregation based on some timestamp for reporting purposes. However, you are correct that shard key should be designed to avoid “hot shard/hot spot” in mind.

So we should add a third field to our compound shard key, eg the event _id (let’s presume that the _id is not monotonous)
{“user”:1, “timestamp”:1, “_id”:1}

This depends on your intended use case, and also the form of your _id field. However, it’s usually best not to include a random value in the shard key if you plan to perform some kind of range queries.

But whatabout if we need other indexes, it seems that shard index key must be a prefix of other indexes

The shard key would govern on which chunk a document will be stored (and in turn, in which shard). You are correct that for the best query efficiency, the shard key should be the prefix of other indexes.

It may be worthwhile to examine the blog post regarding storing time series data: Schema Design for Time Series Data in MongoDB. Also, the presentation MongoDB for Time Series Data Part 3: Sharding may be of interest to you.

Best regards,
Kevin

Franck Lefebure

unread,

Dec 31, 2016, 1:40:45 PM12/31/16

to mongodb-user, sylvai...@softbridge.fr

Hello Kevin,

I've have seen the posts you quoted.

Nevertheless, I have still doubts

- ... for a collection of time based serie datas. {"_id", "timestamp", "the_data", "some_criterion"}

- shard key contains the data timestamp {"timestamp" : 1}

- to have a better write distribution, add another field in the shard key to increase cardinality ==> {"timestamp" : 1, "some_criterion" : 1}

At this point, I figure that datas are organized in chunks distributed on shards, each chunk targeting specifics timestamp and "some_criterion" ranges

If I query this collection with a single criteriatargeting the "timestamp" field, will Mongo be able to exploit the sharding (and so scanning only chunks for the targeted timestamp period, and all "some_criterion" values) or will mongo do a full scan on every chunks ?

Franck

Kevin Adistambha

unread,

Jan 3, 2017, 12:05:09 AM1/3/17

to mongodb-user, sylvai...@softbridge.fr

Hi Franck

At this point, I figure that datas are organized in chunks distributed on shards, each chunk targeting specifics timestamp and “some_criterion” ranges

Correct.

If I query this collection with a single criteriatargeting the “timestamp” field, will Mongo be able to exploit the sharding (and so scanning only chunks for the targeted timestamp period, and all “some_criterion” values) or will mongo do a full scan on every chunks ?

One of the main concern of choosing a shard key is to achieve a read/write distribution balance that is optimal for your use case. For example, using a {_id:"hashed"} shard key may achieve the best write distribution, but unless your queries can target a specific _id, your queries will likely be a scatter-gather. On the other end of the spectrum, using {timestamp:1}shard key may result in poor write distribution, but queries for a certain timestamp (or a limited range of timestamps) will likely be targeting a single chunk. I’m not certain what you mean by “full scan”. If your query includes the shard key, MongoDB will not do a scan.