{"user":1, "timestamp":1}
{"user":1, "timestamp":1, "_id":1}
"test_idx" : {"user":1, "timestamp":1, "_id":1, "external_reference" : 1}
db.getCollection('test').find({'user':1, 'external_reference' : 'abcd', 'timestamp' : {'$gte' : new ISODate('2016-12-22T21:00:00Z')}})
Hi Franck
We work on time series data and are thinking about sharding strategies.
Are you storing each event in a single document? Typically, time series data requires aggregation on some timestamp for reporting purposes. However, you are correct that the shard key should be designed with avoiding a “hot shard/hot spot” in mind.
So we should add a third field to our compound shard key, e.g. the event _id (let’s presume that the _id is not monotonic)
{"user":1, "timestamp":1, "_id":1}
This depends on your intended use case, and also on the form of your _id field. However, it’s usually best not to include a random value in the shard key if you plan to perform range queries.
But what if we need other indexes? It seems that the shard key must be a prefix of the other indexes.
The shard key governs which chunk a document is stored in (and, in turn, which shard). You are correct that for the best query efficiency, the shard key should be a prefix of your other indexes.
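To make the prefix relationship concrete, here is a minimal sketch in plain JavaScript. It assumes the shard key is {"user":1, "timestamp":1} and compares it with the index definitions from this thread. Note that isKeyPrefix is a hypothetical helper written for illustration, not a MongoDB API, and otherIdx is a made-up index.

```javascript
// Hypothetical helper (not part of MongoDB): returns true when the fields of
// `shardKey` are the leading fields of `indexKey`, in the same order and
// with the same direction.
function isKeyPrefix(shardKey, indexKey) {
  const shardFields = Object.entries(shardKey);
  const indexFields = Object.entries(indexKey);
  if (shardFields.length > indexFields.length) return false;
  return shardFields.every(
    ([field, direction], i) =>
      indexFields[i][0] === field && indexFields[i][1] === direction
  );
}

// Assumed shard key, plus the "test_idx" definition quoted in this thread.
const shardKey = { user: 1, timestamp: 1 };
const testIdx = { user: 1, timestamp: 1, _id: 1, external_reference: 1 };
// Made-up index that does not start with the shard key.
const otherIdx = { external_reference: 1, user: 1 };

console.log(isKeyPrefix(shardKey, testIdx));  // true
console.log(isKeyPrefix(shardKey, otherIdx)); // false
```

An index such as test_idx starts with the shard key fields, so it can also serve as the index that MongoDB requires a sharded collection to have on its shard key.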
It may be worthwhile to examine the blog post regarding storing time series data: Schema Design for Time Series Data in MongoDB. Also, the presentation MongoDB for Time Series Data Part 3: Sharding may be of interest to you.
Best regards,
Kevin
Hi Franck
At this point, I figure that the data is organized in chunks distributed across shards, each chunk covering specific timestamp and “some_criterion” ranges
Correct.
If I query this collection with a single criterion targeting the “timestamp” field, will MongoDB be able to exploit the sharding (and so scan only the chunks for the targeted timestamp period, across all “some_criterion” values), or will it do a full scan of every chunk?
One of the main concerns when choosing a shard key is to achieve a read/write distribution balance that is optimal for your use case. For example, using a {_id:"hashed"} shard key may achieve the best write distribution, but unless your queries can target a specific _id, your queries will likely be scatter-gather. On the other end of the spectrum, using a {timestamp:1} shard key may result in poor write distribution, but queries for a certain timestamp (or a limited range of timestamps) will likely target a single chunk. I’m not certain what you mean by “full scan”. If your query includes the shard key (or a prefix of a compound shard key), mongos can route it to only the shards that own the matching chunks rather than broadcasting it to every shard.
For more details regarding this subject, please see Query isolation
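As a rough illustration of targeted versus scatter-gather queries, here is a toy model in plain JavaScript. It is only a sketch under stated assumptions: the chunk boundaries, the shard names, and the targetShards function are all made up, timestamps are simplified to numbers, and the real routing logic in mongos is more involved.

```javascript
// Toy chunk table for a range shard key {timestamp: 1}.
// Boundaries and shard names are invented for illustration.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 2000, shard: "shardB" },
  { min: 2000, max: Infinity, shard: "shardC" },
];

// Returns the shards whose chunks overlap the half-open range [from, to)
// on the shard key. Passing null models a query that omits the shard key,
// in which case every shard has to be asked (scatter-gather).
function targetShards(chunks, range) {
  const hit = range
    ? chunks.filter((c) => c.min < range.to && range.from < c.max)
    : chunks;
  return [...new Set(hit.map((c) => c.shard))];
}

console.log(targetShards(chunks, { from: 1200, to: 1300 })); // [ 'shardB' ]
console.log(targetShards(chunks, { from: 900, to: 1100 }));  // [ 'shardA', 'shardB' ]
console.log(targetShards(chunks, null)); // [ 'shardA', 'shardB', 'shardC' ]
```

A narrow timestamp range touches one or two shards, while a query that carries no shard key information has to go to all of them.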
Best regards,
Kevin