My application is simple - I have thousands of time series data streams coming into my system. The sampling rates differ between streams - one stream might have a sample every hour, while another might sample at 100Hz.
On the read side, the queries will mostly be based on stream_id and a timestamp range - both known in advance. I am under the impression that as long as my shard key is selective, it would be best to distribute the read load across multiple mongods to get the benefits of map/reduce and/or the aggregation framework (I intend to use these in some of my queries).
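To make the read pattern concrete, here is a minimal sketch of the filter such a query might use. It assumes the fields are named stream_id and sample_time and that stream_id holds the 16 hash bytes of the stream name; buildRangeFilter is a hypothetical helper, and how the bytes are wrapped for a particular driver is left out.

```javascript
// Hypothetical helper: builds the filter for one stream over the
// half-open time range [start, end).
function buildRangeFilter(streamId, start, end) {
  return {
    stream_id: streamId,
    sample_time: { $gte: start, $lt: end },
  };
}

// Example: all samples of one stream for a single UTC day.
// The bytes are decoded from base64 only for illustration; a real driver
// would need its own binary type with the matching subtype.
const filter = buildRangeFilter(
  Buffer.from("LAMgL6/eQyklP3kCc5xEiw==", "base64"),
  new Date("2012-08-01T00:00:00Z"),
  new Date("2012-08-02T00:00:00Z")
);
```

Both values in the filter are known before the query runs, which is the property I am hoping the shard key can take advantage of.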
The stream name is a string which I intend to hash and use as the first field of the shard key. A sample document looks like this:
{
"_id" : ObjectId("5031de4a003f4e731a684961"),
"stream_id" : BinData(5,"LAMgL6/eQyklP3kCc5xEiw=="),
"sample_time" : ISODate("2012-08-01T04:09:56Z"),
"value" : 32.159706115722656
}
My two leading options for the shard key are:
{ stream_id_md5, day_of_year }
{ stream_id_md5, sample_time }
I'm unsure whether it is better to use a discrete value or the time as the second field of my shard key. From what I have read online, and from talking to 10gen employees, using the time is a bad idea because it is monotonically increasing. That means that if a chunk starts to fill up and is split on the time, the chunk will be split in half. Because I will not be adding values in the past, the older half will sit half full in the database. Because of this, I'm leaning towards using day_of_year (which I would have to store in the document, therefore increasing the document size).
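For reference, this is roughly how I would derive the two candidate key values from a stream name and a sample time. It is only a sketch: the function names are made up for illustration, stream_id_md5 is assumed to be the raw MD5 of the stream name, and day_of_year is assumed to be computed in UTC.

```javascript
const crypto = require("crypto");

// MD5 of the stream name as 16 raw bytes
// (assumed to be what stream_id_md5 holds).
function streamIdMd5(streamName) {
  return crypto.createHash("md5").update(streamName, "utf8").digest();
}

// 1-based day of the year, computed in UTC so the result does not
// depend on the local time zone of the machine doing the insert.
function dayOfYear(date) {
  const year = date.getUTCFullYear();
  const startOfYear = Date.UTC(year, 0, 1);
  const startOfDay = Date.UTC(year, date.getUTCMonth(), date.getUTCDate());
  return (startOfDay - startOfYear) / 86400000 + 1;
}

// Option 1: { stream_id_md5, day_of_year }
function discreteKey(streamName, sampleTime) {
  return {
    stream_id_md5: streamIdMd5(streamName),
    day_of_year: dayOfYear(sampleTime),
  };
}

// Option 2: { stream_id_md5, sample_time }
function timeKey(streamName, sampleTime) {
  return {
    stream_id_md5: streamIdMd5(streamName),
    sample_time: sampleTime,
  };
}

const sampleTime = new Date("2012-08-01T04:09:56Z");
console.log(discreteKey("example-stream", sampleTime).day_of_year); // 214
```

Note that day_of_year computed this way repeats every year, so samples taken on the same calendar day in different years would share the same key value.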
Any suggestions on choosing a discrete value vs an increasing time value as part of the shard key?