Re: Shard Key for Time Series Data

Ben Mackey

Aug 27, 2012, 1:39:35 AM

to mongod...@googlegroups.com

That is true, but I've done the maths and think I should have enough room in day_of_year without it filling up.

Ideally I would prefer to just use { stream_id_md5, sample_time } so I don't need to store another field, but I'm not sure what the implications of using a monotonically increasing key will be. From what I can tell, any chunk that is split on time will be half full (as I am rarely inserting historical data).
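
To make the half-full concern concrete, here is a toy sketch in Python. It is not MongoDB's actual split logic: it assumes a chunk holds a fixed number of documents (`capacity`, a made-up parameter) and is split at its midpoint once it overflows, with keys arriving in strictly increasing order.

```python
def simulate_splits(n_inserts, capacity):
    """Insert keys 0..n_inserts-1 in increasing order and split a chunk at
    its midpoint whenever it grows past `capacity`. Returns chunk sizes."""
    chunks = [[]]
    for key in range(n_inserts):
        top = chunks[-1]      # increasing keys always land in the last chunk
        top.append(key)
        if len(top) > capacity:
            mid = len(top) // 2
            # The lower half never receives another insert after this point.
            chunks[-1:] = [top[:mid], top[mid:]]
    return [len(c) for c in chunks]


sizes = simulate_splits(1000, 100)
print(sizes)
```

Under these assumptions every chunk except the last stays at half of `capacity`, because only the top chunk ever receives new keys.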

I would love to see a flag in MongoDB, set when sharding a collection, that tells it the key is monotonically increasing so that it creates a new chunk instead of splitting the old one.

On Friday, 24 August 2012 08:23:49 UTC+10, Samuel García Martínez wrote:

As explained in http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-ChunkSizeConsiderations chunks have a maximum size. Because the shard key is used to create key ranges, a discrete value like "day_of_year" may force you to set a larger chunk size. It is very hard to exceed 64 MB per millisecond per stream (date), but a stream may well exceed 64 MB per day_of_year.
Also, if you plan to run your app for several years, partitioning data by day_of_year becomes harder.
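
A rough back-of-the-envelope check of both points, sketched in Python. The 90-byte document size is an assumption (an estimate for the example document in the original question, not a measured value), and the sample rates are the two mentioned there: 100 Hz and one sample per hour.

```python
from datetime import date

DOC_BYTES = 90                      # assumption: approximate size of one sample document
CHUNK_BYTES = 64 * 1024 * 1024      # the 64 MB chunk size discussed above
SECONDS_PER_DAY = 86_400


def bytes_per_stream_day(samples_per_day, doc_bytes=DOC_BYTES):
    """Data that shares one { stream_id_md5, day_of_year } key value."""
    return samples_per_day * doc_bytes


fast = bytes_per_stream_day(100 * SECONDS_PER_DAY)   # 100 Hz stream
slow = bytes_per_stream_day(24)                      # hourly stream
print(fast, fast > CHUNK_BYTES)
print(slow, slow > CHUNK_BYTES)

# day_of_year repeats every year, so samples a year apart share a key value.
day_2012 = date(2012, 1, 15).timetuple().tm_yday
day_2013 = date(2013, 1, 15).timetuple().tm_yday
print(day_2012, day_2013)
```

With these numbers the 100 Hz stream alone puts far more than one chunk's worth of data under a single key value, and such a range cannot be split further. The hourly stream is nowhere near the limit.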

On Wednesday, August 22, 2012 5:14:13 AM UTC+2, Ben Mackey wrote:
Yet another shard key question - my apologies and thanks in advance.

My application is simple - I have thousands of timeseries data streams coming into my system. Sampling rates differ between streams - one stream might have a sample every hour, another might be sampling at 100Hz.

On the read side, the queries will mostly be based on stream_id and a timestamp range - both known in advance. I am under the impression that as long as my shard key is selective, it would be best to distribute the read load across multiple mongods to get the benefits of map/reduce and/or the aggregation framework (I intend to use these in some of my queries).

The stream name is a string, which I intend to hash and use as the first field of the shard key.

An example document is:
{
	"_id" : ObjectId("5031de4a003f4e731a684961"),
	"stream_id" : BinData(5,"LAMgL6/eQyklP3kCc5xEiw=="),
	"sample_time" : ISODate("2012-08-01T04:09:56Z"),
	"value" : 32.159706115722656
}
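A minimal sketch, in Python, of how such a document could be built from a stream name. It assumes that stream_id holds the raw 16-byte MD5 digest of the name (the same value the shard key options call stream_id_md5). The stream name "building-7/temperature" and the helper make_sample are hypothetical, and the document is a plain dict rather than a driver call.

```python
import base64
import hashlib
from datetime import datetime, timezone


def make_sample(stream_name, sample_time, value):
    """Build a sample document keyed by the MD5 digest of the stream name."""
    digest = hashlib.md5(stream_name.encode("utf-8")).digest()  # 16 raw bytes
    return {"stream_id": digest, "sample_time": sample_time, "value": value}


doc = make_sample(
    "building-7/temperature",  # hypothetical stream name
    datetime(2012, 8, 1, 4, 9, 56, tzinfo=timezone.utc),
    32.159706115722656,
)
# Base64 form of the digest, comparable in shape to the BinData payload above.
encoded = base64.b64encode(doc["stream_id"]).decode("ascii")
print(encoded)
```
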
My two leading options for the shard key are:
{ stream_id_md5, day_of_year }
{ stream_id_md5, sample_time }
I'm unsure if it is better to use a discrete value as my second shard key, or time. From what I have read online, and from talking to 10gen employees, using the time is a bad idea as it is monotonically increasing. That means that if a chunk starts to fill up, and it decides to split based on the time, the chunk will be split in half. Because I will not be adding values in the past, that chunk will sit half full in the database. Because of this, I'm leaning towards using day_of_year (which I would have to store in the document - therefore increasing the document size).
Any suggestions on choosing a discrete value vs an increasing time value as part of the shard key?

William Z

Aug 27, 2012, 7:00:16 PM

to mongod...@googlegroups.com

Hi Ben!

In general, choosing a shard key is hard. The reason it's hard is that a "perfect" shard key would satisfy three mutually exclusive goals:
    - Writes should be distributed evenly over the shards
    - Queries of individual documents should be distributed evenly over the shards
    - Range queries and sorts should be efficient, which means that elements in a sequence should all be on the *same* shard.

Here are some good links with useful information on sharding in MongoDB:

- http://www.mongodb.org/display/DOCS/Sharding+Introduction
- http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
- http://www.kchodorow.com/blog/2011/01/04/how-to-choose-a-shard-key-the-card-game/
- http://www.10gen.com/presentations/MongoNYC-2012/sharding-overview
- http://www.10gen.com/presentations/MongoNYC-2012/Sharding-Best-Practices-Advanced

With all of that said, the point in avoiding a monotonically increasing shard key is to prevent a single shard from becoming a "hot" shard -- the one where all of the writes are going. Given that you're splitting up your writes using the stream ID, there's no issue with using a monotonically increasing value as your secondary key. On the contrary, having a highly granular secondary key will guarantee that you'll always be able to split the chunks, which is a good thing.
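
As a rough illustration of why the hashed stream ID keeps writes from piling onto one shard, here is a small Python sketch. It is only a model: it assumes the hash space is cut into equal ranges with one range per shard, whereas a real cluster balances many chunks. The shard count and stream names are made up.

```python
import hashlib
from collections import Counter

N_SHARDS = 4  # hypothetical cluster size


def shard_for(stream_name):
    """Map a stream to a shard by the range its MD5 digest falls into."""
    first_byte = hashlib.md5(stream_name.encode("utf-8")).digest()[0]
    return first_byte * N_SHARDS // 256


# One write per stream at the same instant: the timestamp is identical,
# so only the leading hash field decides where each write lands.
streams = [f"stream-{i}" for i in range(1000)]
writes_per_shard = Counter(shard_for(name) for name in streams)
print(sorted(writes_per_shard.items()))
```

Every write carries the newest timestamp, yet the writes spread across all the ranges because the first field of the key is the hash. With sample_time alone as the key, all of them would fall into the single highest range.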

Given what you've said, I'd go with the streamID and the sample time as your shard key.

Let me know if you have further questions.

-William

Ben Mackey

Aug 27, 2012, 7:53:04 PM

to mongod...@googlegroups.com

Thanks William, that is what I needed to hear. I hope that this helps other people looking at storing sampled / time series data in MongoDB and wondering what shard key to use.
