Auto-sharding by date

octave

Sep 7, 2010, 6:02:06 AM
to mongodb-user
Hi there,

I have the following collection:
> db.Entry.stats()
{
  "ns" : "eventr.Entry",
  "count" : 8231672,
  "size" : 25494603004,
  "avgObjSize" : 3097.1354305662326,
  "storageSize" : 26905795328,
  "numExtents" : 41,
  "nindexes" : 6,
  "lastExtentSize" : 1926517248,
  "paddingFactor" : 1.419999999953223,
  "flags" : 0,
  "totalIndexSize" : 2887464576,
  "indexSizes" : {
    "_id_" : 366306240,
    "stream_1_extra.uid_1" : 923608000,
    "stream_1_publishedAt_-1" : 472892352,
    "publishedAt_-1_stream_1" : 500884416,
    "createdAt_1_stream_1" : 396231616,
    "prevStream_1" : 227541952
  },
  "ok" : 1
}

But our application mostly needs only the recent data from it (the last
~1,000,000 documents, sorted by publishedAt descending).

Question:
Is it possible to shard this collection automatically by date?
For example, into two collections, "recent" and "archive", with items
moved automatically from one to the other.

Thanks in advance.

Kyle Banker

Sep 7, 2010, 7:38:49 AM
to mongod...@googlegroups.com
Sharding is designed to partition a single collection evenly across multiple machines. Is that what you're looking to do?

Are you experiencing performance problems with the current configuration? If your app only needs the most recent million or so documents, that data should be staying hot in memory.
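
As a rough check on the "hot in memory" point, the stats output posted above can be turned into a back-of-the-envelope working-set estimate. This is only a sketch: `estimateWorkingSetBytes` is a hypothetical helper, "recent" is assumed to mean the newest 1,000,000 documents, and the index share is assumed to scale with the share of recent documents, which will not hold for every index.

```javascript
// Figures copied from the db.Entry.stats() output earlier in the thread.
const stats = {
  count: 8231672,
  avgObjSize: 3097.1354305662326, // bytes per document
  totalIndexSize: 2887464576,     // bytes, all six indexes
};

// Assumption: "recent" means the newest 1,000,000 documents.
const recentDocs = 1000000;

// Hypothetical helper: estimates how much RAM the recent data would need.
function estimateWorkingSetBytes(stats, recentDocs) {
  const dataBytes = recentDocs * stats.avgObjSize;
  // Crude assumption: the hot part of the indexes is proportional
  // to the fraction of documents that count as recent.
  const indexBytes = stats.totalIndexSize * (recentDocs / stats.count);
  return { dataBytes, indexBytes, totalBytes: dataBytes + indexBytes };
}

const est = estimateWorkingSetBytes(stats, recentDocs);
const toGiB = (bytes) => (bytes / 2 ** 30).toFixed(2);
console.log(
  `data ~${toGiB(est.dataBytes)} GiB, ` +
  `indexes ~${toGiB(est.indexBytes)} GiB, ` +
  `total ~${toGiB(est.totalBytes)} GiB`
);
```

Under these assumptions the recent documents come to roughly 3 GB, so whether they stay hot depends on how much RAM the server has relative to that figure.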




Alvin Richards

Sep 7, 2010, 12:17:16 PM
to mongodb-user
If publishedAt is an ever increasing number (e.g. the current date),
then you will always be inserting on one shard, since the range of the
last date inserted to max value is going to reside on a single chunk
and therefore a single shard.
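
A toy model may make this concrete. The sketch below is not MongoDB code: the chunk boundaries, the shard names, and the `shardFor` helper are all made up for illustration, and integers stand in for publishedAt timestamps.

```javascript
// Toy model: each chunk is a [min, max) range of shard-key values
// owned by one shard. Boundaries and shard names are hypothetical.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 2000, shard: "shardB" },
  { min: 2000, max: Infinity, shard: "shardC" }, // open-ended top chunk
];

// Range-based routing: find the chunk whose range contains the key.
function shardFor(key) {
  return chunks.find((c) => key >= c.min && key < c.max).shard;
}

// Simulate 100 inserts whose shard key only ever increases,
// the way a "current date" value would.
const hits = {};
for (let publishedAt = 2000; publishedAt < 2100; publishedAt++) {
  const shard = shardFor(publishedAt);
  hits[shard] = (hits[shard] || 0) + 1;
}

console.log(hits); // { shardC: 100 }
```

Every insert is routed to the shard that owns the open-ended top chunk, while the other shards receive no writes.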

What's your motivation to shard by date?

-Alvin

Markus Gattol

Sep 7, 2010, 4:26:23 PM
to mongod...@googlegroups.com
Alvin> If publishedAt is an ever increasing number (e.g. the current
Alvin> date), then you will always be inserting on one shard, since the
Alvin> range of the last date inserted to max value is going to reside
Alvin> on a single chunk and therefore a single shard.

Sharding
Yes, but the config servers will notice and then split chunks and
rebalance. Actually, sharding by date is a very common thing. Have a
look at

- http://www.snailinaturtleneck.com/blog/2010/03/30/sharding-with-the-fishes
- http://www.markus-gattol.name/ws/mongodb.html#choose_a_shard_key


Capped Collections
Another thing you might consider (since you only need the latest data)
is a capped collection.
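
For reference, in the mongo shell a capped collection is created with db.createCollection(name, {capped: true, size: <bytes>}), optionally with a max document count. The sketch below is only a toy in-memory model of that behaviour, not the MongoDB API: the `CappedCollection` class and its `recent` method are hypothetical, and the cap is counted in documents for simplicity. Note that eviction follows insertion order, not publishedAt.

```javascript
// Toy in-memory model of a capped collection (not the MongoDB API).
class CappedCollection {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.docs = []; // kept in insertion ("natural") order
  }

  insert(doc) {
    this.docs.push(doc);
    // Once the cap is exceeded, the oldest inserted document is dropped.
    if (this.docs.length > this.maxDocs) this.docs.shift();
  }

  // Return up to n documents, newest insert first.
  recent(n) {
    return n > 0 ? this.docs.slice(-n).reverse() : [];
  }
}

const recentEntries = new CappedCollection(3);
for (let i = 1; i <= 5; i++) recentEntries.insert({ _id: i });

console.log(recentEntries.recent(3).map((d) => d._id)); // [ 5, 4, 3 ]
```

Documents 1 and 2 are gone after the fifth insert, so anything that must be kept as an archive would need to be written somewhere else as well.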

Alvin Richards

Sep 7, 2010, 4:48:16 PM
to mongodb-user
@Markus
A capped collection is a good suggestion.
The snailinaturtleneck post also points out that new documents are
always inserted into a single shard first, thus creating a hot
spot. Yes, the chunk of data will get redistributed to other shards,
but at insert time there is a single chunk from current_max to
max_value... that's the point I clearly did not make well!

-Alvin
