Collection partitioning

tracey eubanks

Oct 29, 2010, 5:26:28 PM
to mongod...@googlegroups.com
Is there a way to partition a collection sort of how MySQL handles partitioning on tables?

Thanks
Tracey Eubanks
tra...@pmamediagroup.com
Nova ESP (PMA Media Group)

Markus Gattol

Oct 29, 2010, 6:14:48 PM
to mongodb-user
that's basically sharding with MongoDB

Eliot Horowitz

Oct 29, 2010, 10:08:21 PM
to mongod...@googlegroups.com
Not really.
We're probably going to allow you to "hint" to the storage layer about
how to cluster data.

Andrew Rollins

Nov 12, 2010, 12:43:58 AM
to mongod...@googlegroups.com
I'm also interested in partitioning. Now that sharding is production ready, it's pretty much the last big thing I really want to see in MongoDB.

I really want to mimic MySQL partitioning so that I can easily and efficiently drop parts of a collection. For example, having timestamps in documents and partitions based on those timestamps would allow me to easily drop old documents. This is separate from sharding in that partitions don't dictate how documents are split among shards; instead, they dictate how the storage is split out within each shard.

I guess you can partition yourself by just using different collections, but it'd be so much nicer to have it done in the DB.
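
A minimal sketch of the do-it-yourself approach, assuming one collection per UTC day. The `events` prefix and the helper name `collectionNameFor` are hypothetical and are not MongoDB features; the helper only computes a name that the application would then pass to `db.getCollection(...)`.

```javascript
// Hypothetical helper: map a document's timestamp to a per-day collection
// name such as "events_20101112". Nothing here is a MongoDB API.
function pad2(n) {
  return (n < 10 ? "0" : "") + n;
}

function collectionNameFor(prefix, date) {
  // Use UTC so every client buckets the same instant into the same day.
  return prefix + "_" +
    date.getUTCFullYear() +
    pad2(date.getUTCMonth() + 1) +
    pad2(date.getUTCDate());
}

// In the mongo shell one might then write (assumption, not run here):
//   db.getCollection(collectionNameFor("events", doc.created_at)).insert(doc);
```

The trade-off is that every reader and writer has to agree on the naming scheme, which is exactly the bookkeeping a built-in partitioning feature would hide.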

- Andrew

Ted

Nov 12, 2010, 8:59:55 PM
to mongodb-user
Me too. This one is huge for me. I need an easy way to drop timestamp'ed
data without having to perform Mongo admin voodoo to reclaim space. The
lack of this feature may drive us back to a SQL solution.

I know creating separate databases could be a solution but can one query
across databases easily? My GUI developer is going to kill me if he has
to create separate queries for each day. And this becomes a bigger problem
if there are several months of data.

Eliot Horowitz

Nov 12, 2010, 10:44:49 PM
to mongod...@googlegroups.com
Not sure why you need this for that case?
If you delete objects, that space should be reclaimed normally.

Ted

Nov 13, 2010, 7:19:06 AM
to mongod...@googlegroups.com
Maybe I don't understand how that happens.

Does the DB "reuse" the space, or is there some kind of compaction going on?

Ted

Nov 13, 2010, 7:24:50 AM
to mongod...@googlegroups.com
I was a little too quick with my response...

Also, with deletes I need to spend I/O looking for the records to drop.  With partitioning there is
almost no work done on the DB side.  In Oracle, you drop the partitions and then
drop/delete the tablespace.  Very little I/O and very little CPU usage.  Even with
indexing on Mongo, the DB has to seek to each record and mark it.

Markus Gattol

Nov 13, 2010, 7:42:16 AM
to mongodb-user
Well, sharding actually is called horizontal partitioning:
http://en.wikipedia.org/wiki/Partition_%28database%29 With regard to
vertical partitioning, why does something like

// remove every document with a timestamp before January 1, 2008
var old = new Date("01/01/2008");
db.mycoll.remove({ "timestamp" : { "$lt" : old } });

not work for you? That should work for you, no? I mean, why would you
need to do vertical partitioning?

As a hint, since you want something fully automatic out of the
box: this is quite different, but maybe worth a look: http://jira.mongodb.org/browse/SERVER-211

Markus Gattol

Nov 13, 2010, 7:48:37 AM
to mongodb-user
Except for capped collections, MongoDB maintains a so-called "free
list" which keeps track of space that got freed up by removing
documents. This "free list" is then searched when inserting new
documents, i.e. if we delete a document and then insert a new one, the
new one takes the space of the old one (if it's of equal size or
smaller than the old one).
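
The reuse behaviour described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions (first-fit search, whole-slot reuse, sizes as plain numbers); the name `FreeListStore` is hypothetical and this is not MongoDB's actual storage code.

```javascript
// Toy model of a free list; an illustration only, not MongoDB's storage code.
function FreeListStore() {
  this.records = [];   // slot index -> { size, doc }, or null once deleted
  this.freeList = [];  // [{ slot, size }] describing space freed by deletes
}

FreeListStore.prototype.insert = function (doc, size) {
  // First fit: reuse the first freed slot that is large enough.
  for (var i = 0; i < this.freeList.length; i++) {
    if (this.freeList[i].size >= size) {
      var hole = this.freeList.splice(i, 1)[0];
      this.records[hole.slot] = { size: hole.size, doc: doc };
      return hole.slot;
    }
  }
  // No suitable hole: append at the end, growing the store.
  this.records.push({ size: size, doc: doc });
  return this.records.length - 1;
};

FreeListStore.prototype.remove = function (slot) {
  var rec = this.records[slot];
  if (!rec) return false;
  this.records[slot] = null;
  this.freeList.push({ slot: slot, size: rec.size });
  return true;
};
```

In this model a delete still has to touch each record individually, which is the cost discussed in the rest of the thread.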

Eliot Horowitz

Nov 13, 2010, 8:14:16 AM
to mongod...@googlegroups.com
The db reuses space from deleted objects.

Eliot Horowitz

Nov 13, 2010, 8:15:40 AM
to mongod...@googlegroups.com
Right. The main advantage is clustering, so you're not bouncing around the disk. Letting you cluster data by a key is something we're looking into.

Ted

Nov 13, 2010, 8:27:37 PM
to mongod...@googlegroups.com
Markus

This "db.mycoll.remove({ "timestamp" : { "$lt" : old }})" requires the database to find each entry in the index and then mark the record as free.  This requires I/O and CPU.  On a large collection of data (we have a project where we are expecting anywhere from 30M to 100M rows a day) I expect this to become painful.  With partitioning, it's a matter of waiting for the OS to delete data files.

Eliot

I believe if you add partitioning along with some good documentation, you will have a killer product.  It is already a joy to work with.

Is there an easy way to query across collections?  Or does one have to query each one individually?

Andrew Rollins

Nov 14, 2010, 11:22:53 PM
to mongod...@googlegroups.com
I feel like there is some confusion over what is meant by partitioning and why it's useful regarding performance, so I'll just elaborate on what I mean when I say partitioning.

Take MySQL as an example. In MySQL, I can do a statement like:

CREATE TABLE mytable (... some columns ..., created_at DATE)
    PARTITION BY RANGE ( YEAR(created_at) ) (
        PARTITION p0 VALUES LESS THAN (2010),
        PARTITION p1 VALUES LESS THAN MAXVALUE
    );

Creating a table like this splits my data into two physical partitions. Behind the scenes, MySQL backs each partition with its own separate files; it's like I created two tables, but that's abstracted from me. Markus, this is still horizontal partitioning, not vertical.

Now here's the key part: when I want to delete data from before 2010, I do "ALTER TABLE mytable DROP PARTITION p0". MySQL then just deletes the underlying partition. So I can drop a ton of old data and it's the equivalent of a file delete. No traversing indexes for matching records and deleting them one by one.

Ted and I are after something that has the performance implications of a file delete for dropping old data.
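
Without built-in partitioning, the nearest approximation is to drop whole collections rather than remove documents one by one. A rough sketch, assuming the data has been split by hand into one collection per UTC day named like `events_YYYYMMDD`; that naming scheme and the helper `expiredCollections` are hypothetical.

```javascript
// Hypothetical helper: given collection names like "events_20101112",
// return those whose day is strictly before the cutoff day (UTC).
// Assumes the prefix contains no regular-expression special characters.
function expiredCollections(names, prefix, cutoff) {
  var cutoffKey =
    cutoff.getUTCFullYear() * 10000 +
    (cutoff.getUTCMonth() + 1) * 100 +
    cutoff.getUTCDate();
  var pattern = new RegExp("^" + prefix + "_(\\d{8})$");
  return names.filter(function (name) {
    var m = pattern.exec(name);
    return m !== null && parseInt(m[1], 10) < cutoffKey;
  });
}

// In the mongo shell (assumption, not run here):
//   expiredCollections(db.getCollectionNames(), "events", cutoff)
//     .forEach(function (name) { db.getCollection(name).drop(); });
```

Names that do not match the pattern are ignored, so unrelated collections are never selected for dropping.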

Eliot, it sounds like clustering will allow for sequential traversal and delete within shards on some sort of secondary key. That said, won't the individual indexes still need to be marked as deleted and added to the free list? Won't this be more expensive than the MySQL style file partitioning for dropping large swaths of data?

- Andrew

Eliot Horowitz

Nov 15, 2010, 9:58:54 AM
to mongod...@googlegroups.com
Right, we're not planning true partitioning right now.
Clustering gets a lot of the benefit since you won't be disk bound
most of the time.
There's still some effort cleaning up indexes, but that's it.
The problem with partitioning in general is that you need to do a
lookup in N indexes instead of 1 for secondary indexes ( N is the
number of partitions ).
Something to consider in the future, but not short term.
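
The N-lookups cost can be pictured with a toy model in which each partition owns its own secondary index. This is only an illustration under that assumption; `lookupAcrossPartitions` and the `Map`-based indexes are hypothetical and do not reflect any real storage engine.

```javascript
// Toy model: each partition keeps its own secondary index (key -> ids).
// A lookup on a key that is not the partition key must probe every index.
function lookupAcrossPartitions(partitionIndexes, key) {
  var ids = [];
  var probes = 0;
  partitionIndexes.forEach(function (index) {
    probes += 1;                       // one index lookup per partition
    var hits = index.get(key) || [];
    ids = ids.concat(hits);
  });
  return { ids: ids, probes: probes };
}
```

With a single unpartitioned index the same query would need one probe; here the probe count grows linearly with the number of partitions.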

Have you considered using capped collections?
We're also going to be doing TTL collections, so things are removed
continuously rather than having to be deleted in one large batch.

Ted

Nov 15, 2010, 9:58:51 AM
to mongod...@googlegroups.com
Andrew - Thanks!  That is what I meant to say!  Exactly the feature I am looking for in MongoDB.  This is what we are currently doing in Oracle today.
 
The only other possible solution I know is to use multiple collections on different databases but I don't think one can easily query across databases.  Eliot, please correct me if I'm wrong.

Markus Gattol

Nov 15, 2010, 11:17:15 AM
to mongodb-user
You can't query across collections/databases in a single query. You would
have to query each one separately (with separate connections or pools if
the databases live on different servers) and handle the result sets in
your application.
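
Handling the result sets in the application could look roughly like the sketch below. It assumes each per-collection query has already returned an array of documents and that they share a sortable field; the helper `mergeResults` and the field name `timestamp` are hypothetical.

```javascript
// Hypothetical helper: combine per-collection result arrays into one list
// ordered by a shared field (numbers or Date objects both work here).
function mergeResults(resultSets, field) {
  var merged = [];
  resultSets.forEach(function (rows) {
    merged = merged.concat(rows);
  });
  return merged.sort(function (a, b) {
    return a[field] - b[field];
  });
}

// In the mongo shell (assumption, not run here):
//   var sets = names.map(function (n) {
//     return db.getCollection(n).find(query).toArray();
//   });
//   var all = mergeResults(sets, "timestamp");
```

This pulls every matching document into memory, so it only suits result sets that are small enough to hold in the application at once.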

Andrew Rollins

Nov 15, 2010, 12:34:17 PM
to mongod...@googlegroups.com
On Mon, Nov 15, 2010 at 9:58 AM, Eliot Horowitz <elioth...@gmail.com> wrote:
> Right, we're not planning true partitioning right now.
> Clustering gets a lot of the benefit since you won't be disk bound
> most of the time.
> Still some effort cleaning up indexes, but that's it.
> The problem with partitioning in general is that you need to do a
> lookup in N indexes instead of 1 for secondary indexes ( N is the
> number of partitions ).

True, but I'm ok with that because N is probably only 2 in my case. Even if it's larger, N is constant and I can plan around that.

> Something to consider in the future, but not short term.
>
> Have you considered using capped collections?

The docs say that you can't shard capped collections. I want something I know I can grow by adding nodes (already doing thousands of ops a second now). Is there a way to get around this limitation?

> We're also going to be doing TTL collections, so things are removed
> continuously rather than have to be done in 1 large delete.

TTL would also get me what I need (dropping old data).

- Andrew

Ted

Nov 15, 2010, 12:37:12 PM
to mongod...@googlegroups.com
Any ETA on the TTL and clustering feature?


Markus Gattol

Nov 15, 2010, 12:45:32 PM
to mongodb-user
- it says 1.7.X for TTL http://jira.mongodb.org/browse/SERVER-211
- note that by voting you can influence what priority a feature gets

Ted

Nov 15, 2010, 12:49:04 PM
to mongod...@googlegroups.com
Thanks guys for your time.

Andrew Rollins

Nov 15, 2010, 12:51:16 PM
to mongod...@googlegroups.com
Also want to say thanks. Always appreciate the responsiveness and willingness to discuss different approaches.

- Andrew

Markus Gattol

Nov 15, 2010, 1:16:20 PM
to mongodb-user
Yes, good and fruitful discussion. I created a ticket
http://jira.mongodb.org/browse/SERVER-2097
You are welcome to vote :)