I'm trying to remove records that match a very simple query (involving one unindexed field).
There are about 266 million entries in the database (single instance, one machine), and about 200 million of them will wind up being deleted.
It looks like my command: db.collection.remove({ "field":"value" }) is deleting about 500 records per second.
By my calculations, this will take about 5 days to complete.
Is mongo simply not the correct database for 266 million "small" documents? Am I running the wrong query? It seems inadvisable to keep an index on every field, but sometimes I will have to run queries involving un-indexed fields.
On Fri, Aug 3, 2012 at 1:17 PM, jordan <neutrin...@gmail.com> wrote:
> I'm trying to remove records that match a very simple query (involving one
> unindexed field).
> There are about 266 million entries in the database (single instance, one
> machine), and about 200 million of them will wind up being deleted.
> It looks like my command: db.collection.remove({ "field":"value" }) is
> deleting about 500 records per second.
> By my calculations, this will take about 5 days to complete.
> Is mongo simply not the correct database for 266 million "small"
> documents? Am I running the wrong query? It seems inadvisable to keep an
> index on every field, but sometimes I will have to run queries involving
> un-indexed fields.
> Thank You--
> --Jordan
> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user+unsubscribe@googlegroups.com
> See also the IRC channel -- freenode.net#mongodb
On Fri, Aug 3, 2012 at 2:48 PM, jordan <neutrin...@gmail.com> wrote:
> I have switched to using a simple python script that iterates over a
> cursor for all records. This seems to be about 10X faster than remove().
> *I can post mongostat output once this query finishes* (about one day). I
> don't want to destroy data while I'm iterating.
> This is mongostat for the query that is executing currently (iterate
> manually). Stay tuned for an update.
> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user+unsubscribe@googlegroups.com
> See also the IRC channel -- freenode.net#mongodb
On Friday, August 3, 2012 1:17:26 PM UTC-4, jordan wrote:
> Is mongo simply not the correct database for 266 million "small" > documents? Am I running the wrong query? It seems inadvisable to keep an > index on every field, but sometimes I will have to run queries involving > un-indexed fields.
Without an index for the query, you're inviting a table scan across all 266 million documents. Even if your entire working set fits in memory, the initial page faults are going to cause disk to be read into memory the first time. Additionally, each insert/delete is going to require updating each index on the collection.
Depending on your use case and if this is a one-time migration, an alternative approach may be to insert the 66 million documents you intend to keep into a new collection and then recreate the indexes you need and swap the collections. You could use the lack of concurrency in db.eval()<http://www.mongodb.org/display/DOCS/Server-side+Code+Execution#Server...> to your advantage to drop the old collection and rename the newly created collection atomically. Also, consider these are small documents, inserting them fresh would help avoid the fragmentation you would expiration with a mass-deletion (unless you're willing to compact()<http://www.mongodb.org/display/DOCS/compact+Command> afterwards).
*Were you running remove() from the shell before? Or from a driver? * This was in the mongo shell.
Jeremy,
*Even if your entire working set fits in memory, the initial page faults are going to cause disk to be read into memory the first time. * By "small" document, I'm talking about 500 bytes per document. This gives a total DB size of nearly 200GB (with indexes). Certainly too large for memory.
*Additionally, each insert/delete is going to require updating each index on the collection...* *Depending on your use case and if this is a one-time migration, an alternative approach may be to insert the 66 million documents you intend to keep into a new collection and then recreate the indexes you need and swap the collections. * I believe that this reasoning does indeed apply to my use case. The python script (which is 10x faster) is scanning the rows and inserting the wanted entries into an entirely new Mongod instance (on another server). I was surprised that even including network latency,. this approach is still faster. If updates to large collections (IE, Remove) are, indeed, considerably slow, is this the generally accepted alternative?
On Friday, August 3, 2012 2:55:41 PM UTC-4, Jeremy Mikola wrote:
> On Friday, August 3, 2012 1:17:26 PM UTC-4, jordan wrote:
>> Is mongo simply not the correct database for 266 million "small" >> documents? Am I running the wrong query? It seems inadvisable to keep an >> index on every field, but sometimes I will have to run queries involving >> un-indexed fields.
> Without an index for the query, you're inviting a table scan across all > 266 million documents. Even if your entire working set fits in memory, the > initial page faults are going to cause disk to be read into memory the > first time. Additionally, each insert/delete is going to require updating > each index on the collection.
> Depending on your use case and if this is a one-time migration, an > alternative approach may be to insert the 66 million documents you intend > to keep into a new collection and then recreate the indexes you need and > swap the collections. You could use the lack of concurrency in db.eval()<http://www.mongodb.org/display/DOCS/Server-side+Code+Execution#Server...> to > your advantage to drop the old collection and rename the newly created > collection atomically. Also, consider these are small documents, inserting > them fresh would help avoid the fragmentation you would expiration with a > mass-deletion (unless you're willing to compact()<http://www.mongodb.org/display/DOCS/compact+Command> > afterwards).
On Friday, August 3, 2012 3:38:30 PM UTC-4, jordan wrote:
> *Additionally, each insert/delete is going to require updating each index > on the collection...* > *Depending on your use case and if this is a one-time migration, an > alternative approach may be to insert the 66 million documents you intend > to keep into a new collection and then recreate the indexes you need and > swap the collections. * > I believe that this reasoning does indeed apply to my use case. The python > script (which is 10x faster) is scanning the rows and inserting the wanted > entries into an entirely new Mongod instance (on another server). I was > surprised that even including network latency,. this approach is still > faster. > If updates to large collections (IE, Remove) are, indeed, considerably > slow, is this the generally accepted alternative?
It's one alternative. Basically, when removing documents, we need to (a) clear out any references to those documents in the collection's indexes and (b) add those documents' disk locations to the free list (DB internals). If all of your indexes cannot be contained in memory, (a) may involve frequent disk access. Meanwhile, (b) is definitely going to involve scattered disk access. Overall, this remove operation would be bound by disk IO.
Another option would be to drop all indexes on the collection, proceed with the remove query, and then recreate indexes for the 66 million retained documents. The benefit of this is highly dependent on your schema (how many indexes) and system (combined index size vs. memory) and may provide neglible improvement in practice.
Yeah, it sounds like you were churning index and data into RAM to do the
remove. I assume your indexes are bigger than RAM, also? So you're page
faulting on the index updates as well as the remove. Dropping the indexes
probably would have helped if this remove operation takes precedence over
queries.
On Fri, Aug 3, 2012 at 4:46 PM, Jeremy Mikola <jmik...@gmail.com> wrote:
> On Friday, August 3, 2012 3:38:30 PM UTC-4, jordan wrote:
>> *Additionally, each insert/delete is going to require updating each
>> index on the collection...*
>> *Depending on your use case and if this is a one-time migration, an
>> alternative approach may be to insert the 66 million documents you intend
>> to keep into a new collection and then recreate the indexes you need and
>> swap the collections. *
>> I believe that this reasoning does indeed apply to my use case. The
>> python script (which is 10x faster) is scanning the rows and inserting the
>> wanted entries into an entirely new Mongod instance (on another server). I
>> was surprised that even including network latency,. this approach is still
>> faster.
>> If updates to large collections (IE, Remove) are, indeed, considerably
>> slow, is this the generally accepted alternative?
> It's one alternative. Basically, when removing documents, we need to (a)
> clear out any references to those documents in the collection's indexes and
> (b) add those documents' disk locations to the free list (DB internals). If
> all of your indexes cannot be contained in memory, (a) may involve frequent
> disk access. Meanwhile, (b) is definitely going to involve scattered disk
> access. Overall, this remove operation would be bound by disk IO.
> Another option would be to drop all indexes on the collection, proceed
> with the remove query, and then recreate indexes for the 66 million
> retained documents. The benefit of this is highly dependent on your schema
> (how many indexes) and system (combined index size vs. memory) and may
> provide neglible improvement in practice.
> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user+unsubscribe@googlegroups.com
> See also the IRC channel -- freenode.net#mongodb