Terrible iteration speed over rows.

jordan

unread,

Aug 3, 2012, 1:17:26 PM8/3/12

to mongod...@googlegroups.com

I'm trying to remove records that match a very simple query (involving one unindexed field).

There are about 266 million entries in the database (single instance, one machine), and about 200 million of them will wind up being deleted.

It looks like my command: db.collection.remove({ "field":"value" }) is deleting about 500 records per second.

By my calculations, this will take about 5 days to complete.

Is mongo simply not the correct database for 266 million "small" documents? Am I running the wrong query? It seems inadvisable to keep an index on every field, but sometimes I will have to run queries involving un-indexed fields.

Thank You--

--Jordan

Wes Freeman

unread,

Aug 3, 2012, 1:31:48 PM8/3/12

to mongod...@googlegroups.com

Can you share some mongostat output?

--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb

jordan

unread,

Aug 3, 2012, 2:48:09 PM8/3/12

to mongod...@googlegroups.com

I have switched to using a simple python script that iterates over a cursor for all records. This seems to be about 10X faster than remove().

I can post mongostat output once this query finishes (about one day). I don't want to destroy data while I'm iterating.

This is mongostat for the query that is executing currently (iterate manually). Stay tuned for an update.

insert query update delete getmore command flushes mapped vsize res non-mapped locked % idx miss % qr|qw ar|aw netIn netOut conn time

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:44:47

0 0 0 0 0 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 62b 1k 4 14:44:48

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:44:49

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:44:50

0 0 0 0 0 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 62b 1k 4 14:44:51

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:44:52

0 0 0 0 0 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 62b 1k 4 14:44:53

0 0 0 0 1 1 0 258g 518g 1.41g 260g 0 0 0|0 0|0 107b 4m 4 14:44:54

0 0 0 0 1 1 0 258g 518g 1.41g 260g 0 0 0|0 0|0 107b 4m 4 14:44:55

0 0 0 0 0 1 0 258g 518g 1.41g 260g 0 0 0|0 0|0 62b 1k 4 14:44:56

insert query update delete getmore command flushes mapped vsize res non-mapped locked % idx miss % qr|qw ar|aw netIn netOut conn time

0 0 0 0 1 1 0 258g 518g 1.41g 260g 0 0 0|0 0|0 107b 4m 4 14:44:57

0 0 0 0 1 1 1 258g 518g 1.41g 260g 0 0 0|0 0|0 107b 4m 4 14:44:58

0 0 0 0 0 1 0 258g 518g 1.41g 260g 0 0 0|0 0|0 62b 1k 4 14:44:59

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:45:00

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:45:01

0 0 0 0 0 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 62b 1k 4 14:45:02

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:45:03

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:45:04

0 0 0 0 0 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 62b 1k 4 14:45:05

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:45:06

insert query update delete getmore command flushes mapped vsize res non-mapped locked % idx miss % qr|qw ar|aw netIn netOut conn time

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 1|0 62b 1k 4 14:45:07

0 0 0 0 0 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:45:08

0 0 0 0 1 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 107b 4m 4 14:45:09

0 0 0 0 0 1 0 258g 518g 1.42g 260g 0 0 0|0 0|0 62b 1k 4 14:45:10

Wes Freeman

unread,

Aug 3, 2012, 2:53:11 PM8/3/12

to mongod...@googlegroups.com

Were you running remove() from the shell before? Or from a driver?

--

Jeremy Mikola

unread,

Aug 3, 2012, 2:55:41 PM8/3/12

to mongod...@googlegroups.com

On Friday, August 3, 2012 1:17:26 PM UTC-4, jordan wrote:

Is mongo simply not the correct database for 266 million "small" documents? Am I running the wrong query? It seems inadvisable to keep an index on every field, but sometimes I will have to run queries involving un-indexed fields.

Without an index for the query, you're inviting a table scan across all 266 million documents. Even if your entire working set fits in memory, the initial page faults are going to cause disk to be read into memory the first time. Additionally, each insert/delete is going to require updating each index on the collection.

Depending on your use case and if this is a one-time migration, an alternative approach may be to insert the 66 million documents you intend to keep into a new collection and then recreate the indexes you need and swap the collections. You could use the lack of concurrency in db.eval() to your advantage to drop the old collection and rename the newly created collection atomically. Also, consider these are small documents, inserting them fresh would help avoid the fragmentation you would expiration with a mass-deletion (unless you're willing to compact() afterwards).

jordan

unread,

Aug 3, 2012, 3:38:30 PM8/3/12

to mongod...@googlegroups.com

Wes,

Were you running remove() from the shell before? Or from a driver?

This was in the mongo shell.

Jeremy,

Even if your entire working set fits in memory, the initial page faults are going to cause disk to be read into memory the first time.

By "small" document, I'm talking about 500 bytes per document. This gives a total DB size of nearly 200GB (with indexes). Certainly too large for memory.

Additionally, each insert/delete is going to require updating each index on the collection...

Depending on your use case and if this is a one-time migration, an alternative approach may be to insert the 66 million documents you intend to keep into a new collection and then recreate the indexes you need and swap the collections.

I believe that this reasoning does indeed apply to my use case. The python script (which is 10x faster) is scanning the rows and inserting the wanted entries into an entirely new Mongod instance (on another server). I was surprised that even including network latency,. this approach is still faster.

If updates to large collections (IE, Remove) are, indeed, considerably slow, is this the generally accepted alternative?

Thank You

--Jordan

Jeremy Mikola

unread,

Aug 3, 2012, 4:46:53 PM8/3/12

to mongod...@googlegroups.com

On Friday, August 3, 2012 3:38:30 PM UTC-4, jordan wrote:

Additionally, each insert/delete is going to require updating each index on the collection...
Depending on your use case and if this is a one-time migration, an alternative approach may be to insert the 66 million documents you intend to keep into a new collection and then recreate the indexes you need and swap the collections.
I believe that this reasoning does indeed apply to my use case. The python script (which is 10x faster) is scanning the rows and inserting the wanted entries into an entirely new Mongod instance (on another server). I was surprised that even including network latency,. this approach is still faster.
If updates to large collections (IE, Remove) are, indeed, considerably slow, is this the generally accepted alternative?

It's one alternative. Basically, when removing documents, we need to (a) clear out any references to those documents in the collection's indexes and (b) add those documents' disk locations to the free list (DB internals). If all of your indexes cannot be contained in memory, (a) may involve frequent disk access. Meanwhile, (b) is definitely going to involve scattered disk access. Overall, this remove operation would be bound by disk IO.

Another option would be to drop all indexes on the collection, proceed with the remove query, and then recreate indexes for the 66 million retained documents. The benefit of this is highly dependent on your schema (how many indexes) and system (combined index size vs. memory) and may provide neglible improvement in practice.

Wes Freeman

unread,

Aug 3, 2012, 4:54:52 PM8/3/12

to mongod...@googlegroups.com

Yeah, it sounds like you were churning index and data into RAM to do the remove. I assume your indexes are bigger than RAM, also? So you're page faulting on the index updates as well as the remove. Dropping the indexes probably would have helped if this remove operation takes precedence over queries.

--

Reply all

Reply to author

Forward