best way to iterate through a large-ish collection?

Korny Sietsma

Sep 26, 2010, 10:45:09 PM
to mongod...@googlegroups.com
Hi folks;
I'm wondering what is "best practice" for when you want to process every document in a large collection in a (ruby) script.

I'm trying to build stats on a collection containing 33 million fairly complex documents; I'm currently traversing them by simply running:
@db['rawEvents'].find().each do |rawEvent|
  ... do something with the data
end

However, this takes a while to start (I'm assuming in building the cursor) and on my db it fails after a couple of hours with:
/home/ubuntu/.rvm/gems/ruby-1.9.2-p0/gems/mongo-1.0.8/lib/mongo/connection.rb:784:in `check_response_flags': Query response returned CURSOR_NOT_FOUND. Either an invalid cursor was specified, or the cursor may have timed out on the server. (Mongo::OperationFailure)

and in the Mongo logs all I can see is:
... a bunch of successful logs, then:
Mon Sep 27 01:36:56 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {}  bytes:1049629 nreturned:707 955ms
Mon Sep 27 01:39:07 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {}  bytes:1048698 nreturned:727 147ms
Mon Sep 27 01:40:04 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {}  bytes:1084876 nreturned:651 1271310283ms
Mon Sep 27 02:03:01 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {}  bytes:1048609 nreturned:651 105ms
Mon Sep 27 02:04:34 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {}  bytes:1051472 nreturned:644 589ms
Mon Sep 27 02:04:56 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {}  bytes:1051822 nreturned:687 118ms
Mon Sep 27 02:22:57 [conn601] getMore: cursorid not found ysa.rawEvents 627945061045667031
Mon Sep 27 02:22:57 [conn601] getmore ysa.rawEvents cid:627945061045667031 bytes:20 nreturned:0 134ms

Now, I'm fairly sure (well, I hope!) that there is no data corruption - I'm just guessing something timed out somewhere?  I'm not sure what's going on with that huge time at 1:40...

Given that I know nothing is writing to the collection, and I don't care about query order, is there some better way to process every document in the collection than this?

(server version is MongoDB 1.6.2 running on Ubuntu 10.04 on an Amazon EC2 server, with the Ruby client v1.0.8)

- Korny

--
Kornelis Sietsma  korny at my surname dot com
kornys on twitter/fb/gtalk/gwave www.sietsma.com/korny
"Every jumbled pile of person has a thinking part
that wonders what the part that isn't thinking
isn't thinking of"

Alvin Richards

Sep 26, 2010, 10:50:18 PM
to mongodb-user
The preferred way to do this is via a tailable cursor

http://www.mongodb.org/display/DOCS/Tailable+Cursors

-Alvin


Eliot Horowitz

Sep 26, 2010, 10:52:30 PM
to mongod...@googlegroups.com
The problem is that cursors time out on the server after 10 minutes of
inactivity, so if your client-side processing of one batch takes longer
than that, the timeout can trigger.
There are two solutions: use the NO_CURSOR_TIMEOUT option, or make the
batch size smaller.
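
A rough way to see why a smaller batch helps: the client only goes back to the server when it needs the next batch, so the time spent working through one batch has to stay under the timeout. The sketch below is plain arithmetic, not driver code. The function names, the safety factor, and the one-second-per-document workload are made up for illustration; the 600-second limit is the 10-minute timeout mentioned above.

```python
import math

# The 10-minute server-side cursor timeout mentioned above, in seconds.
CURSOR_TIMEOUT_SECONDS = 10 * 60


def seconds_between_server_hits(docs_per_batch, seconds_per_doc):
    """Time the client spends on one batch before asking the server for more."""
    return docs_per_batch * seconds_per_doc


def batch_outlives_cursor(docs_per_batch, seconds_per_doc,
                          timeout=CURSOR_TIMEOUT_SECONDS):
    """True if processing one batch takes at least as long as the timeout."""
    return seconds_between_server_hits(docs_per_batch, seconds_per_doc) >= timeout


def largest_safe_batch(seconds_per_doc, timeout=CURSOR_TIMEOUT_SECONDS,
                       safety_factor=0.5):
    """Largest batch (in documents) whose processing time stays under
    safety_factor * timeout. Never returns less than 1."""
    return max(1, math.floor(timeout * safety_factor / seconds_per_doc))


if __name__ == "__main__":
    # Hypothetical workload: 1 second of client-side work per document.
    print(batch_outlives_cursor(700, 1.0))  # 700 s of work per batch
    print(batch_outlives_cursor(100, 1.0))  # 100 s of work per batch
    print(largest_safe_batch(1.0))
```

With slow per-document work, even a batch of a few hundred documents can keep the client away from the server for longer than the timeout, which is why shrinking the batch is one of the two fixes.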


kevin

Sep 26, 2010, 11:00:31 PM
to mongod...@googlegroups.com
Can you tell me how to pass the NO_CURSOR_TIMEOUT option in pymongo?

Will this fail if it takes over 10 minutes?
for x in conn.db.table.find():
    do_with(x)

Eliot Horowitz

Sep 26, 2010, 11:02:53 PM
to mongod...@googlegroups.com
Yes - if do_with for 1 batch takes too long, it'll fail.

Try:
for x in conn.db.table.find(timeout=False):
    do_with(x)

kevin

Sep 26, 2010, 11:06:23 PM
to mongod...@googlegroups.com
Do you know whether the default for timeout in pymongo is True or False? I can't tell from the docs:

      - `timeout` (optional): if True, any returned cursor will be
        subject to the normal timeout behavior of the mongod
        process. Otherwise, the returned cursor will never timeout
        at the server. Care should be taken to ensure that cursors
        with timeout turned off are properly closed.

Eliot Horowitz

Sep 26, 2010, 11:07:12 PM
to mongod...@googlegroups.com
The default is True.

kevin

Sep 26, 2010, 11:10:02 PM
to mongod...@googlegroups.com
OK.
Can you change the default to False? I don't think anyone would expect this to fail.
Thanks.

Eliot Horowitz

Sep 26, 2010, 11:12:10 PM
to mongod...@googlegroups.com
No - the timeout is very important.
Without it, buggy client code could kill a server by leaving behind an unbounded number of open cursors.
We can make the docs clearer so it's more obvious how to change it when you need to.
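
One way to turn the timeout off without leaving cursors behind is to close the cursor in a finally block. This is only a sketch: process_all is a made-up helper, and FakeCollection and FakeCursor are stand-ins so the snippet runs without a server. With a real pymongo collection the option would be passed as find(timeout=False), as shown earlier in this thread; newer pymongo releases spell it no_cursor_timeout=True.

```python
def process_all(collection, handle, **find_options):
    """Call handle(doc) for every document and always close the cursor.

    Returns the number of documents handled.
    """
    cursor = collection.find(**find_options)
    try:
        count = 0
        for doc in cursor:
            handle(doc)
            count += 1
        return count
    finally:
        # Runs on success and on error, so a cursor with the timeout
        # turned off is not left open on the server.
        cursor.close()


class FakeCursor:
    """Stand-in for a driver cursor; records whether close() was called."""

    def __init__(self, docs):
        self._docs = iter(docs)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._docs)

    def close(self):
        self.closed = True


class FakeCollection:
    """Stand-in for a collection; remembers the last cursor and find() options."""

    def __init__(self, docs):
        self._docs = docs
        self.last_cursor = None
        self.last_options = None

    def find(self, **options):
        self.last_options = options
        self.last_cursor = FakeCursor(self._docs)
        return self.last_cursor


if __name__ == "__main__":
    coll = FakeCollection([{"n": 1}, {"n": 2}, {"n": 3}])
    seen = []
    process_all(coll, seen.append, timeout=False)
    print(len(seen), coll.last_cursor.closed)
```

The point of the try/finally is that an exception in the per-document handler still closes the cursor, which is the "care should be taken" part of the pymongo docstring quoted above.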

Korny Sietsma

Sep 26, 2010, 11:23:44 PM
to mongod...@googlegroups.com

I'm still not clear on how a cursor I'm hitting about 2000 times a second can time out!

Trying now with:
    @collection.find({}, :timeout => false) do |cursor|
      cursor.each do |rawEvent|
        ...
      end
    end
I'm not quite sure why the double-block is necessary either, but as long as it works, I'll be happy.

- Korny

Eliot Horowitz

Sep 27, 2010, 5:22:15 PM
to mongod...@googlegroups.com
It's a matter of how often you hit the server, not how often you
touch the cursor on the client.
If you get a lot of documents in a batch (4 MB), you won't hit
the server very often.
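
The server log posted at the top of this thread can be used to check how often the server was actually hit. The sketch below measures the gaps between consecutive getmore lines and flags the ones longer than 10 minutes. The helper names are made up, the log lines are trimmed to their timestamp and prefix, and the parsing assumes every line falls on the same day, which holds for this excerpt.

```python
from datetime import datetime, timedelta

CURSOR_TIMEOUT = timedelta(minutes=10)

# getmore lines from the log earlier in this thread, trimmed after the namespace.
LOG_LINES = """\
Mon Sep 27 01:36:56 [conn601] getmore ysa.rawEvents
Mon Sep 27 01:39:07 [conn601] getmore ysa.rawEvents
Mon Sep 27 01:40:04 [conn601] getmore ysa.rawEvents
Mon Sep 27 02:03:01 [conn601] getmore ysa.rawEvents
Mon Sep 27 02:04:34 [conn601] getmore ysa.rawEvents
Mon Sep 27 02:04:56 [conn601] getmore ysa.rawEvents
Mon Sep 27 02:22:57 [conn601] getMore: cursorid not found ysa.rawEvents
""".splitlines()


def time_of(line):
    """Parse the HH:MM:SS field (fourth whitespace-separated token)."""
    return datetime.strptime(line.split()[3], "%H:%M:%S")


def gaps_between(lines):
    """Time elapsed between each log line and the next one."""
    times = [time_of(line) for line in lines]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def long_gaps(lines, limit=CURSOR_TIMEOUT):
    """(index of the line that starts the gap, gap) for gaps longer than limit."""
    return [(i, gap) for i, gap in enumerate(gaps_between(lines)) if gap > limit]


if __name__ == "__main__":
    for index, gap in long_gaps(LOG_LINES):
        print(LOG_LINES[index][:19], "->", LOG_LINES[index + 1][:19], gap)
```

On this excerpt, two gaps exceed 10 minutes, and the second one ends at the "cursorid not found" line. The log does not show why the first long gap did not also lose the cursor, so treat this as a diagnostic aid rather than a full explanation.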