Message from discussion best way to iterate through a large-ish collection?

In-Reply-To: <AANLkTimYrHASgX2COV7si58x5_v1iMvD3zAPp35vQ...@mail.gmail.com>
References: <AANLkTikecSdFy41cDxEqyFCCxioxg+E9LJMj_if8U...@mail.gmail.com> <AANLkTimYrHASgX2COV7si58x5_v1iMvD3zAPp35vQ...@mail.gmail.com>
From: Korny Sietsma <ko...@sietsma.com>
Date: Mon, 27 Sep 2010 13:23:44 +1000
Message-ID: <AANLkTi=_Y4i0NFEP46bOuvBoz3AWXDa+B7a3=LcGV...@mail.gmail.com>
Subject: Re: [mongodb-user] best way to iterate through a large-ish collection?
To: mongodb-user@googlegroups.com
Content-Type: multipart/alternative; boundary=0016364d31734356f0049135435a

--0016364d31734356f0049135435a
Content-Type: text/plain; charset=ISO-8859-1

I also found more on this at
http://www.mongodb.org/display/DOCS/Frequently+Asked+Questions+-+Ruby#FrequentlyAskedQuestions-Ruby-IkeepgettingCURSORNOTFOUNDexceptions.What%27shappening%3F

I'm still not clear on how a cursor I'm hitting about 2000 times a second
can time out!

Trying now with:
    @collection.find({}, :timeout => false) do |cursor|
      cursor.each do |rawEvent|
        ...
      end
    end
I'm not quite sure why the double-block is necessary either, but as long as
it works, I'll be happy.

- Korny
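
On the "double block" question: as far as I understand the 1.x Ruby driver, find hands the cursor itself to the outer block so that it can close the cursor once the block returns; a cursor with the timeout disabled would otherwise stay open on the server. Below is a minimal sketch of that pattern. SketchCursor and SketchCollection are hypothetical stand-ins written for illustration, not the mongo gem's classes, and the selector is ignored.

```ruby
# Hypothetical stand-ins for illustration only; NOT the mongo gem's classes.
class SketchCursor
  attr_reader :closed

  def initialize(docs)
    @docs = docs
    @closed = false
  end

  # Yield each document in turn, like iterating a real cursor.
  def each(&block)
    @docs.each(&block)
  end

  # A real driver would ask the server to free the cursor here.
  def close
    @closed = true
  end
end

class SketchCollection
  def initialize(docs)
    @docs = docs
  end

  # With :timeout => false, insist on a block so the cursor always gets closed.
  # The selector is accepted but ignored in this sketch.
  def find(selector = {}, opts = {})
    if opts[:timeout] == false && !block_given?
      raise ArgumentError, "a block is required when the timeout is disabled"
    end
    cursor = SketchCursor.new(@docs)
    return cursor unless block_given?
    begin
      yield cursor
    ensure
      cursor.close # runs even if the caller's block raises
    end
    nil
  end
end

collection = SketchCollection.new([{ "n" => 1 }, { "n" => 2 }, { "n" => 3 }])
seen = []
handle = nil
# Outer block receives the cursor; inner block receives each document.
collection.find({}, :timeout => false) do |cursor|
  handle = cursor
  cursor.each { |raw_event| seen << raw_event["n"] }
end
```

The outer block scopes the cursor's lifetime and the inner block does the per-document work, which is why there are two levels.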

On Mon, Sep 27, 2010 at 12:52 PM, Eliot Horowitz <eliothorow...@gmail.com> wrote:

> The problem is that cursors time out after 10 minutes, so if you're doing
> client-side processing, that could trigger the timeout.
> There are two solutions: use the NO_CURSOR_TIMEOUT option or make the batch
> size smaller.
>
> On Sun, Sep 26, 2010 at 10:45 PM, Korny Sietsma <ko...@sietsma.com> wrote:
> > Hi folks;
> > I'm wondering what is "best practice" for when you want to process every
> > document in a large collection in a (ruby) script.
> > I'm trying to build stats on a collection containing 33 million fairly
> > complex documents; I'm currently traversing them by simply running:
> > @db['rawEvents'].find().each do |rawEvent|
> >   ... do something with the data
> > end
> > However, this takes a while to start (I'm assuming in building the cursor)
> > and on my db it fails after a couple of hours with:
> > /home/ubuntu/.rvm/gems/ruby-1.9.2-p0/gems/mongo-1.0.8/lib/mongo/connection.rb:784:in
> > `check_response_flags': Query response returned CURSOR_NOT_FOUND. Either an
> > invalid cursor was specified, or the cursor may have timed out on the
> > server. (Mongo::OperationFailure)
> > and in the Mongo logs all I can see is:
> > ... a bunch of successful logs, then:
> > Mon Sep 27 01:36:56 [conn601] getmore ysa.rawEvents cid:627945061045667031
> > getMore: {}  bytes:1049629 nreturned:707 955ms
> > Mon Sep 27 01:39:07 [conn601] getmore ysa.rawEvents cid:627945061045667031
> > getMore: {}  bytes:1048698 nreturned:727 147ms
> > Mon Sep 27 01:40:04 [conn601] getmore ysa.rawEvents cid:627945061045667031
> > getMore: {}  bytes:1084876 nreturned:651 1271310283ms
> > Mon Sep 27 02:03:01 [conn601] getmore ysa.rawEvents cid:627945061045667031
> > getMore: {}  bytes:1048609 nreturned:651 105ms
> > Mon Sep 27 02:04:34 [conn601] getmore ysa.rawEvents cid:627945061045667031
> > getMore: {}  bytes:1051472 nreturned:644 589ms
> > Mon Sep 27 02:04:56 [conn601] getmore ysa.rawEvents cid:627945061045667031
> > getMore: {}  bytes:1051822 nreturned:687 118ms
> > Mon Sep 27 02:22:57 [conn601] getMore: cursorid not found ysa.rawEvents
> > 627945061045667031
> > Mon Sep 27 02:22:57 [conn601] getmore ysa.rawEvents cid:627945061045667031
> > bytes:20 nreturned:0 134ms
> > Now, I'm fairly sure (well, I hope!) that there is no data corruption - I'm
> > just guessing something timed out somewhere?  I'm not sure what's going on
> > with that huge time at 1:40...
> > Given that I know nothing is writing to the collection, and I don't care
> > about query order, is there some better way to process every document in the
> > collection than this?
> > (server version is mongodb 1.6.2 running on Ubuntu 10.4 on an Amazon ec2
> > server, with the ruby client v 1.0.8)
> > - Korny
> > --
> > Kornelis Sietsma  korny at my surname dot com
> > kornys on twitter/fb/gtalk/gwave www.sietsma.com/korny
> > "Every jumbled pile of person has a thinking part
> > that wonders what the part that isn't thinking
> > isn't thinking of"
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "mongodb-user" group.
> > To post to this group, send email to mongodb-user@googlegroups.com.
> > To unsubscribe from this group, send email to
> > mongodb-user+unsubscribe@googlegroups.com.
> > For more options, visit this group at
> > http://groups.google.com/group/mongodb-user?hl=en.
> >
>


-- 
Kornelis Sietsma  korny at my surname dot com
kornys on twitter/fb/gtalk/gwave www.sietsma.com/korny
"Every jumbled pile of person has a thinking part
that wonders what the part that isn't thinking
isn't thinking of"
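
On Eliot's second suggestion (a smaller batch size), here is a rough back-of-the-envelope sketch. It assumes the server's idle clock for a cursor restarts on each getMore and that the client works through a whole batch locally before asking for the next one; both are my reading of the quoted reply, not something the thread confirms. The method names are made up for illustration. The batch size of roughly 700 documents comes from the nreturned values in the quoted log, and 2000 documents a second is the rate mentioned above.

```ruby
# Rough time between two getMore requests: the client processes one
# whole batch locally before it goes back to the server.
def seconds_between_getmores(docs_per_batch, seconds_per_doc)
  docs_per_batch * seconds_per_doc
end

# True when the gap reaches the idle limit (10 minutes in the quoted reply).
def risks_timeout?(docs_per_batch, seconds_per_doc, idle_limit_seconds = 10 * 60)
  seconds_between_getmores(docs_per_batch, seconds_per_doc) >= idle_limit_seconds
end

# About 700 documents per batch at 2000 documents a second.
steady_gap = seconds_between_getmores(700, 1.0 / 2000)

# If processing stalled to one document per second, a 700-document batch
# would take 700 seconds, but a 100-document batch only 100 seconds.
slow_big_batch   = risks_timeout?(700, 1.0)
slow_small_batch = risks_timeout?(100, 1.0)
```

At the steady rate the gap is well under a second, so under this model a timeout would need a long stall somewhere in the processing; a smaller batch shortens the gap for any given per-document cost.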

--0016364d31734356f0049135435a--