best way to iterate through a large-ish collection?
From: Korny Sietsma <ko...@sietsma.com>
Date: Mon, 27 Sep 2010 13:23:44 +1000
Subject: Re: [mongodb-user] best way to iterate through a large-ish collection?
To: mongodb-user@googlegroups.com
I also found more on this at
http://www.mongodb.org/display/DOCS/Frequently+Asked+Questions+-+Ruby#FrequentlyAskedQuestions-Ruby-IkeepgettingCURSORNOTFOUNDexceptions.What%27shappening%3F
I'm still not clear on how a cursor I'm hitting about 2000 times a second
can time out!
Trying now with:
  @collection.find({}, :timeout => false) do |cursor|
    cursor.each do |rawEvent|
      ...
    end
  end
I'm not quite sure why the double-block is necessary either, but as long as
it works, I'll be happy.
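
On the timeout question above, one rough way to look at it is to measure how long the cursor sat idle between getmore requests in the server log quoted below. This is only a sketch: it assumes the 10-minute limit from Eliot's reply applies to the time between getmore requests rather than to how fast the client reads documents, and that the log timestamps are a fair proxy for when each request happened. The method and constant names are made up for illustration.

```ruby
require 'time'

# Timestamps of the getmore entries in the server log quoted in this thread
# (all on Mon Sep 27; the year is taken from the message date).
LOG_TIMES = %w[01:36:56 01:39:07 01:40:04 02:03:01 02:04:34 02:04:56 02:22:57]

# Assumption: the 10-minute cursor timeout counts idle time between getmores.
CURSOR_TIMEOUT_SECONDS = 10 * 60

# Seconds elapsed between each pair of consecutive log entries.
def idle_gaps(times)
  stamps = times.map { |t| Time.parse("2010-09-27 #{t} UTC") }
  stamps.each_cons(2).map { |earlier, later| (later - earlier).to_i }
end

# Pairs of consecutive entries whose gap is longer than the limit,
# returned as [[from, to], gap_in_seconds].
def gaps_over_limit(times, limit = CURSOR_TIMEOUT_SECONDS)
  times.each_cons(2).zip(idle_gaps(times)).select { |_, gap| gap > limit }
end

gaps_over_limit(LOG_TIMES).each do |(from, to), gap|
  puts "#{from} -> #{to}: #{gap}s idle"
end
```

On the quoted log this reports two gaps longer than ten minutes: 1377s after the 01:40:04 entry (the one with the odd duration) and 1081s between 02:04:56 and the "cursorid not found" entry at 02:22:57. So whatever the per-document rate is while a batch is being consumed, the log suggests there were long stretches with no getmore at all.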
- Korny
On Mon, Sep 27, 2010 at 12:52 PM, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> The problem is that cursors time out after 10 minutes, so if you're doing
> client-side processing, that could trigger it.
> There are 2 solutions: use the NO_CURSOR_TIMEOUT option or make the batch
> size smaller.
>
> On Sun, Sep 26, 2010 at 10:45 PM, Korny Sietsma <ko...@sietsma.com> wrote:
> > Hi folks;
> > I'm wondering what is "best practice" for when you want to process every
> > document in a large collection in a (ruby) script.
> > I'm trying to build stats on a collection containing 33 million fairly
> > complex documents; I'm currently traversing them by simply running:
> > @db['rawEvents'].find().each do |rawEvent|
> > ... do something with the data
> > end
> > However, this takes a while to start (I'm assuming in building the cursor)
> > and on my db it fails after a couple of hours with:
> > /home/ubuntu/.rvm/gems/ruby-1.9.2-p0/gems/mongo-1.0.8/lib/mongo/connection.rb:784:in
> > `check_response_flags': Query response returned CURSOR_NOT_FOUND. Either an
> > invalid cursor was specified, or the cursor may have timed out on the
> > server. (Mongo::OperationFailure)
> > and in the Mongo logs all I can see is:
> > ... a bunch of successful logs, then:
> > Mon Sep 27 01:36:56 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {} bytes:1049629 nreturned:707 955ms
> > Mon Sep 27 01:39:07 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {} bytes:1048698 nreturned:727 147ms
> > Mon Sep 27 01:40:04 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {} bytes:1084876 nreturned:651 1271310283ms
> > Mon Sep 27 02:03:01 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {} bytes:1048609 nreturned:651 105ms
> > Mon Sep 27 02:04:34 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {} bytes:1051472 nreturned:644 589ms
> > Mon Sep 27 02:04:56 [conn601] getmore ysa.rawEvents cid:627945061045667031 getMore: {} bytes:1051822 nreturned:687 118ms
> > Mon Sep 27 02:22:57 [conn601] getMore: cursorid not found ysa.rawEvents 627945061045667031
> > Mon Sep 27 02:22:57 [conn601] getmore ysa.rawEvents cid:627945061045667031 bytes:20 nreturned:0 134ms
> > Now, I'm fairly sure (well, I hope!) that there is no data corruption - I'm
> > just guessing something timed out somewhere? I'm not sure what's going on
> > with that huge time at 1:40...
> > Given that I know nothing is writing to the collection, and I don't care
> > about query order, is there some better way to process every document in the
> > collection than this?
> > (server version is mongodb 1.6.2 running on Ubuntu 10.04 on an Amazon EC2
> > server, with the ruby client v 1.0.8)
> > - Korny
> > --
> > Kornelis Sietsma korny at my surname dot com
> > kornys on twitter/fb/gtalk/gwave www.sietsma.com/korny
> > "Every jumbled pile of person has a thinking part
> > that wonders what the part that isn't thinking
> > isn't thinking of"
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "mongodb-user" group.
> > To post to this group, send email to mongodb-user@googlegroups.com.
> > To unsubscribe from this group, send email to
> > mongodb-user+unsubscribe@googlegroups.com.
> > For more options, visit this group at
> > http://groups.google.com/group/mongodb-user?hl=en.
> >
--
Kornelis Sietsma korny at my surname dot com
kornys on twitter/fb/gtalk/gwave www.sietsma.com/korny
"Every jumbled pile of person has a thinking part
that wonders what the part that isn't thinking
isn't thinking of"