Googlebot making inefficient calls

Guido van Rossum

Apr 16, 2012, 3:31:59 PM
to codereview-discuss
I was looking at some data recorded with Appstats recently and found
that the Googlebot makes some pretty inefficient requests. For
example, I saw it go to
http://codereview.appspot.com/all?limit=10&closed=1&offset=2720 --
note the huge offset! It's easy to imagine how it got there -- it
blindly follows the "older" link on the "Closed issues" tab.

These queries are very efficient -- Appstats showed that to get a
batch of 10 records starting at 2720, a RunQuery followed by two Next
calls are made, where the RunQuery and the first Next calls return no
results but set the "skipped_results" value in the batch to some large
number.

Remember that offset queries must internally produce all the matches
and then skip the first <offset> results; to protect itself the
datastore stops skipping and sends a partial result when it has
skipped many results (maybe the limit is 1000).

These queries would be much more efficient using cursors.
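
To make the cost difference concrete, here is a minimal sketch in plain Python rather than the App Engine API. The in-memory ROWS list and the functions fetch_with_offset and fetch_with_cursor are hypothetical stand-ins. The sketch only assumes what is described above: an offset query has to walk past every skipped row, while a cursor lets the query resume at a known position.

```python
import bisect

# Toy "datastore": 3000 pretend issue ids in sorted order (an assumption,
# not real Rietveld data).
ROWS = list(range(1, 3001))


def fetch_with_offset(offset, limit):
    """Offset paging: every skipped row still has to be scanned."""
    scanned = 0
    results = []
    for row in ROWS:
        scanned += 1
        if scanned <= offset:
            continue  # skipped, but the scan was already paid for
        results.append(row)
        if len(results) == limit:
            break
    return results, scanned


def fetch_with_cursor(cursor, limit):
    """Cursor paging: seek directly to the position after `cursor`.

    Here the cursor is simply the last id of the previous page; None
    means "start from the beginning".
    """
    start = 0 if cursor is None else bisect.bisect_right(ROWS, cursor)
    results = ROWS[start:start + limit]
    next_cursor = results[-1] if results else None
    return results, len(results), next_cursor
```

With the offset from the URL above, fetch_with_offset(2720, 10) scans 2730 rows to return 10, while fetch_with_cursor(2720, 10) touches only the 10 rows it returns.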

We may also want to tell robots (or all users) to go away if the offset is too large.
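
One way this could look, as a rough sketch only: validate the offset before running any query and answer with an error status when it is too deep. The name check_offset, the cap of 1000, and the choice of status codes are all assumptions for illustration, not anything Rietveld does today.

```python
# Hypothetical cap; the right value would need to be chosen from real data.
MAX_OFFSET = 1000


def check_offset(raw_offset, max_offset=MAX_OFFSET):
    """Return (http_status, offset) for an `offset` URL parameter.

    200 with a usable offset, 400 for garbage or negative values,
    404 when the offset is deeper than we are willing to serve.
    """
    if raw_offset is None:
        return 200, 0  # no parameter: first page
    try:
        offset = int(raw_offset)
    except (TypeError, ValueError):
        return 400, None
    if offset < 0:
        return 400, None
    if offset > max_offset:
        return 404, None
    return 200, offset
```

Under these assumptions the request from the example, check_offset("2720"), is refused with a 404 before the datastore is touched.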

--
--Guido van Rossum (python.org/~guido)

Andi Albrecht

Apr 17, 2012, 1:57:41 PM
to Guido van Rossum, codereview-discuss
On Mon, Apr 16, 2012 at 9:31 PM, Guido van Rossum <gu...@python.org> wrote:
> I was looking at some data recorded with Appstats recently and found
> that the Googlebot makes some pretty inefficient requests. For
> example, I saw it go to
> http://codereview.appspot.com/all?limit=10&closed=1&offset=2720 --
> note the huge offset! It's easy to imagine how it got there -- it
> blindly follows the "older" link on the "Closed issues" tab.

Thanks for bringing this up!

>
> These queries are very efficient -- Appstats showed that to get a
> batch of 10 records starting at 2720, a RunQuery followed by two Next
> calls are made, where the RunQuery and the first Next calls return no
> results but set the "skipped_results" value in the batch to some large
> number.

I suppose you meant "inefficient" here?

>
> Remember that offset queries must internally produce all the matches
> and then skip the first <offset> results; to protect itself the
> datastore stops skipping and sends a partial result when it has
> skipped many results (maybe the limit is 1000).
>
> These queries would be much more efficient using cursors.

IMO this is worth an issue on the Rietveld tracker :)
The paging on the issue pages is very old code and IIRC it already
existed before cursors were available on App Engine. We should
modernize this code.

It's not directly related, but I've seen some exceptions regarding
invalid cursors caused by bots in the logs too (search page?). What
would be a "bot-friendly" way to handle invalid cursors as URL
parameters? Does anyone have a good example of how to deal with this?
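
For discussion, a minimal sketch of one possible approach: treat an unusable cursor as "start over" and redirect to the cursor-free URL instead of letting the exception turn into a 500. Everything here is hypothetical: InvalidCursorError stands in for whatever the datastore actually raises, and handle_page and fake_query are made-up names.

```python
class InvalidCursorError(Exception):
    """Stand-in for the error raised on a stale or malformed cursor."""


def handle_page(run_query, cursor, base_url):
    """Return ('ok', results), or ('redirect', base_url) for a bad cursor."""
    if not cursor:
        return "ok", run_query(None)
    try:
        return "ok", run_query(cursor)
    except InvalidCursorError:
        # Send the client back to the first page rather than failing.
        return "redirect", base_url


def fake_query(cursor):
    """Toy query for the example: only the cursor 'good' is accepted."""
    if cursor not in (None, "good"):
        raise InvalidCursorError(cursor)
    return ["issue-1", "issue-2"]
```

With this toy query, handle_page(fake_query, "stale", "/search") yields a redirect to "/search", so a bot replaying an old URL lands on a valid first page.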

--
Andi

>
> We may also want to tell robots (or all users) to go away if the offset is too large.
>
> --
> --Guido van Rossum (python.org/~guido)
>


Guido van Rossum

Apr 17, 2012, 2:14:47 PM
to Andi Albrecht, codereview-discuss
On Tue, Apr 17, 2012 at 10:57 AM, Andi Albrecht
<albrec...@googlemail.com> wrote:
> On Mon, Apr 16, 2012 at 9:31 PM, Guido van Rossum <gu...@python.org> wrote:
>> I was looking at some data recorded with Appstats recently and found
>> that the Googlebot makes some pretty inefficient requests. For
>> example, I saw it go to
>> http://codereview.appspot.com/all?limit=10&closed=1&offset=2720 --
>> note the huge offset! It's easy to imagine how it got there -- it
>> blindly follows the "older" link on the "Closed issues" tab.
>
> Thanks for bringing this up!
>
>>
>> These queries are very efficient -- Appstats showed that to get a
>> batch of 10 records starting at 2720, a RunQuery followed by two Next
>> calls are made, where the RunQuery and the first Next calls return no
>> results but set the "skipped_results" value in the batch to some large
>> number.
>
> I suppose you meant "inefficient" here?

Yes. ;-)

>> Remember that offset queries must internally produce all the matches
>> and then skip the first <offset> results; to protect itself the
>> datastore stops skipping and sends a partial result when it has
>> skipped many results (maybe the limit is 1000).
>>
>> These queries would be much more efficient using cursors.
>
> IMO this is worth an issue on the Rietveld tracker :)
> The paging on the issue pages is very old code and IIRC it already
> existed before cursors were available on App Engine. We should
> modernize this code.

Right, that was my thought too...

> It's not directly related, but I've seen some exceptions regarding
> invalid cursors caused by bots in the logs too (search page?). What
> would be a "bot-friendly" way to handle invalid cursors as URL
> parameters? Does anyone have a good example of how to deal with this?

What kind of errors? Could it be that the bot remembers a URL and then
comes back to it weeks later? Do we care?

Andi Albrecht

Apr 17, 2012, 3:34:50 PM
to Guido van Rossum, codereview-discuss

The exception is: BadRequestError: query app is 's~codereview-hr' but
cursor.position.key app is 'codereview'

It seems that the Googlebot requested some URLs it had cached before
the HR (High Replication datastore) transition. At least I don't see
those errors in the most recent version anymore.

Guido van Rossum

Apr 17, 2012, 3:40:30 PM
to Andi Albrecht, codereview-discuss
On Tue, Apr 17, 2012 at 12:34 PM, Andi Albrecht

Yeah, let's just ignore that. It will go away eventually.
