Google Bot Is Your Enemy


Brandon Wirtz

Sep 3, 2011, 3:35:31 PM
to google-a...@googlegroups.com

The biggest problem I have with the Scheduler/GAE isn’t GAE; it is Google Bot.

Under the new model you are on the hook for 15 minutes of time for an instance that spins up. Google Bot can’t be throttled on GAE. If you go into Webmaster Tools you get a “Your site has been assigned a special crawl rate” message. This was my favorite feature when I was paying for CPU cycles. But now Google Bot shows up and makes upwards of 100k requests in 5 minutes every 6 hours. On several sites that have less than $50 a month in hosting costs under the current plan, Google Bot will account for about 85% of the cost of hosting, and those sites will be paying about $400 a month to serve requests to Google come the November pricing.
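A rough back-of-the-envelope sketch of the arithmetic behind those figures. The burst size, burst length, burst interval, and 15-minute charge come from the post; the per-instance throughput (`REQS_PER_INSTANCE_PER_SEC`) is a purely hypothetical number, as is the assumption that every burst spins up fresh instances, so treat the instance-hour result as illustrative only.

```python
import math

# Figures taken from the post above.
BURST_REQUESTS = 100_000      # requests per crawl burst
BURST_SECONDS = 5 * 60        # each burst lasts about 5 minutes
BURSTS_PER_DAY = 24 // 6      # one burst every 6 hours
MIN_BILLED_MINUTES = 15       # time you are on the hook for per instance spin-up

# Hypothetical assumption, not from the post: sustained throughput of one instance.
REQS_PER_INSTANCE_PER_SEC = 10


def crawl_load():
    """Return (requests/sec during a burst, requests/day, instance-hours/day)."""
    burst_rate = BURST_REQUESTS / BURST_SECONDS
    requests_per_day = BURST_REQUESTS * BURSTS_PER_DAY
    # Instances needed to absorb the burst at the assumed throughput.
    instances = math.ceil(burst_rate / REQS_PER_INSTANCE_PER_SEC)
    # Assumes each spin-up is billed for at least 15 minutes, even for a 5-minute burst.
    billed_minutes = max(BURST_SECONDS / 60, MIN_BILLED_MINUTES)
    instance_hours = instances * BURSTS_PER_DAY * billed_minutes / 60
    return burst_rate, requests_per_day, instance_hours


if __name__ == "__main__":
    rate, per_day, hours = crawl_load()
    print(f"{rate:.0f} req/s in a burst, {per_day} req/day, {hours:.1f} instance-hours/day")
```

The point of the sketch is that a 5-minute burst is billed as if it lasted 15 minutes, so the billed time is three times the time actually spent serving the crawler.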

 

-Brandon

 

Joshua Smith

Sep 3, 2011, 4:15:46 PM
to google-a...@googlegroups.com
Is there some way to get google bot requests to go to a dedicated back end?

-- 
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Darien Caldwell

Sep 3, 2011, 5:22:09 PM
to Google App Engine
Nice, so basically Google is using their own service to tack on
additional charges to your bill. Doesn't sound ethical.

Rajkumar Radhakrishnan

Sep 3, 2011, 9:40:51 PM
to google-a...@googlegroups.com
@Brandon :

This is the case for one of my popular websites too. I believe that if latency increases, Google Bot will automatically scale down its crawl rate, but I also fear that the increased latency will have a negative effect on page ranking.

By the way, have you started using 304 (Not Modified) responses for pages which have not been modified? This can reduce the resource usage by Google Bot.
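A minimal sketch of what answering with a 304 could look like, assuming the app knows when each page last changed. The function name `conditional_response` and the plain-dict headers are illustrative, not part of any GAE API; `last_modified` must be a timezone-aware UTC `datetime`.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(request_headers, last_modified):
    """Return (status, headers) for a page last changed at `last_modified` (aware UTC)."""
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            since_dt = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            since_dt = None  # unparseable date: fall through to a full response
        # Only compare when the parsed date carries a timezone.
        if since_dt is not None and since_dt.tzinfo is not None:
            if last_modified <= since_dt:
                return 304, headers  # no body needed, nothing to render
    return 200, headers
```

The saving only materialises if the crawler actually sends `If-Modified-Since` and the 304 branch is cheaper than rendering the page.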

The worst-case option is to count requests from the bot (using memcache, of course) and send a 503 status for every Nth request (10 < N < 100).

503 Service Unavailable
The server is currently unavailable (because it is overloaded or down for maintenance). Generally, this is a temporary state.

This should also give a hint to Google Bot to scale down its crawl rate. It is useful when you want to retain better latency and want to hint Google Bot alone. Again, this can have an effect on page rank too, and I am not sure which is worse: bad latency or a 503.
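A minimal sketch of the every-Nth-request idea. `FakeMemcache` is a hypothetical in-memory stand-in for a shared counter store such as memcache, and the `Retry-After` header and the default `n=20` are assumptions added for illustration, not something the post specifies.

```python
class FakeMemcache:
    """In-memory stand-in for a shared counter store (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def incr(self, key, initial_value=0):
        # Increment the counter, creating it on first use, and return the new value.
        self._data[key] = self._data.get(key, initial_value) + 1
        return self._data[key]


def status_for_request(user_agent, cache, n=20, retry_after_seconds=3600):
    """Return (status, headers): 503 for every n-th Googlebot request, else 200."""
    if "googlebot" not in user_agent.lower():
        return 200, {}  # ordinary visitors are never throttled
    count = cache.incr("googlebot_requests")
    if count % n == 0:
        return 503, {"Retry-After": str(retry_after_seconds)}
    return 200, {}
```

With `n=20`, one in twenty crawler requests gets the 503 while regular visitors are untouched. Whether that is acceptable for search ranking is exactly the open question raised above.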

Does anyone else have experience in this space?

Thanks & Regards,
Raj

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Build online database applications, over Google App Engine.
iFreeTools Creator - http://creator.ifreetools.com


johnP

Sep 3, 2011, 9:50:32 PM
to Google App Engine
I think in webmaster tools you can change crawl-rate preferences.
Otherwise - just block it with robots.txt.
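A minimal sketch of a narrower robots.txt rule, checked with Python's standard `urllib.robotparser`. The robots.txt content and the `/search` path are hypothetical; the idea is to keep Googlebot out of one expensive path rather than the whole site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out of one expensive path only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /search

User-agent: *
Disallow:
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` (a product token such as "Googlebot") may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

For example, `allowed("Googlebot", "/search")` is false while `allowed("Googlebot", "/about")` is true. Blocking everything would stop the crawl cost, but it would also stop the pages from being crawled at all.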

saintthor

Sep 3, 2011, 11:03:49 PM
to Google App Engine
How about disallowing Googlebot in robots.txt or in code?

Googlebot is my biggest source of traffic too.

Brandon Wirtz

Sep 4, 2011, 12:07:09 AM
to google-a...@googlegroups.com

Returning a 503 is REALLY REALLY bad for SEO.

304 seems to be ignored by Google Bot on GAE. Google Bot will also try queries to which there are no links, and which no user has ever made.

If your latency goes higher, Google Bot will throttle back, but the only way to slow down page serving is to add a wait timer, which burns instance time. Rob Peter to pay Paul.

zdravko

Sep 4, 2011, 3:49:50 AM
to Google App Engine
And what would be wrong if Google were to constrain its bots to visit
GAE pages only once per week?

Or even better, what if GAE had a mechanism by which the apps could
announce when they have something new to be crawled?



Brandon Wirtz

Sep 4, 2011, 4:40:15 AM
to google-a...@googlegroups.com
I would settle for Google Bot playing by the same rules as it does on all other websites.

Though I do kind of think that the bandwidth consumed by GOOG on GOOG
infrastructure should be free.

Stephen

Sep 4, 2011, 9:54:57 AM
to google-a...@googlegroups.com
On Sun, Sep 4, 2011 at 8:49 AM, zdravko <email.w...@gmail.com> wrote:
>
> Or even better, what if GAE had a mechanism by which the apps could
> announce when they have something new to be crawled?

If only...

http://www.sitemaps.org/
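A minimal sketch of generating a sitemap with `lastmod` dates, which is how the Sitemaps protocol lets a site announce what has changed. The function name `build_sitemap` and the example URL are illustrative; only the XML namespace and element names come from the protocol itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, lastmod) pairs, with lastmod as 'YYYY-MM-DD'."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([("http://example.com/", "2011-09-04")]))
```

A sitemap tells the crawler what is new; it is a hint, and nothing in the protocol obliges a crawler to visit less often.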

Tim Hoffman

Sep 4, 2011, 10:07:56 AM
to google-a...@googlegroups.com
+1 from me on that score Brandon.  

T

Greg

Sep 5, 2011, 12:19:43 AM
to Google App Engine
On Sep 4, 8:40 pm, "Brandon Wirtz" <drak...@digerat.com> wrote:
> Though I do kind of think that the bandwidth consumed by GOOG on GOOG
> infrastructure should be free.

+1. It will be entirely internal to their network, and should be
extremely cheap if not free.

Sergey Schetinin

Sep 5, 2011, 12:41:12 AM
to google-a...@googlegroups.com
It might not be entirely internal network traffic, but at least the communication between App Engine apps should be discounted.

Steve Sherrie

Sep 3, 2011, 4:26:48 PM
to Google App Engine
Joshua,

I just read your post about idle instances and scheduling, and it
makes a lot of sense. Given this Google Bot issue, I also wonder if it
would be useful to have the ability to filter traffic into different
'types', each of which has different scheduling requirements.

Steve

stevesherrie

Sep 3, 2011, 7:08:24 PM
to Google App Engine
I've created this feature request
(http://code.google.com/p/googleappengine/issues/detail?id=5775) for the
ability to create scheduler profiles, filter requests into one of these
profiles, and maybe even route requests to a particular instance/backend.
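A minimal sketch of the filtering half of that idea: sorting requests into named profiles by User-Agent. The profile names, their settings, and `profile_for` are all hypothetical; this thread describes a feature request, not an existing GAE API, so what the scheduler would do with a profile is left open.

```python
# Hypothetical profiles; the names and settings are illustrative only.
PROFILES = {
    "crawler": {"max_instances": 1, "latency_sensitive": False},
    "interactive": {"max_instances": None, "latency_sensitive": True},
}

# Substrings that mark a request as coming from a crawler.
CRAWLER_TOKENS = ("googlebot",)


def profile_for(user_agent):
    """Pick a profile name for a request based on its User-Agent header."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in CRAWLER_TOKENS):
        return "crawler"
    return "interactive"
```

Matching on User-Agent is the simplest possible filter and can be spoofed, so a real implementation would likely need a stronger check.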

Star it if you think it will be helpful.

btw, sorry if this posts twice, I'm having a real buggy time with
google groups here.

Steve



Steve Sherrie

Sep 3, 2011, 5:22:27 PM
to Google App Engine
Joshua, can you see my posts? Not sure why I can't post....


Roch Delsalle

Sep 11, 2011, 8:01:52 AM
to google-a...@googlegroups.com
Here is what I noticed: http://www.d-ro.ch/2011/09/appengine-cloudflare-crawlrate
Anyway, you shouldn't block Googlebot; that would de-index you from Google.

Brandon Wirtz

Sep 11, 2011, 3:13:24 PM
to google-a...@googlegroups.com

Your crawl rate changed because the serving headers changed from GAE to CF (CloudFlare). In a week it will be back to normal.

 

 


Roch Delsalle

Sep 11, 2011, 5:16:37 PM
to google-a...@googlegroups.com
OK, I agree that must have had an impact, but how do you explain the crawl rate being down to "almost" zero for a month while my headers were being sent by GAE?

Brandon Wirtz

Sep 11, 2011, 5:31:03 PM
to google-a...@googlegroups.com

Without the domain name, I don’t. But a trick you can use to get deeper indexing on your pages is to change the server headers, IP address, and expiration every week or so.

Google Bot responds to change with “let’s crawl everything”; when it finds less change, it reduces the crawl.

 

 
