I posted about this a while back, but with the new pricing in effect I’m extra sensitive.
Because we can’t set the Googlebot crawl rate in Webmaster Tools, occasionally the bot gets overzealous. I wouldn’t mind if the bot crawled 2 pages for every visitor that Google sends my way… but on some of the bad days it is currently running at about 100 to 1.
I trimmed back the number of idle instances to slow the site down, but that is a horrible solution. Ramping up latency so that the user experience sucks more is a hack, not a “fix”.
This isn’t even the worst offender: we killed a domain that was getting north of 300k crawled pages a day by blocking the bot altogether, because the site was only meant for AdWords landing pages, not for search rankings.
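For that landing-pages domain, the full block can be done with a plain robots.txt at the site root; this is the standard directive, sketched generically since the actual domain isn’t named here:

```
User-agent: Googlebot
Disallow: /
```

Note this only stops compliant crawling; pages already indexed may linger until they are dropped or removed via Webmaster Tools.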
We have tried setting the cache headers longer and longer, thinking that would reduce the crawl. We started serving the oldest cached version of each page so that the pages wouldn’t change from visit to visit.
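The longer-cache-headers attempt looks roughly like this, as a minimal sketch assuming a Python runtime; `cache_headers` is a hypothetical helper, not a GAE API:

```python
# Hypothetical helper: build response headers marking a page publicly
# cacheable for a long window, so App Engine's edge cache and any
# downstream caches can answer repeat requests instead of a dynamic
# instance burning a billed request.

def cache_headers(max_age_days=30):
    """Return headers for a long public cache lifetime."""
    seconds = max_age_days * 24 * 60 * 60
    return {
        # "public" allows shared caches (not just the browser) to store it.
        "Cache-Control": "public, max-age=%d" % seconds,
        # A stable validator lets a crawler revalidate cheaply (304 Not
        # Modified) instead of re-downloading an unchanged page.
        "ETag": '"v1"',
    }

print(cache_headers(30))
```

Whether Googlebot actually honors the longer lifetime is a separate question, which is the complaint here.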
We removed AdSense from the pages so that pages/queries that were only shown once via a query would never hit Google’s radar and be included for crawls.
Normally sites love this kind of crawl rate; as an SEO I sell “deep indexing” as a reason to be on GAE. But there needs to be an “I’m tapping out” button, because the Googlebot comes, reads my content, never buys anything, never clicks an ad, and won’t even click the “like this” button on Facebook. Other than the once in a great while that it tells its friends we are the top hit for “islam medical video editing scandal” (true story), we don’t gain much from the massive draw it puts on the server and the costs that we incur from the load.
The ability to tune the crawl, or having GAE not charge for requests from Google IPs would be great options.

I think Googlebot punches through the edge cache too.
I can see the peaks caused by the Googlebot hitting my systems, but in the request breakdown my dynamic requests go up while my cached requests stay where they were. Even if the Googlebot wasn’t getting quite the cache-hit ratios users do, you’d expect there to be a lift during these times. I mean, the bot raises traffic 10-fold for an hour; you’d think I’d see a 20% lift in cached requests during that window.
I’m mostly just “venting” since I know the Crawl team won’t see this, and my guy over there has long since moved on… but just something I noticed.
--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/68trxyxxuaMJ.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
Now Google needs to go through and find all instances that aren’t here and remove those pages from the index as search spam….
Matt Cutts and I have fought over this. I have suggested honeypotting multiple times and no one likes the idea but me. I even built a HUGE golden honeypot of a billion nonsensical phrases that didn’t exist in search results and then used RSS and email to seed them to spam sites. The result was that I skyrocketed to Alexa 10k, and when I shared the list of pirate sites that had scraped the site, I was delisted and the rankings were spread across 100k spam sites. It was awesome.
I’m not even “Greatest Living American” anymore.