Something to pass along to the google search team


Joshua Smith

Sep 13, 2011, 9:41:49 AM
to google-a...@googlegroups.com
In http://highscalability.com/blog/2011/9/7/what-google-app-engine-price-changes-say-about-the-future-of.html
the author wrote:

> With each crawl costing money, the whole idea of crawling the Internet will have to change.

which led me to a thought: since Googlebot is crawling zillions of web sites, a change from depth-first crawling to breadth-first crawling would make a huge difference here. Dunno if that's practical, but it would be a nice thing for the Google search guys to look into, to make GAE and Googlebot more compatible. Because right now, there's a lot of evidence that they are accidentally conspiring to be evil.

-Joshua

Tim

Sep 13, 2011, 10:55:00 AM
to google-a...@googlegroups.com

Google webmaster tools 

  https://www.google.com/webmasters/tools/home

lets you (amongst other things) submit sitemaps and see the crawl rate for your site (for the previous 90 days). There's also a form to report problems with how Googlebot is accessing your site.


The crawl rate is modified to try to avoid overloading your site, but given that GAE will just fire up more instances, I guess Googlebot decides your site is built for such traffic and keeps upping the crawl rate. You could try to mimic a site being killed by the crawler: keep basic stats in memcache every time you get hit by Googlebot (as identified by the request headers), and if the requests come too thick and fast, delay the responses or simply return a 503 (or maybe a 408 or 509) response. My guess is you'll see the crawl rate back off pretty quickly.
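A minimal sketch of that throttling idea, using a plain in-process dict as a stand-in for memcache (on App Engine you'd use the memcache API instead); the window size and hit limit are hypothetical numbers, not anything Googlebot documents:

```python
import time

WINDOW = 60     # seconds per counting window (assumed)
MAX_HITS = 30   # crawler requests tolerated per window (assumed)

_hits = {}      # window index -> request count (memcache stand-in)

def crawler_status(user_agent, now=None):
    """Return 200 if the request should be served normally,
    or 503 to tell the crawler to back off."""
    if "Googlebot" not in user_agent:
        return 200                          # only throttle the crawler
    now = time.time() if now is None else now
    window = int(now // WINDOW)
    _hits[window] = _hits.get(window, 0) + 1
    if _hits[window] > MAX_HITS:
        return 503                          # "service unavailable" -> back off
    return 200
```

In a real handler you'd also set a Retry-After header on the 503 so well-behaved crawlers know how long to wait.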


Would be nice if robots.txt or sitemap files let you specify a maximum crawl rate (cf. RSS files), or perhaps people could agree on an HTTP status code for a "we're close, but not THAT close..." response to tell crawlers to back off (418, perhaps :) but I don't expect those standards have moved very much recently...
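For what it's worth, a non-standard Crawl-delay directive for robots.txt does exist and is honored by some crawlers (Bing and Yandex, for example), though Googlebot ignores it; for Google the crawl rate can only be adjusted through Webmaster Tools. A sketch of what it looks like:

```
# robots.txt -- ask compliant crawlers to wait at least 10 seconds
# between fetches. Non-standard; Googlebot ignores this directive.
User-agent: *
Crawl-delay: 10
```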

--
T

Joshua Smith

Sep 13, 2011, 11:04:41 AM
to google-a...@googlegroups.com
Sure, but if they just went breadth-first (putting pages to crawl into the tail of a work queue that spans hundreds of sites), then there wouldn't be a spike at all.

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/92F2o_-16zMJ.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Tim

Sep 13, 2011, 1:38:12 PM
to google-a...@googlegroups.com


On Tuesday, 13 September 2011 16:04:41 UTC+1, Joshua Smith wrote:
> Sure, but if they just went breadth-first (putting pages to crawl into the tail of a work queue that spans hundreds of sites), then there wouldn't be a spike at all.


I expect there's something about wanting to pull back a series of pages from a single site together, to get a consistent set of pages (especially with session cookies, sessions encoded in URLs, and the like), not to mention little things like HTTP pipelining, the internal management of assigning machines (including timeouts, failovers and retries), updating databases with results and meta-results, and 101 other things I can't even start to think about. Not to say it can't be done, but I think it'd have a lot of hidden implications.

Still, you did say "dunno if it's practical" - I was just wondering about other ways to make Googlebot more compatible with GAE and GAE-like systems.

--
T

Brandon Wirtz

Sep 13, 2011, 3:17:08 PM
to google-a...@googlegroups.com

That's all good info, but it doesn't apply if you are on GAE. If you are on GAE you can't specify your crawl rate; it is assigned a special crawl rate.

--
