It seems this was a difficult question.
For the time being, until I find a way to block only the non-Google-operated bots coming from Google's IP address space, I've blocked every instance of Googlebot.
--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
What is the best way to block Google App Engine from accessing a particular domain? I'm seeing signs of attempted abuse and am not inclined to chase and flag the offenders one by one. Moreover, I categorically don't trust Google App Engine in any non-Google hands, so I'm looking for a simple way to lock out all Google App Engine access.
Jeff
If you want to block Google App Engine, just block the App Engine user agent in your .htaccess or equivalent. App Engine users can't change the User-Agent the way they can with curl, so if you block the user agent you block all of App Engine.
For those of you wondering why you would do this: App Engine makes a great proxy, and since its requests come from hundreds of addresses, an App Engine user can bypass an API limit that is enforced per IP. Also, because of the way IPs are round-robined, it appears that App Engine and Googlebot sometimes share IPs, so you wouldn't want to block all of Google, because your site wouldn't get indexed.
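For anyone not on Apache, here is a rough sketch of the same idea as a Python WSGI middleware. It assumes App Engine's outbound fetches carry an "AppEngine-Google" marker in the User-Agent header, so check your own access logs before relying on it. The names BlockAppEngine and is_app_engine_request are made up for this example.

```python
# Assumption: outbound App Engine requests include this marker in their
# User-Agent header. Confirm against your own logs before deploying.
BLOCKED_UA_MARKER = "appengine-google"


def is_app_engine_request(user_agent):
    """Return True when the User-Agent header carries the App Engine marker."""
    return BLOCKED_UA_MARKER in (user_agent or "").lower()


class BlockAppEngine:
    """Hypothetical WSGI middleware: answer 403 to App Engine user agents."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_app_engine_request(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Wrap your existing WSGI application with BlockAppEngine(app). Because the match is on the header only, it does not touch Googlebot, which sends a different User-Agent.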
--
App Engine is just a thing for hosting web sites and web apps--
anything you could do on App Engine, you could also do on GoDaddy,
Bluehost, Rackspace, Azure, 1&1, Webfaction, Dreamhost, Slicehost,
Heroku, Engine Yard, A Small Orange, Amazon EC2, etc.
Things I *would* call "search engine related API development frameworks":
- http://www.rollyo.com/
- Google Custom Search (and its API)
- Yahoo Boss: http://developer.yahoo.com/search/boss/
--
Ross M Karchner
Nick is right. If you want to offer an API, you shouldn't block based on the client being used (such as curl); restrict access with something like a username/password, OAuth, a token, or any of a dozen other methods.
If you are just talking about a web site, not an API, then I could see preventing scraping by allowing "nothing but humans and the big 3 search bots". I have done this to sites in the past. Of course, you then also have to check for "browsing faster than humans do" or "not rendering the page". I had a site where we wanted users to sign up for coupons à la Groupon, and we had trouble with other people snarfing the deals of the day and the promo codes. So we implemented a rule: "you have to download this .js file contained in the HTML once every 3 page views". It worked well, even if it was security through obscurity.
-Brandon
--