Re: [google-appengine] Google App Engine, rogue crawlers, and PageSpeed Insights


Jeff Schnitzer

Jul 26, 2012, 5:27:27 PM
to google-a...@googlegroups.com
Every fetch request from GAE includes the appid as a header... you
obviously see it yourself, which is how you know the appid of the
crawler. This is how Google enables you to block applications; just
block all requests with that particular header.

Jeff
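The appid Jeff describes appears in the crawler's User-Agent string (the full string is quoted later in this thread). A minimal sketch of such a header check in Python; the function name and regex are illustrative, not from the thread:

```python
import re

# Requests made via App Engine's urlfetch carry a User-Agent like:
#   AppEngine-Google; (+http://code.google.com/appengine; appid: s~steprep)
APPID_PATTERN = re.compile(r"AppEngine-Google;.*appid:\s*s~steprep")

def is_blocked_crawler(user_agent):
    """Return True if the request comes from the blocked App Engine app."""
    return bool(APPID_PATTERN.search(user_agent or ""))
```

The same pattern is what the mod_rewrite rule later in the thread matches against.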

On Wed, Jul 25, 2012 at 9:35 AM, jswap <js...@yahoo.com> wrote:
> I run a website containing lots of doctor-related data. We get crawled by
> rogue crawlers from thousands of IP addresses DAILY (mostly in Russia) and
> we sometimes see our content show up on other websites. I define a crawler
> as "rogue" when it does not obey robots.txt exclusions, and the crawling
> company offers no benefit to us and just sucks up system resources.
>
> Google App Engine is hosting a crawler (appid: s~steprep) that is similar to
> the Russian ones we block. This crawler crawls us aggressively, sucks up
> system resources, ignores the robots.txt file, and offers no benefit to us.
> Per our usual policy, we have been blocking the hundreds of Google IP
> addresses that this crawler is crawling from. The problem is that one or
> more of these IP addresses also hosts Google's "PageSpeed Insights" page,
> located here: https://developers.google.com/speed/pagespeed/insights
>
> My questions for Google are:
> 1 - Is it your intention that websites be unable to block crawlers that you
> host?
> 2 - Is it your intention that websites must allow the steprep crawler in
> exchange for using the PageSpeed Insights tool?
>
> Some people may suggest "why not just ask the company crawling you to stop
> crawling you?"
> 1 - Some companies ignore the request.
> 2 - Some companies temporarily stop crawling, then show up again a few days
> or weeks later, at which point I have to waste time dealing with it all over
> again.
>
> If we were to allow every crawler to crawl our site, our server would be
> brought to its knees. I'm not going to waste money on increasing server
> resources just so more crawlers can scrape our data. Website owners need a
> mechanism for blocking rogue crawlers, even when they are hosted by Google
> App Engine.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/Bo8u134CRr8J.
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengi...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.

jswap

Jul 26, 2012, 8:47:11 PM
to google-a...@googlegroups.com, je...@infohazard.org
Thanks, Jeff, but how do I block requests by header and not by IP? I usually use iptables to block the requests, but cannot do so in this situation, because then I would also block access to Google's PageSpeed Insights tool.

Jeff Schnitzer

Jul 26, 2012, 9:41:06 PM
to google-a...@googlegroups.com
It would have to be by something at "Layer 7" that understands HTTP.
What web server/technology are you using? With Apache you can do it
with mod_rewrite.

Blocking IP addresses is really a clumsy way to do it anyway, since
GAE urlfetch changes IP ranges periodically.

If you really don't like the scraper, I suggest an alternative to
simply blocking them. That'll probably just put a bunch of errors in
their logs and alert them to the problem. More fun is to silently
replace the content with something nefarious. The best option would
probably be content that Googlebot will detect as being spammy/low
quality, so it kills their search ranking.

Jeff

jswap

Jul 26, 2012, 10:16:35 PM
to google-a...@googlegroups.com, je...@infohazard.org
I like the way your mind works, Jeff :)

I did some googling and found the specifics on how to block using Apache's mod_rewrite. For the benefit of others, I'm posting it here:

Inside your virtual host:

RewriteEngine on
# start
RewriteCond     %{HTTP_USER_AGENT}  ^AppEngine-Google;.*appid:.*steprep
RewriteRule .* - [F,L,E=nolog:1]
# end

# env=!nolog tells Apache not to log when the nolog env var is set. You probably already have this line, so just append the " env=!nolog"
CustomLog logs/access_log combined  env=!nolog



Drake

Jul 26, 2012, 11:45:45 PM
to google-a...@googlegroups.com
And then when the Google spam-team bot shows up, you would be delisted... That
would rock...

Don't ever serve a bot anything other than the content or a permission
denied (maybe a busy signal).


If you have an .htaccess file, try the following; you can add App Engine, or any
other bot, pretty easily using this.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craf...@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

hyperflame

Jul 27, 2012, 2:38:45 PM
to Google App Engine
Alternately, you could institute a rate-limiting mechanism. If a user
asks for more than X pages over a specified time period, serve up an
HTTP 429 (Too Many Requests) error code. Legitimate bots
such as Googlebot will slow down their requests, while poorly written
bots will most likely fail.
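A minimal in-memory sketch of such a rate limiter in Python; the window and limit values are arbitrary, and a production version would need state shared across server processes:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # time window to count requests over
MAX_REQUESTS = 120    # allowed requests per window per client

_hits = defaultdict(deque)  # client id -> timestamps of recent requests

def allow_request(client_id, now=None):
    """Return True if the client is under the limit, else False (send 429)."""
    now = time.time() if now is None else now
    q = _hits[client_id]
    # Drop timestamps that have fallen out of the window.
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True
```

When `allow_request` returns False, the server would respond with status 429 instead of the page body.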

You could also put a trap into your robots.txt. List a URL in your
robots.txt that goes to a servlet; if a client hits that servlet, its
IP is banned for some amount of time.
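The robots.txt trap could be sketched like this in Python; the trap path and ban duration are illustrative, not from the thread. Well-behaved crawlers never fetch a Disallow'd URL, so any client that does is treated as rogue:

```python
import time

TRAP_PATH = "/no-crawl-trap"   # illustrative path, listed as Disallow in robots.txt
BAN_SECONDS = 3600             # how long a trapped IP stays banned

_banned = {}  # ip -> time the ban expires

def handle_request(path, ip, now=None):
    """Return 'banned', 'trapped', or 'ok' for an incoming request."""
    now = time.time() if now is None else now
    if _banned.get(ip, 0) > now:
        return "banned"
    if path == TRAP_PATH:
        _banned[ip] = now + BAN_SECONDS
        return "trapped"
    return "ok"
```

The corresponding robots.txt entry would be `Disallow: /no-crawl-trap`.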

Jeff Schnitzer

Jul 27, 2012, 4:59:04 PM
to google-a...@googlegroups.com
On Thu, Jul 26, 2012 at 8:45 PM, Drake <dra...@digerat.com> wrote:
> And then when Google Spam team bot shows up you would be delisted... That
> would Rock...

It's highly improbable that anyone in an official capacity at Google
will ever view your page with the exact User-Agent:

AppEngine-Google; (+http://code.google.com/appengine; appid: s~steprep)

Jeff

Kate

Aug 2, 2012, 4:01:33 PM
to google-a...@googlegroups.com, je...@infohazard.org
I am having a similar problem and still cannot find an answer. The requests are all curl requests and I have tried everything I can think of.

I tried using appengine_config.py and checking for a user agent but that didn't work. All the IP addresses are different.

Surely there must be a solution for this sort of problem.

Kate

Aug 2, 2012, 4:08:21 PM
to google-a...@googlegroups.com, je...@infohazard.org
How can I block the following curl requests? Not every IP is different, and I get tens of thousands of them every day.

Honestly, I do not know HOW to block them. What method/code?


2012-08-02 15:03:21.103 / 405 55ms 0kb curl/7.18.2 (i386-redhat-linux-gnu) libcurl/7.18.2 NSS/3.12.2.0 zlib/1.2.3 libidn/0.6.14 libssh2/0.18
132.72.23.10 - - [02/Aug/2012:13:03:21 -0700] "HEAD / HTTP/1.1" 405 124 - "curl/7.18.2 (i386-redhat-linux-gnu) libcurl/7.18.2 NSS/3.12.2.0 zlib/1.2.3 libidn/0.6.14 libssh2/0.18" "aussieclouds.appspot.com" ms=56 cpu_ms=0 api_cpu_ms=0 cpm_usd=0.000045 instance=00c61b117c41a67b1b944a189d7cc38d5365564c
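One common way to block requests like these by User-Agent on App Engine is WSGI middleware, which `appengine_config.py` can wrap around the app. A sketch under that assumption; the blocked substring list and class name are illustrative, and this is not a confirmed fix for this particular setup:

```python
BLOCKED_UA_SUBSTRINGS = ["curl/"]  # illustrative: block libcurl-based clients

class UserAgentBlocker(object):
    """WSGI middleware that answers 403 for blocked user agents."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
            # Short-circuit before the real app ever runs.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return self.app(environ, start_response)
```

Note that blunt User-Agent matching like this would also block legitimate curl users; a real deployment would want a narrower pattern.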

Stuart Langley

Aug 4, 2012, 7:59:32 AM
to google-a...@googlegroups.com, je...@infohazard.org
A 405 is being returned for these requests anyway.

The incoming rate is <1 QPS - besides filling up your logs, I'm not sure how, if at all, this is affecting your app.