How to block Google App Engine?

Álvaro Degives-Más

Mar 28, 2011, 6:06:54 PM
to google-a...@googlegroups.com
What is the best method to block Google App Engine access to a particular domain? I'm seeing fingerprints of attempted abuse and am decidedly not inclined to chase and flag them one by one; moreover, I categorically don't trust Google App Engine in any non-Google hands. So I'm looking for a simple way to lock out any Google App Engine access.

Álvaro Degives-Más

Mar 29, 2011, 1:26:06 AM
to google-a...@googlegroups.com
It seems this was a difficult question. For the time being, until I learn a way to specifically block non-Google operated bots coming from the Google IP address space, I've blocked any and all instances of Googlebot.

Barry Hunter

Mar 29, 2011, 12:58:23 PM
to google-a...@googlegroups.com
On 29 March 2011 06:26, Álvaro Degives-Más <adegi...@gmail.com> wrote:
It seems this was a difficult question.

Why do you say that?
 
For the time being, until I learn a way to specifically block non-Google operated bots coming from the Google IP address space, I've blocked any and all instances of Googlebot.

One way is via the user agent. All requests coming from App Engine clearly identify themselves in the User-Agent header. There is no way for the developer to send a custom (or fake) header.

 

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Nick Johnson (Google)

Mar 30, 2011, 10:36:43 PM
to google-a...@googlegroups.com, Álvaro Degives-Más
Hi Álvaro,

As Barry says, you can do this by checking the user-agent; while App Engine apps may add to the user-agent header, they cannot remove their own App ID from it.
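
A minimal Python sketch of the user-agent check described above. The exact marker format assumed here ("AppEngine-Google; (+http://code.google.com/appengine; appid: APPID)") is not confirmed in this thread, and the function name appengine_app_id is hypothetical.

```python
import re

# Assumed marker format; the thread only confirms that the App ID is present
# in the header and cannot be removed by the app.
_APPENGINE_UA = re.compile(r"AppEngine-Google;.*?appid:\s*([^\s;)]+)")


def appengine_app_id(user_agent):
    """Return the App ID embedded in a User-Agent string, or None."""
    match = _APPENGINE_UA.search(user_agent or "")
    return match.group(1) if match else None
```

The pattern is searched anywhere in the string rather than anchored at the start, since apps may add their own text to the header.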

Can you elaborate on why you want to do this, though? We take apps that violate our TOS seriously, and if you're speaking about abuse that isn't in violation of our TOS, it's a big bad internet out there, and there are plenty of other sources of abuse; blocking App Engine as a whole probably won't do much to help, but it will cause significant collateral damage for legitimate apps that want to use your service.

-Nick Johnson

--
Nick Johnson, Developer Programs Engineer, App Engine


Álvaro Degives-Más

Mar 31, 2011, 8:48:35 PM
to google-a...@googlegroups.com, Álvaro Degives-Más
Hi Nick - and by extension, Barry as well (unfortunately I appear to have sent my reply directly to him - my apologies as I didn't CC myself so I can't share what exactly I wrote!)

First of all, rest assured that my concerns are not necessarily with Google App Engine, but rather the species of search engine related API development frameworks that rely on that particular address space, perhaps more commonly referred to as cloud leveraged app platforms.

The problem is that search engines - such as Google's - are routinely polluted; that is not attributable to negligence but it's the same sad reality nonetheless. Such polluted entries (e.g. certain queries) are used as a vector tampering with other, external properties. No amount of "sanitization" can counter the fundamental lack of a "permissible URL tokenizing" framework, i.e. something which communicates in a uniform manner to all interested parties (i.e. the Google family) what a "permissible" URL looks like.

Sadly, the robots.txt syntax and the meta tag nofollow,noindex both lack this "syntax whitelisting" feature; they are not prescriptive ("only crawl and index the URLs that look like this, and ignore the rest"). Of course, with many if not most standard on-site search queries, it is possible to script page headers that include nofollow,noindex metatags. But many other kinds of dynamic content aren't easily "wrapped" with such headers.

And that is where abuse of poisoned search engine indexes comes into play.

Just as I can't hunt down every non-canonical URL in the Google index, flagging issues case by case is not only ineffective (as my logs demonstrate) but practically prohibitive as well (I assume you can imagine that I'm not interested in hunting down all search-engine-based botnet traffic and relating it to individual sources), so my alternative is to simply shut down access to search engines. I don't have the time or the resources to play whack-a-mole with the ever-increasing scourge of botnets. Incidentally, the traffic evolution in my logs and a cursory look at some well-known email spam statistics suggest that there is indeed a major shift afoot: the miscreants out there are moving their invasive "advertising" methods from email to (particularly) smaller web properties.

And that is exactly what I have chosen to do: the well-behaved search engines (Google, Bing, Yahoo) are informed via robots.txt that they are not welcome, and their indexes are cleared out; the ill-behaved ones are blocked and upon sight rigorously reported to blacklists.

Until there is something available which gives website proprietors (especially the small to medium sized ones!) a trivial and effective means to control which content is accessible for storage and further processing in the cloud, the internet will continue to shrink.

Indeed, with heavy heart. But I don't have the resources to keep my web-based property open to "play nice" with worthwhile endeavors such as Google App Engine, while a notorious minority of criminals (I openly prefer the "terrorist" moniker) runs amok with virtual impunity. And so, I set a tight regime for wrapper security scripts (e.g. ZB Block, which I find quite effective and flexible).

Hopefully you now understand better; it's not that I mistrust Google, or Google App Engine in particular. I just can't afford to be available for well-intended fun and games while carrying the weight of incidental abuse at my own expense.

Jeff Schnitzer

Mar 31, 2011, 8:58:42 PM
to google-a...@googlegroups.com
Can someone translate this into english?

Jeff

Álvaro Degives-Más

Mar 31, 2011, 9:02:02 PM
to google-a...@googlegroups.com
Friendly reminder: "English" is capitalized. :-)

Philip

Apr 1, 2011, 12:40:32 AM
to Google App Engine
Google Search has nothing to do with App Engine. According to their
privacy policy, they don't have a right to use our App Engine data at
all.

Álvaro Degives-Más

Apr 1, 2011, 12:45:51 AM
to google-a...@googlegroups.com
Peril travels the other way around: Google App Engine can use index data from Google. Or other SEs for that matter.

Brandon Wirtz

Apr 1, 2011, 1:36:41 AM
to google-a...@googlegroups.com

If you want to block Google App Engine, just block the App Engine user agent in your .htaccess or equivalent. Users can't change the User-Agent the way they can with curl, so if you block the user agent you block all of App Engine.

 

For those of you who were wondering why you would do this: App Engine makes a great proxy, and since its requests come from hundreds of addresses, an App Engine user can bypass any API limit you enforce per IP. Also, because of the way IPs are round-robined, it appears that App Engine and Googlebot sometimes share IPs, so you wouldn't want to block all of Google, because your site wouldn't get indexed.

--

Robert Kluin

Apr 1, 2011, 2:32:05 AM
to google-a...@googlegroups.com
Wow. That is a substantial block of text. Are you trying to say you
are mad because some App Engine app is proxying your site?

Álvaro Degives-Más

Apr 1, 2011, 3:18:28 AM
to google-a...@googlegroups.com
No Robert: the point is to preclude Google from perpetrating evil, even when it's committed against its own intent. Incidentally, anger is a wasted emotion in security issues. It's almost as bad as overlooking basic security while commenting on the length of messages.

Kudos to Brandon for pointing out the elephant in the middle of the ball pit.

Calvin

Apr 1, 2011, 3:28:47 AM
to google-a...@googlegroups.com
Good gravy! Your point is so veiled in flowery metaphor that I still haven't understood what the core issue is.  I thought Robert had it for a second there, but no.

Now my theories are that you're either a Turing-test bot or a Batman villain.  Is Google sponsoring an exhibit at the Gotham Museum of Art?

Ross M Karchner

Apr 1, 2011, 8:32:13 AM
to google-a...@googlegroups.com
I think the phrase "search engine related API development frameworks"
might be the key to the misunderstanding here-- App Engine and Google
Search are simply products of the same company. It's like calling
Microsoft Windows a "search related operating system" simply because
Microsoft also makes Bing.

App Engine is just a thing for hosting web sites and web apps--
anything you could do on App Engine, you could also do on GoDaddy,
Bluehost, Rackspace, Azure, 1&1, Webfaction, Dreamhost, Slicehost,
Heroku, Engine Yard, A Small Orange, Amazon EC2, etc.

Things I *would* call "search engine related API development frameworks":

- http://www.rollyo.com/
- Google Custom Search (and its API)
- Yahoo Boss: http://developer.yahoo.com/search/boss/


--
Ross M Karchner

Darien Caldwell

Apr 1, 2011, 11:43:12 AM
to Google App Engine
If you care so little about search engine rank that you're willing to
block all search engines, why block them at all? Let them rank you
badly or well; what would it matter if you don't care
about it?

Maybe if you more clearly defined this sentence, it would make sense:

"polluted entries (e.g. certain queries) are used
as a vector tampering with other, external properties"

Give concrete examples.

Nick Johnson (Google)

Apr 4, 2011, 1:02:40 AM
to google-a...@googlegroups.com
Hi Álvaro,

Please correct me if I'm wrong, but it's my understanding that you operate a service that provides an API, and that you're seeing abuse of that API from an app that runs on App Engine.

First of all, I'd like to reiterate that App Engine has clear policies on acceptable use. If the app in question is in violation of those, there are consequences - and if that's the case, you should send us the App ID so we can investigate.

More generally, assuming the behaviour consists of abuse of your API but is not in itself in violation of the TOS, there's a criterion that I think needs to be applied when anyone is considering a block across a broad category (e.g., a range of IP addresses). The threshold depends on your particular sensitivity to false positives and false negatives, but in general the test is simply to assess the proportion of legitimate and unwanted traffic coming from the target, and compare it to what you see from the wider internet.

If the proportion of unwanted traffic is equal to or lower than what you see elsewhere - and I believe you will find that is the case for App Engine and most other well-run services where apps' outgoing requests share a pool of IPs - it makes little sense to block the service, as you could extend the same argument to every other segment of your users, resulting in shutting down your entire service.

A much better option is to distinguish applications on a more granular basis, for which we provide the application's App ID in the User-Agent header. I would encourage you to reconsider filtering based on this. Since you're operating a blacklist, you do not even need to check the source IP against our netblocks; apps outside App Engine have nothing to gain by imitating an App Engine app, and those within App Engine cannot erase their App ID from the header.
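
The per-app filtering suggested above could be sketched as a small WSGI middleware in Python. This is only an illustration under stated assumptions: the "appid: APPID" marker format is assumed rather than confirmed in this thread, and BLOCKED_APP_IDS, block_listed_apps and the App ID "abusive-app" are hypothetical names.

```python
import re

# Assumed marker format inside the App Engine User-Agent string.
APPID_RE = re.compile(r"appid:\s*([^\s;)]+)")

# Hypothetical blacklist of App IDs seen abusing the service.
BLOCKED_APP_IDS = {"abusive-app"}


def block_listed_apps(app, blocked=BLOCKED_APP_IDS):
    """Wrap a WSGI app and answer 403 to requests from blacklisted App IDs."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        match = None
        if "AppEngine-Google" in user_agent:
            match = APPID_RE.search(user_agent)
        if match and match.group(1) in blocked:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else, including other App Engine apps, passes through.
        return app(environ, start_response)

    return middleware
```

No source-IP check is made, in line with the point above that a blacklist does not need one.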

-Nick Johnson


Kaan Soral

Apr 4, 2011, 3:11:45 PM
to Google App Engine
Wow I haven't encountered such a complicated text for a long time.

Álvaro Degives-Más

Apr 6, 2011, 4:02:43 AM
to google-a...@googlegroups.com
Hi Nick, that's excellent advice, and a good analysis from the quantitative end of things. The problem is that abuse does not manifest itself in a statistically more or less uniform manner, so as to allow applying risk factoring as you suggest it.

It's an issue that goes far beyond Google App Engine; I was merely looking for ways to strike a fair balance. But in the end, cost-benefit advises against the greater granularity you also suggest. To put it in another analogy: sometimes, disconnecting the line is the more efficacious solution for harassment calls.

Álvaro Degives-Más

Apr 6, 2011, 4:21:31 AM
to google-a...@googlegroups.com
Ross, thanks for your clarification on my crude classification, but the key aspect for me is that Google App Engine operates from within Google's address space. So, what a Google App Engine powered service does as intended, i.e. its functional typology, is much less important (or defining) to me than the fact that it comes traveling from Google. Or to put it less elegantly: not all those you mention have as much in the balance, and I'm therefore much less disturbed by false positives in their cases.

Brandon Wirtz

Apr 6, 2011, 4:23:47 AM
to google-a...@googlegroups.com

Nick is right.  If you want to have an API, you shouldn't block based on the method of "curl" being used, but use something more like user/pass, OAuth, a token, or any of a dozen other methods.
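
As one possible illustration of the token option, here is a minimal Python sketch that derives a per-client token with HMAC-SHA256. This scheme is an assumption for the example and not something described in the thread; SECRET, issue_token and is_valid_token are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical server-side secret; in practice it would come from configuration.
SECRET = b"replace-me"


def issue_token(client_id, secret=SECRET):
    """Derive a hex token for a client ID using HMAC-SHA256."""
    return hmac.new(secret, client_id.encode("utf-8"), hashlib.sha256).hexdigest()


def is_valid_token(client_id, token, secret=SECRET):
    """Check a presented token using a constant-time comparison."""
    expected = issue_token(client_id, secret)
    return hmac.compare_digest(expected.encode("utf-8"), token.encode("utf-8"))
```

A request would then carry the client ID and token, and the API would reject anything for which is_valid_token returns False, regardless of which host or user agent sent it.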

 

If you are just talking about a web site, not an API, then I could see preventing scraping by saying "nothing but humans and the big 3 search bots". I have done this to sites in the past.  Of course you then also have to check for "browsing faster than humans do" or "not rendering the page". I had a site where we wanted users to sign up for coupons à la Groupon, and we had trouble with other people snarfing the deals of the day and the promo codes, so we implemented a "you have to download this .js file contained in the HTML once every 3 page views" check. It worked well, even if it was security through obscurity.
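
The "browsing faster than humans do" check could be approximated with a sliding-window counter per client. The Python sketch below is only one way to do it; the class name RateGate, the thresholds, and the choice of client key (IP, session, or App ID) are all assumptions, not details from the thread.

```python
import time
from collections import defaultdict, deque


class RateGate:
    """Allow at most max_hits requests per client within window_seconds."""

    def __init__(self, max_hits=10, window_seconds=10.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client key -> recent request times

    def allow(self, client_key, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[client_key]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False
        hits.append(now)
        return True
```

A page handler would call allow() once per page view and serve a 403 or a challenge when it returns False.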

 

-Brandon

--

Álvaro Degives-Más

Apr 6, 2011, 4:27:56 AM
to google-a...@googlegroups.com
You're absolutely right. I'm not much concerned with search engine ranking; that's a hobby I don't care for. I'm solely concerned with appearing in search engine indexes when people search for specific stuff as a potential vector. That is what prompts me to go about with a machete.

What too many people overlook too often is that from the web asset proprietor's end, control over what is indexed (aside from what is crawled) is limited.

Álvaro Degives-Más

Apr 6, 2011, 4:51:39 AM
to google-a...@googlegroups.com
In my case, I have both; the API isn't one particularly apt for user authentication (i.e. I'm not looking at limiting users to a somehow pre-accredited group) but more for validation, in a more or less trivial sense. The thing is, as I'm essentially seeing that "trivial" doesn't exist, the API is going to be axed, even though I was trying to avoid that.

As to the scraping issue, you're quite right; the solution is a walled garden. Even so, the "big three" are fingerprinted so as to exclude non-SE access, as well as access to unauthorized bits (fortunately, the "big three" are quite tolerant now toward getting smacked with a 403).

JH

Apr 6, 2011, 7:42:50 PM
to Google App Engine
trolling?

On Apr 6, 3:51 am, Álvaro Degives-Más <adegives...@gmail.com> wrote:
> In my case, I have both; the API isn't one particularly apt for user
> authentication (i.e. I'm not looking at limiting users to a somehow
> pre-accredited group) but more for *validation*, in a more or less trivial

Sudhir Jonathan

Apr 7, 2011, 1:46:11 AM
to Google App Engine
Yeah, I think this is trolling.

I think the easiest and most efficient way to solve the problems he's
facing is to block all IPs that match the 0.0.0.0 mask, and also
blacklist all user agents that have length > 0. Should work like a
charm.

Sudhir

Kaan Soral

Apr 7, 2011, 3:29:45 PM
to Google App Engine
lol

jswap

Nov 13, 2014, 6:30:42 PM
to google-a...@googlegroups.com, adegi...@gmail.com
You can add these lines to your Apache virtual host:

RewriteEngine on
# Unanchored match: App Engine apps may add their own text to the User-Agent.
# next 2 lines block all requests from Google App Engine
RewriteCond %{HTTP_USER_AGENT}       AppEngine-Google
RewriteRule .* - [F,L,E=nolog:1]