So many spam bots are hitting my website hosted on Google App Engine


Ashutosh Mishra

Apr 20, 2015, 7:28:41 AM
to google-a...@googlegroups.com
Can anyone please help me solve this issue? So many spam bots are eating my bandwidth. My website, myhotelcar.com, is a travel portal for car rental and hotel booking; it is hosted on Google App Engine and developed in Java. Over the last few months bot traffic has risen to a very high level, and I am particularly annoyed by one spam bot, the Ahrefs bot.

I can't find any way to stop them, as I don't see an .htaccess file in Google App Engine, and I have tried everything I can think of with the robots.txt file, but I still can't get it right.

Please help me as soon as possible.



Barry Hunter

Apr 20, 2015, 8:24:42 AM
to google-appengine
On 20 April 2015 at 12:28, Ashutosh Mishra <ashutosh.n...@gmail.com> wrote:
particularly I am annoyed by the hitting of Spam Bot mainly Ahref bot.

The Ahrefs bot (if it's the legitimate one, of course!) definitely obeys robots.txt.

Looking at http://www.myhotelcar.com/robots.txt there is nothing blocking that particular bot.

But there are a number of other oddities. The crawl delay will only apply to the * group, which is disallowed from any crawling, meaning it has no effect.

The 'directories' rules will only apply to the group they are placed in (so they only affect MJ12bot/v1.4.5, which is blocked completely by the first rule).
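To make the grouping point concrete, here is a rough Java sketch of how rules attach to the User-agent group above them. It is a simplification under stated assumptions: the class name RobotsGroups, the substring matching and the sample file in main are all made up for illustration, and real crawlers follow more detailed matching rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Simplified robots.txt grouping: every rule belongs to the User-agent group above it. */
final class RobotsGroups {
    /** Maps each lower-cased User-agent value to the rule lines in its group. */
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean readingAgents = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // A User-agent line after a rule line starts a new group.
                if (!readingAgents) currentAgents = new ArrayList<>();
                readingAgents = true;
                String agent = value.toLowerCase(Locale.ROOT);
                currentAgents.add(agent);
                groups.putIfAbsent(agent, new ArrayList<String>());
            } else {
                readingAgents = false;
                // Disallow, Crawl-delay and so on apply only to the current group.
                for (String agent : currentAgents) {
                    groups.get(agent).add(field + ": " + value);
                }
            }
        }
        return groups;
    }

    /** Rules of the first named group matching the crawler, else the "*" group. */
    static List<String> rulesFor(Map<String, List<String>> groups, String crawlerName) {
        String name = crawlerName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            if (!e.getKey().equals("*") && name.contains(e.getKey())) return e.getValue();
        }
        return groups.getOrDefault("*", new ArrayList<String>());
    }

    public static void main(String[] args) {
        // Hypothetical robots.txt, not the file from the site discussed here.
        String txt = String.join("\n",
                "User-agent: *", "Disallow: /", "Crawl-delay: 10", "",
                "User-agent: MJ12bot", "Disallow: /", "Disallow: /admin/");
        Map<String, List<String>> groups = parse(txt);
        System.out.println(rulesFor(groups, "MJ12bot"));
        System.out.println(rulesFor(groups, "SomeOtherBot"));
    }
}
```

In this made-up file the Crawl-delay line sits in the * group, so a crawler with its own named group never sees it, and the extra Disallow line under MJ12bot affects nobody else.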



 

I don't find any way to stop them as I didn't see .HTaccess file in Google App Engine


Not as such. You would need to handle any such directives directly in code, i.e. your Java handlers could check the User-Agent and do 'stuff' selectively.
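A minimal sketch of such a User-Agent check in plain Java. The class name UserAgentBlocklist, the token list and the sample header strings are assumptions for illustration only; in a real handler the string would come from the request's User-Agent header.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Decides whether a User-Agent header names a crawler we want to turn away. */
final class UserAgentBlocklist {
    // Hypothetical blocklist: lower-case substrings to look for in the header.
    private static final List<String> BLOCKED_TOKENS = Arrays.asList("ahrefsbot", "mj12bot");

    /** Returns true when the header contains any blocked token, ignoring case. */
    static boolean isBlocked(String userAgent) {
        if (userAgent == null) {
            return false; // assumption: requests without the header are let through
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : BLOCKED_TOKENS) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative header values only.
        System.out.println(isBlocked("Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"));
        System.out.println(isBlocked("Mozilla/5.0 (Windows NT 6.1) Firefox/37.0"));
    }
}
```

Since the header is supplied by the client, this only catches bots that identify themselves honestly.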



There is also
but its utility for this is limited (unless you can identify specific IPs or ranges to block).

Vinny P

Apr 21, 2015, 12:19:30 AM
to google-a...@googlegroups.com
On Mon, Apr 20, 2015 at 7:23 AM, Barry Hunter <barryb...@gmail.com> wrote:
The Ahrefs bot (if it's the legitimate one, of course!) definitely obeys robots.txt.

Looking at http://www.myhotelcar.com/robots.txt there is nothing blocking that particular bot. 


+1.

You can also try looking into Cloudflare to proxy your site and filter out some robots.

More importantly: are the robots hitting all of your pages, or are they only hitting certain types of pages? Are they perhaps repeatedly retrieving a RSS feed or sitemap documents?
 
 
-----------------
-Vinny P
Technology & Media Consultant
Chicago, IL

App Engine Code Samples: http://www.learntogoogleit.com

Ashutosh Mishra

Apr 21, 2015, 12:32:12 AM
to google-a...@googlegroups.com
Dear Barry, as you can see in the robots.txt file of myhotelcar, I have blocked all bots except Google, Yahoo Slurp, etc., so if the Ahrefs bot obeys it, it should not hit the site.

I have also searched a lot and found that the Ahrefs bot doesn't obey robots.txt rules.
Many people have suggested that I can block them via an .htaccess file. I don't want to go that way, as I didn't find an .htaccess file in Google App Engine hosting. So please give me some way to filter out these spam bots.

Ashutosh Mishra

Apr 21, 2015, 12:36:15 AM
to google-a...@googlegroups.com
Thanks Vinny,

I think you have identified the issue correctly: they are regularly hitting a particular set of pages, the dynamically generated hotel pages, and you are correct about the RSS and sitemap feeds.
So please tell me how to overcome this issue, as these spam bots, especially the Ahrefs bot, are consuming a lot of my server bandwidth unnecessarily. I want a good solution so that I will not face any spam bot trouble in the future.

Vinny P

Apr 21, 2015, 12:37:28 PM
to google-a...@googlegroups.com
On Mon, Apr 20, 2015 at 11:32 PM, Ashutosh Mishra <ashutosh.n...@gmail.com> wrote:
I have also searched a lot and found that the Ahrefs bot doesn't obey robots.txt rules.
Many people have suggested that I can block them via an .htaccess file. I don't want to go that way, as I didn't find an .htaccess file in Google App Engine hosting. So please give me some way to filter out these spam bots.


The .htaccess file isn't supported in App Engine. 

If this is the real Ahrefs bot, it should support robots.txt. I looked in your robots.txt file: I see you disallowing Baidu and Yandex, plus a wildcard disallow, but nothing specifically for AhrefsBot. Try adding the following to your robots.txt file:

user-agent: AhrefsBot
disallow: /

According to the AhrefsBot page, you can also email them directly to ask them to stop; see https://ahrefs.com/robot


On Mon, Apr 20, 2015 at 11:36 PM, Ashutosh Mishra <ashutosh.n...@gmail.com> wrote:
I think you have identified the issue correctly: they are regularly hitting a particular set of pages, the dynamically generated hotel pages, and you are correct about the RSS and sitemap feeds.
So please tell me how to overcome this issue, as these spam bots, especially the Ahrefs bot, are consuming a lot of my server bandwidth unnecessarily. I want a good solution so that I will not face any spam bot trouble in the future.


This happens to a lot of websites with a large set of dynamically generated pages. 

Honestly the best solution would be to sign up for Cloudflare ( https://www.cloudflare.com/google ) and use their tools to help filter incoming traffic. You can also do what Barry suggested earlier, and start blocking the IPs that ahrefsbot is using. 

If you're willing to do some coding, you can write a filter into your application that checks the User-Agent and returns a 429 HTTP status code (Too Many Requests) if traffic is too high: http://tools.ietf.org/html/rfc6585#page-3
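A rough sketch of the counting side of such a filter, as a fixed-window limiter keyed by User-Agent. The class name CrawlRateLimiter and the limits are hypothetical. It also assumes in-memory state is acceptable: each App Engine instance would keep its own counters, and the map is never pruned, so treat it as a starting point only.

```java
import java.util.HashMap;
import java.util.Map;

/** Fixed-window request counter per client key (for example, the User-Agent string). */
final class CrawlRateLimiter {
    private final int maxRequestsPerWindow;
    private final long windowMillis;
    // key -> {window start in millis, requests seen in that window}
    private final Map<String, long[]> windows = new HashMap<>();

    CrawlRateLimiter(int maxRequestsPerWindow, long windowMillis) {
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.windowMillis = windowMillis;
    }

    /** Returns 429 once the key exceeds its quota in the current window, otherwise 200. */
    synchronized int statusFor(String key, long nowMillis) {
        long[] window = windows.get(key);
        if (window == null || nowMillis - window[0] >= windowMillis) {
            window = new long[] {nowMillis, 0}; // start a fresh window for this key
            windows.put(key, window);
        }
        window[1]++;
        return window[1] > maxRequestsPerWindow ? 429 : 200;
    }

    public static void main(String[] args) {
        // Hypothetical limit: 2 requests per minute per User-Agent.
        CrawlRateLimiter limiter = new CrawlRateLimiter(2, 60_000L);
        for (long t = 0; t < 4; t++) {
            System.out.println(limiter.statusFor("AhrefsBot", t * 1000));
        }
    }
}
```

RFC 6585 also allows a Retry-After header on the 429 response, which a well-behaved crawler can use to back off.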

Ashutosh Mishra

Apr 22, 2015, 10:02:20 AM
to google-a...@googlegroups.com
Hi Vinny, 

Thanks for your comment. I have made the changes you mentioned in the myhotelcar.com/robots.txt file, but the issue is still not resolved. As per my analysis, bot hits, specifically from the Ahrefs bot, have increased day by day, and the issue now seems critical.

Please help me get out of this situation. I will be happy to have your advice on this.

Ashutosh Mishra

Apr 22, 2015, 10:07:03 AM
to google-a...@googlegroups.com
Hi Barry,

Now you can see that I have updated the robots.txt file for myhotelcar.com to specifically block the Ahrefs bot, but the issue remains unresolved: bots are still hitting the site regularly and heavily. I need your expert advice to stop any kind of spam bots, or bots which don't obey robots.txt rules.
Any help will be appreciated.


On Monday, April 20, 2015 at 5:54:42 PM UTC+5:30, barryhunter wrote:

Barry Hunter

Apr 22, 2015, 10:09:43 AM
to google-appengine
Have you cross-checked the IP(s) of the bot?

The User-Agent is easily spoofed; it might be some other bot just pretending to be AhrefsBot.
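One way to do that cross-check, sketched in Java: compare the request's IP against address ranges you trust for that crawler (for instance, ranges its operator publishes, if it does). The class name Ipv4Range is made up, and the addresses below are reserved documentation ranges, not real crawler addresses.

```java
/** An IPv4 network in CIDR notation, used to test whether an address falls inside it. */
final class Ipv4Range {
    private final int network;
    private final int mask;

    /** Parses CIDR notation such as "192.0.2.0/24". */
    Ipv4Range(String cidr) {
        String[] parts = cidr.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected address/prefix: " + cidr);
        }
        int prefix = Integer.parseInt(parts[1]);
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("bad prefix length: " + cidr);
        }
        // A shift by 32 is a no-op for Java ints, so /0 needs its own case.
        mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        network = toInt(parts[0]) & mask;
    }

    /** True when the dotted-quad address lies inside this network. */
    boolean contains(String ip) {
        return (toInt(ip) & mask) == network;
    }

    /** Packs a dotted-quad IPv4 address into a 32-bit int. */
    private static int toInt(String ip) {
        String[] octets = ip.trim().split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        int value = 0;
        for (String octet : octets) {
            int n = Integer.parseInt(octet);
            if (n < 0 || n > 255) {
                throw new IllegalArgumentException("not an IPv4 address: " + ip);
            }
            value = (value << 8) | n;
        }
        return value;
    }

    public static void main(String[] args) {
        Ipv4Range trusted = new Ipv4Range("192.0.2.0/24"); // placeholder range
        System.out.println(trusted.contains("192.0.2.77"));
        System.out.println(trusted.contains("198.51.100.5"));
    }
}
```

A request that claims to be AhrefsBot but arrives from outside every trusted range is a candidate impostor. The same test can drive an IP-based block.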


Regardless, as already mentioned, you can put handlers in your code to 'trap' bad actors. Check the User-Agent and do something different. (You can't totally block them this way, but you can minimise the damage: make the responses very quick and short, and by not returning further links, stop them finding yet more pages to index.)

... or use an external service to 'firewall' such requests; as already mentioned, Cloudflare offers this.




--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-appengi...@googlegroups.com.
To post to this group, send email to google-a...@googlegroups.com.
Visit this group at http://groups.google.com/group/google-appengine.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/181d93e6-b9e8-40e6-8a24-d883a2e315f8%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Jeff Schnitzer

Apr 23, 2015, 2:13:08 AM
to Google App Engine
I'm calling bullshit.

You have a website developed on GAE/Java but you don't understand what .htaccess is and why it doesn't apply? If you're having problems with your website, why don't you ask the people who developed it? I don't get it. The advice you have been offered here (all of which is reasonable) requires more technical sophistication than you exhibit.

Possibly this is a bot doing normal things. Possibly this is a real DOS attack of some kind. Post some real information like IP addresses and the actual rate of requests, and maybe we can help you with an appropriate mitigation strategy.

You have said a bunch of technically dumb things with an accusatory tone of voice (spam bots are attacking me!). This happens a lot, and usually it means _you_ just screwed something up. If you want help, post more information and be less arrogant about it. You don't know what you think you know.

Jeff

Ashutosh Mishra

Apr 23, 2015, 2:39:33 AM
to google-a...@googlegroups.com, je...@infohazard.org
Hi Jeff,

Thanks for your harshly worded suggestion. Please have a look at the attached log file snapshot, where you can see the IP of the Ahrefs bot; it comes to my site regularly. I think this real information will be enough, so please let me know a concrete solution to overcome this problem.

Ashutosh Mishra

Apr 23, 2015, 2:42:13 AM
to google-a...@googlegroups.com, je...@infohazard.org
On Thursday, April 23, 2015 at 11:43:08 AM UTC+5:30, Jeff Schnitzer wrote:
ahref bot log.bmp

Ashutosh Mishra

Apr 23, 2015, 2:45:23 AM
to google-a...@googlegroups.com
Hi Barry,

Here are the bot details; please help me overcome the issue. I am really thankful for your help so far.
ahref bot log.JPG

Jeff Schnitzer

Apr 23, 2015, 12:32:37 PM
to Google App Engine
At what rate are these queries arriving? Are they all from the same IP address? Are they scanning your site or hitting one or two pages over and over? One request is not useful.

Assuming this is just a poorly behaved bot and not a DOS attack, the simplest solution is to install a servlet filter at the top of your stack. Inspect the request and, if you don't like something about it (IP address, user agent, etc.), return a blank page (or goatse, or whatever). Short of a real DDOS, this will convert your expensive 1100ms requests into almost-free <10ms requests and mitigate the issue.
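The short-circuit idea can be sketched without a servlet container. The names GateHandler and PageHandler are hypothetical stand-ins, not servlet API types; in a real app the same check would sit in a javax.servlet.Filter's doFilter method and return before chain.doFilter(...) is called.

```java
import java.util.function.BiPredicate;

/** Wraps an expensive handler and answers unwanted requests without calling it. */
final class GateHandler implements PageHandler {
    private final BiPredicate<String, String> unwanted; // tested with (ip, userAgent)
    private final PageHandler expensive;

    GateHandler(BiPredicate<String, String> unwanted, PageHandler expensive) {
        this.unwanted = unwanted;
        this.expensive = expensive;
    }

    @Override
    public String handle(String ip, String userAgent) {
        if (unwanted.test(ip, userAgent)) {
            // Blank page: nothing is rendered and there are no links to follow.
            return "";
        }
        return expensive.handle(ip, userAgent);
    }

    public static void main(String[] args) {
        // Stand-in for the slow, dynamically generated page.
        PageHandler page = (ip, ua) -> "<html>hotel page</html>";
        PageHandler gate = new GateHandler(
                (ip, ua) -> ua != null && ua.contains("AhrefsBot"), page);
        System.out.println(gate.handle("192.0.2.10", "AhrefsBot/5.0").length());
        System.out.println(gate.handle("192.0.2.11", "Mozilla/5.0").length());
    }
}

/** A request handler reduced to the two request fields discussed here. */
interface PageHandler {
    String handle(String ip, String userAgent);
}
```

The saving comes from the expensive handler never running for rejected requests. The check itself is a string comparison.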

Jeff

Ashutosh Mishra

Apr 23, 2015, 2:52:26 PM
to google-a...@googlegroups.com, je...@infohazard.org
Hi Jeff,

Thanks for your solution, but I cannot go with a filter, as a filter will also increase cost, and we are doing this only to reduce cost.

Please suggest some way to resolve this from an SEO perspective. I have been monitoring this bot: its user agent name remains the same, and so does its IP for around a week, after which the IP changes. Sometimes its IP shows a location in France, sometimes Brazil and sometimes the USA, so it is really difficult to block it on a per-country basis, which I have already tried.
Now that I have increased the number of pages on the website, its crawl rate has also increased sharply: it hits the site about 10 times a minute. This is really causing a loss, as my server hosting cost has increased because of it.

So a filter is not an option.

Barry Hunter

Apr 23, 2015, 3:19:13 PM
to google-appengine


Thanks for your solution, but I cannot go with a filter, as a filter will also increase cost

How so? 

Do you mean the developer time to make the filter? 

Ashutosh Mishra

Apr 29, 2015, 6:41:43 AM
to google-a...@googlegroups.com
Hi Barry,

I have done as you suggested: I asked my development team to create a filter, which they knew how to do, and they have done it, but I still haven't seen much difference in the Ahrefs bot issue. In the attached image you can see that the Ahrefs bot is still hitting the server at short intervals.
Please let me know your expert advice on this, and please help me deal with these spam bots.
ahref bot issue.JPG

Barry Hunter

Apr 29, 2015, 8:37:34 AM
to google-appengine
It's not going to stop the requests (but long term it will probably cut them down).


... it's just that now you seem to be 'taking' only 62ms to return a 302.

Previously you were using 1152ms to return an actual page.

You've saved resources and bandwidth.


(Assuming they don't just automatically follow the 302 and then use more resources at the destination.)



Or you could perhaps tarpit, but it's risky.



Jay Kyburz

May 10, 2015, 8:56:17 PM
to google-a...@googlegroups.com
Why did nobody suggest the DoS protection service?



Vinny P

May 11, 2015, 1:29:16 AM
to google-a...@googlegroups.com
On Sun, May 10, 2015 at 7:56 PM, Jay Kyburz <j...@jaykyburz.com> wrote:
Why did nobody suggest the DoS protection service?




The DOS protection service can help, but it's not as easy as proxying through Cloudflare. 

Also it can be difficult to manage - with the DOS protection service you have to name individual IPs/blocks of IPs to ban. In one of the posts above you can see the IP "188.165.15.95" originating these requests. That IP belongs to OVH, a fairly large web hosting firm. You either have to list out all the harassing IPs in the blacklist, or ban address blocks which may accidentally interfere with legitimate requests from OVH-rented servers.