80legs spider is abusive


Scott G

Nov 18, 2010, 4:14:28 PM
to SANS Internet Storm Center / DShield
I have recently run into a new web bot that isn't obeying robots.txt
and is very abusive. Its actions feel like an attack, and even their
website claims you can't stop them.

See http://www.80legs.com/webcrawler.html

None of my rewrite rules have stopped them either, and when this bot
hits, it comes in 5-minute bursts from about a thousand IPs at once.

They say...

Blocking our web crawler by IP address will not work. Due to the
distributed nature of our infrastructure, we have thousands of
constantly changing IP addresses. We strongly recommend you don't try
to block our web crawler by IP address, as you'll most likely spend
several hours of futile effort and be in a very bad mood at the end of
it.

This type of crawling is just bad news. Regardless of what they say,
the bot does NOT obey robots.txt; in fact, it has never even requested
the file from the server it is attacking.

Has anyone else had the same problem with them?

Johannes B. Ullrich

Nov 18, 2010, 4:36:13 PM
to iscds...@googlegroups.com

The description sure sounds abusive, in particular if it does not obey
robots.txt. Most crawlers can't be blocked easily by IP address, since
most of them use some form of distributed crawler network, but they
should obey robots.txt.

Can you block it by user agent?
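Something along these lines in .htaccess might work as a first cut (an
untested sketch; it assumes mod_rewrite is enabled and that the bot's
User-Agent string contains "008" or "80legs", which is what it
advertises elsewhere in this thread):

# Untested sketch: return 403 for any request whose User-Agent mentions the crawler
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (008|80legs) [NC]
RewriteRule .* - [F,L]

(A bare "008" could in principle match other version strings, so
tighten the pattern if needed.)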


Johannes Ullrich
jull...@euclidian.com
(757) 726 7528



d...@sucuri.net

Nov 18, 2010, 4:39:44 PM
to iscds...@googlegroups.com
They say on the page:

"
If you block 008 using robots.txt, you will see crawl requests die
down gradually, rather than immediately. This happens because of our
distributed architecture. Our computers only periodically receive
robots.txt information for domains they are crawling."

So they should stop in a few...

thanks,

Brad Morgan

Nov 18, 2010, 5:02:04 PM
to iscds...@googlegroups.com
> I have recently run into a new web bot that isn't obeying robots.txt
> and is very abusive. Its actions feel like an attack, and even their
> website claims you can't stop them.
>
> See http://www.80legs.com/webcrawler.html

Scott,

I think you selectively edited the contents of their website a little bit
too much to bias your case...

They actually say...

----------

Blocking our web crawler by IP address will not work. Due to the
distributed nature of our infrastructure, we have thousands of
constantly changing IP addresses. We strongly recommend you don't try
to block our web crawler by IP address, as you'll most likely spend
several hours of futile effort and be in a very bad mood at the end of
it. You really should just include us in your robots.txt or contact us
directly.

If you feel that 008 is crawling your website too quickly, please let
us know what an appropriate crawl rate is. If you'd like us to stop
crawling your website, the best thing to do is to block our web
crawler using the robots.txt specification. To do this, add the
following to your robots.txt:

User-agent: 008
Disallow: /

If you block 008 using robots.txt, you will see crawl requests die
down gradually, rather than immediately. This happens because of our
distributed architecture. Our computers only periodically receive
robots.txt information for domains they are crawling.

----------

It appears to me that if you have a robots.txt they will abide by it,
and if they don't, they provide contact links so you can let them know.

Regards,

Brad


Scott G

Nov 18, 2010, 5:58:00 PM
to SANS Internet Storm Center / DShield
I have tried blocking them in robots.txt, and no, I am not being
selective. I am quoting exactly what they say.

You should see the slam of IPs that comes in every minute here: over
100 IPs all hammering the server at once. That is abusive.


On Nov 18, 2:02 pm, "Brad Morgan" <b-mor...@concentric.net> wrote:
> > I have recently run into a new web bot that isn't obeying robots.txt
> > and is very abusive. Its actions feel like an attack, and even their
> > website claims you can't stop them.
> >
> > See http://www.80legs.com/webcrawler.html

Tom Byrnes

Nov 18, 2010, 10:28:57 PM
to iscds...@googlegroups.com

[Tomas L. Byrnes] Brad, Scott has tried the robots option, as have many
others who post about this net abuser, and it doesn't work.

Does anyone have a feed of their nodes that ThreatSTOP can publish?

Tom Byrnes

Nov 18, 2010, 10:31:06 PM
to iscds...@googlegroups.com

> -----Original Message-----
> From: iscds...@googlegroups.com [mailto:iscds...@googlegroups.com]

> On Behalf Of d...@sucuri.net
> Sent: Thursday, November 18, 2010 1:40 PM
> To: iscds...@googlegroups.com
> Subject: Re: [dshield] 80legs spider is abusive
>
> They say on the page:
>
> "If you block 008 using robots.txt, you will see crawl requests die
> down gradually, rather than immediately. This happens because of our
> distributed architecture. Our computers only periodically receive
> robots.txt information for domains they are crawling."
>
> So they should stop in a few...
>
> thanks,
>

[Tomas L. Byrnes]
They lie:

http://www.wxforum.net/index.php?topic=7623.0;wap2

Anirban Banerjee

Nov 19, 2010, 12:03:05 PM
to iscds...@googlegroups.com
On Thu, Nov 18, 2010 at 7:31 PM, Tom Byrnes <to...@threatstop.com> wrote:


> [Tomas L. Byrnes]
> They lie:
>
> http://www.wxforum.net/index.php?topic=7623.0;wap2

Yeah, I've heard similar complaints from a few friends too. They ended
up trying to contact these guys.

--
 Anirban Banerjee
www.Stopthehacker.com 

Jim McCullough

Nov 19, 2010, 2:46:38 PM
to iscds...@googlegroups.com
Several other friends have seen this and requested help with trying to
get the crawl rate slowed. The end result was dropping traffic from 008.

Kane

Jan 6, 2011, 4:59:06 PM
to SANS Internet Storm Center / DShield
I had the same problem with this bot... it follows NO RULES, and it
comes across as a DoS attack, hitting better than half the domains on
my servers at once. Their so-called IPs... anyone else notice they use
"proxy" machines to crawl? Use your real IP(s) and take the same
bandwidth hits you're giving out! I can't believe people pay them to
leech free services for the gain! That's wrong! Google crawls my
site(s) many times and never uses 5GB a day! Anyway, here's my 2 cents:

Add these to your iptables rules and you'll see an 80-90% drop from
those f@?ks:

174.54.111.85 - 80legs.com 008/0.83
71.181.175.105 - 80legs.com
24.18.8.163 - 80legs.com
24.19.89.18 - 80legs.com
24.44.234.217 - 80legs.com
24.187.107.132 - 80legs.com
64.125.222.16 - 80legs.com
65.13.141.212 - 80legs.com
66.190.29.156 - 80legs.com
67.82.88.63 - 80legs.com
67.183.212.244 - 80legs.com
68.42.74.178 - 80legs.com
68.52.200.151 - 80legs.com
68.115.33.121 - 80legs.com
68.190.211.171 - 80legs.com
69.117.76.2 - 80legs.com
69.119.12.50 - 80legs.com
70.179.33.244 - 80legs.com
71.10.43.61 - 80legs.com
71.95.242.112 - 80legs.com
71.123.250.93 - 80legs.com
71.201.46.85 - 80legs.com
71.228.35.123 - 80legs.com
72.148.198.78 - 80legs.com
72.209.59.230 - 80legs.com
74.197.199.46 - 80legs.com
74.233.24.171 - 80legs.com
76.123.32.35 - 80legs.com
96.41.213.173 - 80legs.com
96.246.10.213 - 80legs.com
97.90.150.201 - 80legs.com
98.162.212.174 - 80legs.com
98.191.208.70 - 80legs.com
98.193.134.251 - 80legs.com
98.197.91.159 - 80legs.com
98.223.153.8 - 80legs.com
98.247.184.76 - 80legs.com
98.248.147.170 - 80legs.com
99.39.169.73 - 80legs.com
99.178.168.87 - 80legs.com
173.27.202.235 - 80legs.com
76.25.203.7 - 80legs.com
69.129.6.218 - 80legs.com
173.67.158.254 - 80legs.com
76.18.80.77 - 80legs.com
71.192.156.99 - 80legs.com
97.82.181.52 - 80legs.com
76.24.94.195 - 80legs.com
12.118.188.126 - 80legs.com
68.33.35.202 - 80legs.com
69.137.77.192 - 80legs.com
68.98.42.195 - 80legs.com
68.10.98.11 - 80legs.com
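For what it's worth, one way to feed that list into iptables is via an
ipset instead of one rule per address. A sketch only: the set name
"crawler80legs" and the path /etc/80legs-ips.txt are made up for the
example, with the addresses above saved in that file one entry per line:

# Build an ipset from the address list (lines look like "174.54.111.85 - 80legs.com")
ipset create crawler80legs hash:ip -exist
while read -r ip _; do
    ipset add crawler80legs "$ip" -exist
done < /etc/80legs-ips.txt
# Drop everything sourced from the set with a single iptables rule
iptables -I INPUT -m set --match-set crawler80legs src -j DROP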

or even try this, to bounce their requests back at their own site:

# Tell 80legs to share its bandwidth: 301 the bot back to 80legs.com
RewriteCond %{HTTP_USER_AGENT} 80legs [NC]
RewriteRule (.*) http://www.80legs.com/$1 [R=301,L]




On Nov 19 2010, 1:46 pm, Jim McCullough <jim.mccullo...@gmail.com>
wrote:
> Several other friends have seen this and requested help with trying to
> get the crawl rate slowed. The end result was dropping traffic from 008.

Adam Corbally

May 6, 2013, 3:52:43 AM
to iscds...@googlegroups.com

Found an entry that can be added to the site's .htaccess file to make every request from the bot throw back a 403 error. Thought it might be useful should anyone stumble across this post again.

.htaccess:
# Flag any request whose User-Agent header contains "80legs"...
SetEnvIfNoCase ^User-Agent$ .*(80legs) HTTP_SAFE_BADBOT
# ...and return 403 Forbidden for flagged requests
Deny from env=HTTP_SAFE_BADBOT
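Note that "Deny from env=" is the old Apache 2.2-style syntax. On
Apache 2.4 the equivalent would be something like this sketch, assuming
mod_setenvif and mod_authz_core are loaded:

# Flag the bot as before, then deny flagged requests with mod_authz_core
SetEnvIfNoCase ^User-Agent$ .*(80legs) HTTP_SAFE_BADBOT
<RequireAll>
    Require all granted
    Require not env HTTP_SAFE_BADBOT
</RequireAll>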