In comp.infosystems.www.misc, Ivan Shmakov <iv...@siamics.net> wrote:
> >>>>> jdall...@yahoo.com writes:
> [Cross-posting to news:comp.infosystems.www.misc as I feel that
> this question has more to do with Web than HTML per se.]
:^)
> > I have a website organized as a large number (> 200,000) of pages.
> > It is hosted by a large Internet hosting company.
...
> > My users may click to 10 or 20 pages in a session. But the indexing
> > bots want to read all 200,000+ pages! My host has now complained
> > that the site is under "bot attack" and has asked me to check my own
> > laptop for viruses!
200k pages isn't that huge, and if they're static files on disk, as
described in a snipped-out part, they shouldn't be that hard to serve.
Bandwidth may be an issue, depending on how you are being charged. And
on a shared system, which I think you might be on, your options for
optimizing for massive numbers of static files might be limited.
> > I'm happy anyway to reduce the bot activity. I don't mind having my
> > site indexed, but once or twice a year would be enough!
Some of the better search engines will gladly consult site map files
that give hints about what needs reindexing. See:
https://www.sitemaps.org/protocol.html
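A minimal sitemap entry looks like the sketch below (the URL and date
are placeholders, and <changefreq> is only a hint that bots may
ignore). Note the protocol caps a single sitemap file at 50,000 URLs,
so a 200,000-page site needs a sitemap index file pointing at several
of them:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/some-page.html</loc>
        <lastmod>2018-06-01</lastmod>
        <changefreq>yearly</changefreq>
      </url>
    </urlset>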
> > I see that there is a way to stop the Google Bot specifically. I'd
> > love it if I could do the opposite -- have *only* Google index my
> > site.
> JFTR, I personally (as well as many other users who value their
> privacy) refrain from using Google Search and rely on, say,
> https://duckduckgo.com/ instead.
Yeah, Google only is an "all your eggs in one basket" route. I, too,
have been using DDG almost exclusively for several years.
> > A technician at the hosting company wrote to me
> >> As per the above logs and hitting IP addresses, we have blocked the
> >> 46.229.168.* IP range to prevent the further abuse and advice you to
> >> also check incoming traffic and block such IP's in future.
46.229.168.0-46.229.168.255 is:
netname: ADVANCEDHOSTERS-NET
Can't say I've heard of them.
> >> We have also blocked the bots by adding the following entry
> >> in robots.txt:-
> >> User-agent: AhrefsBot
Yes, block them. Not a search engine, but a commercial SEO service.
https://ahrefs.com/robot
> >> User-agent: MJ12bot
Eh, maybe block, maybe not. Seems to be a real search engine.
http://mj12bot.com/
> >> User-agent: SemrushBot
Yes, block them. Not a search engine, but a commercial SEO service.
https://www.semrush.com/bot/
> >> User-agent: YandexBot
Real Russian search engine.
https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml
> >> User-agent: Linguee Bot
Real service, but of dubious value to a webmaster.
http://www.botreports.com/user-agent/linguee-bot.shtml
All bots can be impersonated by other bots, so you can't be sure the
User-Agent: header reflects a bot's real identity. You can spend a lot
of time researching bots and the characteristics of real bot usage,
e.g. the hostnames or IP address ranges of legitimate bot servers.
Given the little I've seen here, I wonder if you have someone at
Advanced Hosters impersonating bots to suck your site down.
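If you want to automate that identity check, the test the big search
engines themselves document (Google, Bing, Yandex all describe it) is
forward-confirmed reverse DNS: PTR-resolve the visiting IP, check the
hostname lands in a domain the operator publishes, then resolve the
hostname forward and confirm it maps back to the same IP. A minimal
Python sketch; the IP and domain suffixes below are illustrative:

    #!/usr/bin/env python3
    # Forward-confirmed reverse DNS check for a claimed crawler IP.
    import socket

    def verify_bot(ip, suffixes):
        try:
            host = socket.gethostbyaddr(ip)[0]        # PTR lookup
        except OSError:
            return False
        if not host.endswith(suffixes):               # operator's domain?
            return False
        try:
            addrs = socket.gethostbyname_ex(host)[2]  # forward A lookup
        except OSError:
            return False
        return ip in addrs                            # maps back to same IP?

    # e.g. a claimed Googlebot should resolve into googlebot.com or
    # google.com, per Google's own verification docs.
    print(verify_bot("66.249.66.1", (".googlebot.com", ".google.com")))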
> As long as the troublesome bots honor robots.txt (there're those
> that do not; but then, the above won't work on them, either),
> a more sane solution would be to limit the /rate/ the bots
> request your pages for indexing, like:
>
> ### robots.txt
>
> ### Data:
>
> ## Request that the bots wait at least 3 seconds between requests.
> User-agent: *
> Crawl-delay: 3
>
> ### robots.txt ends here
Except for Linguee, I think all of the bots listed above are
well-behaved and will obey robots.txt, but I don't know whether they
all recognize Crawl-delay. Some of them explicitly state that they do,
however.
> This way, the bots will still scan all your 2e5 pages, but their
> accessess will be spread over about a week -- which (I hope)
> will be well within "acceptable use limits" of your hosting
> company.
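Checking the arithmetic: 200,000 requests at one per 3 seconds is
600,000 seconds, a bit under 7 days. So yes, about a week per full
crawl, per bot.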
The only bot I've ever had to blacklist was an MSN bot that, a few
years ago, absolutely refused to stop hitting one page over and over
again. I used a server directive to shunt that one bot to 403
Forbidden errors.
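For anyone wanting to do the same: assuming Apache with mod_rewrite,
that kind of block is about three lines (the "msnbot" pattern here is
just illustrative; substitute whatever is hitting you):

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} msnbot [NC]
    RewriteRule ^ - [F]

The [F] flag answers every matching request with 403 Forbidden.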
Elijah
------
stopped worrying about bots a long time ago