I run www.wallsofthecity.net, and my hosting company (DreamHost)
recently brought to my attention the fact that my site has been
receiving sufficient numbers of page requests to crash the shared
server I am on. They sent me the below information by way of
evidence:
Now, to me, that does not mean a whole lot, aside from Googlebots
being very, *very* interested in my site.
Given that DreamHost's solution is to simply block all Googlebots
forever, I am looking for an alternative fix. What would/could be
causing the bots to go this spastic? Is there something on my site?
I have already logged into Webmaster Tools and requested a slower
indexing rate, but will that actually accomplish anything? Thanks for
whatever assistance you can provide.
I do not understand how those figures prove anything, but let's work
on the assumption that googlebot is spending a lot of time crawling
your site and you want to know what might be contributing to this.
Your server is giving incorrect responses for some requests.
If your server repeatedly says 301 and then 200 for ALL AND ANY
nonexistent URLs, including perhaps ones that DID previously exist but
have been deleted, you are making it very hard for google to
"understand" your site and to keep the indexes up to date.
There are probably other issues, but I hope this is a useful start.
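For anyone who wants to repeat that kind of check, here is a minimal sketch in Python. It requests a made-up URL without automatically following redirects and records each status code along the way. The function names and the hop limit are my own choices for illustration, not anything DreamHost or Google provides.

```python
import http.client
from urllib.parse import urlsplit, urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)

def fetch_status_chain(url, max_hops=5):
    """Request url, following redirects by hand; return [(status, url), ...]."""
    chain = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("GET", path)
            resp = conn.getresponse()
            status = resp.status
            location = resp.getheader("Location")
        finally:
            conn.close()
        chain.append((status, url))
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            break
    return chain

def looks_like_soft_404(chain):
    """True if a request for a URL that should not exist ends in 200."""
    return bool(chain) and chain[-1][0] == 200
```

Calling fetch_status_chain() on an address that certainly does not exist (any random string of letters after the host name will do) should end in a 404 or 410 on a correctly configured server. If looks_like_soft_404() returns True for such a URL, the server is answering "success" for pages that are not there.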
Linoge wrote:
Well, I guess that just goes to show what I know, then :). I just
copy-pasted the information my hosting company sent me - I honestly do
not understand a great deal of it.
Thanks for the information, though - that was honestly educational and
useful. At this point, I think I will just keep the Googlebot out of
the "tdzkwiki" section, and then let it back into the root domain...
that should not cause any problems as long as things keep working.
Thanks again!
> On Jul 29, 4:44 pm, Robbo wrote:
> > I do not understand how those figures prove anything, but let's work
> > on the assumption that googlebot is spending a lot of time crawling
> > your site and you want to know what might be contributing to this.
> > Your server is giving incorrect responses for some requests.
> > But if I do the same with your tdzkwiki subdomain,
> > http://tdzkwiki.wallsofthecity.net/zxzxzxzxzx
> > the response given is false: it says 301 Moved permanently to:
> > http://tdzkwiki.wallsofthecity.net/Zxzxzxzxzx
> > (note the uppercase Z)
> > and requesting that URL (with the uppercase Z) gets a 200 success
> > response from your server, which is obviously NOT right.
> > If your server repeatedly says 301 and then 200 for ALL AND ANY
> > nonexistent URLs, including perhaps ones that DID previously exist but
> > have been deleted, you are making it very hard for google to
> > "understand" your site and to keep the indexes up to date.
> > There are probably other issues, but I hope this is a useful start.
It does look like that wiki section of your site has a lot of unique
URLs which can be crawled. I imagine this could put a bit of load on
your server. Perhaps the simplest solution would be to disallow
crawling of that subdomain using a "disallow" robots.txt directive.
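As a rough illustration of that suggestion: robots.txt applies per host name, so the file would have to be served from the root of the wiki subdomain itself rather than from the main site. The sketch below holds one possible file as a string and checks it with Python's standard urllib.robotparser. The rules shown are an assumption about what the site owner might want, not a copy of any real file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents for http://tdzkwiki.wallsofthecity.net/robots.txt
# This keeps Googlebot out of the whole subdomain; other crawlers are unaffected.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

def googlebot_may_fetch(robots_txt, url):
    """Return True if the given robots.txt text lets Googlebot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)

if __name__ == "__main__":
    print(googlebot_may_fetch(ROBOTS_TXT, "http://tdzkwiki.wallsofthecity.net/Main_Page"))
```

The parser only checks the rules offline. It says nothing about how quickly a crawler will pick up a changed file.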
Will GoogleBots actually obey that? I know there is evidence to
support that they do not obey Crawl-Delay (something else I have
implemented in an attempt to make things run better), but I am not
sure about other aspects.
You're right, we do not use the "crawl-delay" robots.txt directive.
However, we will honor any "disallow" directives that you have in
there. A disallow blocks crawlers from accessing the URLs, but it does
not block indexing of them, so they might still remain in the index
for a while regardless.
If you feel that your wiki has valuable content that you would like
indexed, it might be worth the trouble to work out a list of the
specific kinds of URLs that you would like disallowed. That way, the
URLs where you have unique and compelling
content can still be crawled and indexed. You will generally be able
to spot patterns when you look at your server log files. You can use
those patterns to create disallow directives for your robots.txt
file.
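One way to go looking for those patterns, sketched in Python: count Googlebot requests per URL "shape", where the shape is the path plus the names (not the values) of any query parameters. This assumes the common Apache-style combined log format. The sample lines, IP addresses, and dates below are made up for illustration and are not taken from the real logs.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def googlebot_url_shapes(log_lines):
    """Count Googlebot requests grouped by path and query parameter names."""
    counts = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        try:
            # The request is the first quoted field: 'GET /path?query HTTP/1.1'
            request = line.split('"')[1]
            target = request.split()[1]
        except IndexError:
            continue  # skip lines that are not in the expected format
        parts = urlsplit(target)
        shape = parts.path
        names = [name for name, _ in parse_qsl(parts.query, keep_blank_values=True)]
        if names:
            shape += "?" + "&".join(name + "=" for name in names)
        counts[shape] += 1
    return counts

# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jan/2000:00:00:01 +0000] "GET /index.php?title=Foo&action=edit HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [01/Jan/2000:00:00:02 +0000] "GET /index.php?title=Bar&action=edit HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [01/Jan/2000:00:00:03 +0000] "GET /Main_Page HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [01/Jan/2000:00:00:04 +0000] "GET /Main_Page HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (some browser)"',
]

if __name__ == "__main__":
    for shape, hits in googlebot_url_shapes(SAMPLE_LOG).most_common():
        print(hits, shape)
```

Shapes that account for a large share of the hits but carry no content worth indexing are the natural candidates for disallow lines. Which ones those are depends entirely on the real logs.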