Discussions > Crawling, indexing, and ranking > Googlebot crawling *way* too fast?
Tom Raymond  
From: Tom Raymond
Date: Wed, 27 Aug 2008 08:42:46 -0700 (PDT)
Local: Wed, Aug 27 2008 11:42 am
Subject: Googlebot crawling *way* too fast?
Yesterday (while I was out sick, naturally), our web site
(http://www.fullcompass.com/) was brought to a near-stop by Googlebot
crawling it, requesting page after page non-stop and maxing out
our web server's capacity.

My questions are:

#1 - How can I find out whether this was someone malicious spoofing
Googlebot?

#2 - If this was indeed Googlebot, what can I do to prevent it from
happening again? (I've already decreased the crawl speed from 'normal'
to 'slow' via Webmaster Tools.)

Thanks in advance!


 
Tim Abracadabra  
From: Tim Abracadabra
Date: Wed, 27 Aug 2008 09:17:35 -0700 (PDT)
Local: Wed, Aug 27 2008 12:17 pm
Subject: Re: Googlebot crawling *way* too fast?
Hi Tom,

> #1 - how can I find out if this was someone malicious spoofing as
> Googlebot?

One tedious method is to check the server logs for the IP address
that claimed to be Googlebot and do a reverse DNS lookup on it.

If you can use dig, try:
dig -x [IP Address] +noadditional +noquestion +nocomments +nocmd +nostats
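
If dig is not handy, the same check can be scripted. Below is a rough Python sketch, not an official tool: it does the reverse lookup, checks the hostname suffix, then does a forward lookup to confirm the name maps back to the same IP. The function name and the suffix list (googlebot.com, google.com) are my own assumptions based on Google's published verification guidance, and the forward lookup used here only covers IPv4.

```python
import socket

# Assumption: genuine crawler hostnames end in one of these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip.

    The resolver functions are parameters so the logic can be exercised
    without touching the network.
    """
    try:
        host = reverse(ip)[0]          # PTR lookup, like `dig -x`
    except OSError:                    # no PTR record or lookup failure
        return False
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False                   # hostname is not in a Google domain
    try:
        addresses = forward(host)[2]   # forward lookup of that hostname (IPv4)
    except OSError:
        return False
    return ip in addresses             # must map back to the original IP
```

Calling is_real_googlebot() with an IP address taken from your access log performs live DNS queries, so run it on a handful of distinct IPs rather than on every log line.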

Also, check out this post from Google:
http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-goog...

Another thing that might be at issue is your site navigation
and features. You want to ensure that only one URL is available
for each page. Block the others using robots.txt or
the robots meta tag.
http://www.google.com/support/webmasters/bin/answer.py?answer=61050

You also might consider crawling your site with a tool like
Xenu Link Sleuth. Google it; the utility is a free download.
It will crawl your whole site and provide a report.
You might find duplicate URLs or spider traps that need addressing.
(Note: in the options, set 4 or fewer parallel threads to minimize
site load.)

I'd run it myself but I don't want to load your server.
You might want to run it during slow traffic periods.

Hope that helps,
Abracadabra
cristina  
From: cristina
Date: Wed, 27 Aug 2008 10:26:16 -0700 (PDT)
Local: Wed, Aug 27 2008 1:26 pm
Subject: Re: Googlebot crawling *way* too fast?
Abracadabra is right.
I think you will have to check the server access logs or AWStats :(
to see whether there was a loop, multiple redirects, server errors
returning HTTP status 200 (OK) instead of 5xx, some very large files
being downloaded, etc.

Also check the query strings of the URLs that Googlebot accessed,
in case there were quasi-endless combinations of param=value pairs
reached from links or from deep crawling of HTML forms.
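
A quick way to spot such combinations is to group the crawler's requests by path and count how many distinct query strings each path received. The Python sketch below assumes the Apache/Nginx "combined" log format and picks out the crawler simply by the "Googlebot" substring in the line; both are assumptions, and the function names are mine.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumes the "combined" log format; adjust the pattern for other formats.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) [^"]*"')


def query_variants_per_path(log_lines, agent_marker="Googlebot"):
    """Map each path to the set of distinct query strings the crawler requested."""
    variants = defaultdict(set)
    for line in log_lines:
        if agent_marker not in line:
            continue                      # not a request claiming to be the crawler
        match = REQUEST_RE.search(line)
        if not match:
            continue                      # line does not look like a request
        parts = urlsplit(match.group(1))
        if parts.query:
            variants[parts.path].add(parts.query)
    return variants


def worst_offenders(variants, top=10):
    """Return (path, distinct_query_count) pairs, largest first."""
    counts = [(path, len(queries)) for path, queries in variants.items()]
    return sorted(counts, key=lambda item: item[1], reverse=True)[:top]
```

Passing an open log file to query_variants_per_path() and then printing worst_offenders() should show which paths have far more parameter combinations than the site has real pages.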

There are some error URLs indexed from your site, see
http://www.google.com/search?q=site:fullcompass.com+intitle:an+error+...
Could they have been indexed because a redirect on a server error
ended in HTTP status 200 (OK) instead of 5xx?
Check that your server correctly returns HTTP status 500 on an
internal server error and 404 (Not Found) for all page-not-found URLs,
and add a meta noindex tag to those error pages anyway.

Also, just to mention that the rule
Crawl-Delay: 10
in your robots.txt file is ignored by Googlebot, see
http://www.google.com/support/webmasters/bin/answer.py?answer=35239

Cristina.

JohnMu Google employee  
From: JohnMu
Date: Wed, 27 Aug 2008 15:50:20 -0700 (PDT)
Local: Wed, Aug 27 2008 6:50 pm
Subject: Re: Googlebot crawling *way* too fast?
Hi Tom and welcome to the groups (and I hope you can stick around for
a while :-))!

Tim and Cristina have already given you some great advice, thanks! I
just want to add some general comments:

Situations like this (provided it really is the Googlebot -- you can
check with http://www.google.com/support/webmasters/bin/answer.py?answer=80553)
are generally caused by having a large number of URLs that can be
crawled. Sooner or later, the Googlebot will take you up on it and try
to crawl them :-). The best solution for that is to limit the crawling
of the site to just the desired URLs using a robots.txt file.

In addition to just looking at the URLs crawled, I would take into
consideration how "expensive" they are for your server. Which URLs
require a lot of processing? Which ones can be served without any work
at all? The ones that will cause you trouble are often the ones that
take the most time to get served. In general, on sites like yours,
these will be search pages. I would definitely make sure that all such
search pages are blocked from crawling through robots.txt directives,
especially when you're sure that we'll be able to find all products
naturally through your navigation anyway. In some cases, it's possible
that you'll already be set at this point -- I'd still look at the rest
however and be prepared for the future. Even if you don't have links
to search pages, it's possible that we will try to find new URLs
through them ourselves: http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-h...
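
Before deploying robots.txt rules like these, it can help to test them against a few sample URLs. The Python sketch below uses the standard library's urllib.robotparser; the /search and /cart paths and the example.com URLs are placeholders, not the site's real layout. Note that this parser only does plain prefix matching, so it will not evaluate wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /search and /cart are placeholder paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart
"""


def is_crawlable(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules from text, no network
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for sample in ("http://www.example.com/search?q=amp",
                   "http://www.example.com/products/123"):
        print(sample, is_crawlable(ROBOTS_TXT, sample))
```

Checking both a URL that should be blocked and a product URL that should stay crawlable guards against a rule that is accidentally too broad.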

Next up, I would take a look at the number of unique URLs that were
crawled (you can usually get this information from your server logs)
and compare that number to the number of products that you actually
have. As a guideline, I would suggest that you have no more than 2-3x
the number of URLs as you have products (note: I just made that number
up :-)). If you see a significantly higher number of URLs than that, I
would suggest that you try to work out why this might be happening.
Perhaps you have ratings on your product pages that create unique URLs
(5 stars = 5x the number of URLs, not to mention that the ratings will
be wrong), perhaps you have feedback pages, email to a friend pages,
add-to-basket URLs, add-to-wishlist URLs, etc. All of these kinds of
URLs will inflate the number of known URLs without providing more
value to users when they're listed in search results, and that in turn
will give us more URLs that we could try to crawl.
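
The comparison above is simple arithmetic once the crawled URLs have been pulled out of the logs. As a sketch in Python, assuming you already have the crawled URLs as a list of strings and know your product count, the following reports the ratio and breaks the unique URLs down by their first path segment to show where the inflation comes from. The function names and the idea of grouping by first path segment are my own illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_inflation(crawled_urls, product_count):
    """Return (unique_url_count, unique URLs per product)."""
    if product_count <= 0:
        raise ValueError("product_count must be positive")
    unique = len(set(crawled_urls))
    return unique, unique / product_count


def unique_urls_by_section(crawled_urls):
    """Count unique URLs under each first path segment, e.g. /rate or /wishlist."""
    sections = Counter()
    for url in set(crawled_urls):            # de-duplicate before counting
        first = urlsplit(url).path.strip("/").split("/")[0]
        sections["/" + first] += 1
    return sections
```

A ratio well above the rough 2-3x guideline, together with a section such as a ratings or wishlist path dominating the breakdown, points at the URLs worth blocking or consolidating first.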

As a next step, I would look at temporary URLs: URLs that contain
tracking or other temporary information (like session data). In
general, the Googlebot does not have to crawl and index these
URLs -- they're just a copy of other existing content. So for these
cases, I would recommend using a 301 redirect to the actual content
URLs. By using the redirect, you're making sure that we don't keep the
temporary URL in our index and try to recrawl it later on. Depending
on your site, this could be another source of savings when it comes to
not having to recrawl these URLs.
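
How the redirect is issued depends on the server or framework, but the decision itself can be sketched in a few lines of Python: strip the temporary parameters and, if any were present, answer with a 301 to the cleaned URL. The parameter names below are hypothetical examples; substitute whatever your site actually appends.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names used for sessions and tracking.
TEMPORARY_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}


def redirect_for(url, temporary=TEMPORARY_PARAMS):
    """Return (301, clean_url) if url carries temporary parameters, else None."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key.lower() not in temporary]
    if len(kept) == len(pairs):
        return None                       # nothing temporary: serve the page as usual
    clean = urlunsplit(parts._replace(query=urlencode(kept)))
    return 301, clean
```

Because the redirect is only issued when a parameter was actually removed, the cleaned URL never redirects again, which avoids a redirect loop.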

Hope it helps!
John


 
cristina  
From: cristina
Date: Wed, 27 Aug 2008 16:08:17 -0700 (PDT)
Local: Wed, Aug 27 2008 7:08 pm
Subject: Re: Googlebot crawling *way* too fast?
Hi John,
I was wondering, sorry :)
whether more accurate timestamps in Google Webmaster Tools
(for example, for when the robots.txt file was last accessed,
or for web crawl errors), or maybe even the IP address of the
visiting Googlebot, might help a bit in situations like this
when doing a quick check of the server access logs.
Otherwise it can be difficult to know whether it was
a spoofing bot or something like that rather than Googlebot.

Cristina.


 
JohnMu Google employee  
From: JohnMu
Date: Fri, 29 Aug 2008 05:11:53 -0700 (PDT)
Local: Fri, Aug 29 2008 8:11 am
Subject: Re: Googlebot crawling *way* too fast?
Hi Cristina
That sounds like a good idea, I'll pass it on to the team to consider
it. Thanks!

John


 
cristina  
From: cristina
Date: Fri, 29 Aug 2008 05:26:37 -0700 (PDT)
Local: Fri, Aug 29 2008 8:26 am
Subject: Re: Googlebot crawling *way* too fast?
Hi John,
Thank you.
