newsblaze  
From: newsblaze
Date: Tue, 10 Jun 2008 13:48:32 -0700 (PDT)
Local: Tues, Jun 10 2008 4:48 pm
Subject: Google Ignores robots.txt file
Last week, when my server was affected by the explosion at The Planet,
the Google indexer found a directory that wasn't really there and, although
I have that directory in my robots file, it continues to try to index the
non-existent documents. They all return 404, and my webmaster tools shows
thousands of them. Google is reading robots.txt, so there are a few
possibilities:
- I don't know how to write a robots file
- Google is ignoring robots.txt
- some crawlers have a cached version

http://newsblaze.com


 
JohnMu Google employee  
From: JohnMu
Date: Tue, 10 Jun 2008 15:05:11 -0700 (PDT)
Local: Tues, Jun 10 2008 6:05 pm
Subject: Re: Google Ignores robots.txt file
Hi newsblaze (what a name considering what happened to your
hoster :-)) and welcome to the groups!

Looking at your site, I'm not really sure which URLs you don't want
indexed. If you could post a sample URL, it would help to make things
clearer.

When I look at your robots.txt, I suspect I know what is happening (but
I'd need to know the URLs that should be blocked to be sure):

You have a fairly large generic ("User-agent: *") section and
relatively small detailed sections (like for "User-agent: googlebot").
Keep in mind that search engine crawlers will only follow the most
specific section and ignore all other sections (including the generic
one). In your case, the Googlebot would ONLY follow the directives
listed in that section. If you want the Googlebot to also follow all
of the directives in your generic section, you will have to copy them
into the "googlebot" section. The same is true for the other search
engine crawlers, so you may need to copy & paste a bit :-).

Hope it helps!
John
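
The section-selection behavior John describes can be tried out with a minimal sketch like the one below. It uses Python's standard `urllib.robotparser` as a stand-in for a real crawler, so it only approximates what Googlebot does. The robots.txt content, the `/private/` and `/tmp/` paths, and the `SomeOtherBot` name are made up for illustration and are not taken from newsblaze.com.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one generic section and one googlebot section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: googlebot
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler with no section of its own falls back to the generic rules.
generic_blocked = not parser.can_fetch("SomeOtherBot", "/private/page.html")

# Googlebot has its own section, so only that section is consulted and
# the generic "Disallow: /private/" is not applied to it.
googlebot_blocked = not parser.can_fetch("Googlebot", "/private/page.html")

print("SomeOtherBot blocked from /private/:", generic_blocked)
print("Googlebot blocked from /private/:", googlebot_blocked)
```

In this sketch the generic crawler is blocked from `/private/` while Googlebot is not, which is why the generic directives would have to be copied into the "googlebot" section to apply there as well.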


 
Robbo  
From: Robbo
Date: Tue, 10 Jun 2008 15:47:32 -0700 (PDT)
Local: Tues, Jun 10 2008 6:47 pm
Subject: Re: Google Ignores robots.txt file
newsblaze

Are you sure that you really need to treat different search engines
differently?

It would be much simpler to have a set of rules that apply to all
crawlers, without trying to name the various crawlers.

As JohnMu has explained, if you have different sections for each
crawler, each individual crawler will obey ONLY the lines that are in
that section AND IGNORE the general lines (the first ~12 lines in your
robots.txt), which I do not think was your intention.

I suspect that by making the robots.txt more complex than necessary,
you have made unnecessary mistakes and granted access to folders/
documents that you do not want ANY crawler to have.

Robbo



 
newsblaze  
From: newsblaze
Date: Thu, 12 Jun 2008 23:48:33 -0700 (PDT)
Local: Fri, Jun 13 2008 2:48 am
Subject: Re: Google Ignores robots.txt file
Thanks Robbo.
That could be true, but only Google found the folders that don't
exist.
All I wanted to do was to tell the crawler to stop trying to crawl
what doesn't exist.

 
newsblaze  
From: newsblaze
Date: Wed, 18 Jun 2008 10:42:16 -0700 (PDT)
Local: Wed, Jun 18 2008 1:42 pm
Subject: Re: Google Ignores robots.txt file
Thanks John.
Sorry I forgot to post back.

These are the ones

Disallow: /pix/*/*/mw/

- although there are some images down there that I don't want to
block, so that's why I don't have it done at the top level.

I cleaned up the robots file to see if that will help.

Alan



 
webado  
From: webado
Date: Wed, 18 Jun 2008 11:16:05 -0700 (PDT)
Local: Wed, Jun 18 2008 2:16 pm
Subject: Re: Google Ignores robots.txt file
I see no files at all in the folder called /pix/.

So you can disallow the whole thing with a simple line:

Disallow: /pix/

Easier than that complex line you have.



 
JohnMu Google employee  
From: JohnMu
Date: Thu, 19 Jun 2008 01:01:32 -0700 (PDT)
Local: Thurs, Jun 19 2008 4:01 am
Subject: Re: Google Ignores robots.txt file
Hi Alan

We generally only process one wildcard in each robots.txt directive.
In your case, I would recommend changing that to:
Disallow: /pix/*/mw/

Keep in mind that this will also block URLs such as
/pix/something/mw/otherthings. If you want to only block those that end
in /mw/, you could use:
Disallow: /pix/*/mw/$

Hope it helps!

John
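
The difference between the two rules can be illustrated with a rough sketch that translates a rule into a regular expression. This is not Google's implementation; it assumes that `*` matches any run of characters (including `/`), that a trailing `$` anchors the end of the URL path, and that a rule without `$` is a prefix match. The helper names `rule_to_regex` and `is_blocked` are hypothetical.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the path, and anything else is a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(rule: str, path: str) -> bool:
    # re.match anchors at the start of the path, as a robots.txt rule does.
    return rule_to_regex(rule).match(path) is not None

prefix_rule = "/pix/*/mw/"
exact_rule = "/pix/*/mw/$"

for path in ("/pix/something/mw/", "/pix/something/mw/otherthings"):
    print(path, is_blocked(prefix_rule, path), is_blocked(exact_rule, path))
```

Under these assumptions, both rules match `/pix/something/mw/`, but only the rule without `$` also matches `/pix/something/mw/otherthings`.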


 
newsblaze  
From: newsblaze
Date: Thu, 26 Jun 2008 08:08:08 -0700 (PDT)
Local: Thurs, Jun 26 2008 11:08 am
Subject: Re: Google Ignores robots.txt file
Thank you webado and JohnMu

I have images in that structure:

/pix/2008/0626/pix/Professor_ramakant.jpg

so I don't want to block everything.
And as you see they are dated, so the ones that need to be blocked are
many (and they all return 404):

/pix/2008/0623/mw/
/pix/2008/0624/mw/
/pix/2008/0625/mw/
/pix/2008/0626/mw/

- 4 years of them!

and other names that replace the mw
- except of course the pix

Of course, all that is historical.
If starting now and knowing what I know, I'd have created a different
structure,
but I don't want to bounce thousands of picture URLs out.

Alan.
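
As a quick sanity check on the dated layout above, the sketch below tests the URLs listed in this post against the single-wildcard rule `Disallow: /pix/*/mw/`. The regular expression is a hand translation of that rule, assuming `*` matches any run of characters including `/`; it is an approximation for illustration, not how any particular crawler evaluates rules. The other directory names that replace `mw` are not listed in the post, so they are not covered here.

```python
import re

# Hand translation of "Disallow: /pix/*/mw/" (assumption: '*' also spans '/').
RULE_REGEX = re.compile(r"/pix/.*/mw/")

# URLs taken from the post above.
paths = [
    "/pix/2008/0623/mw/",
    "/pix/2008/0624/mw/",
    "/pix/2008/0625/mw/",
    "/pix/2008/0626/mw/",
    "/pix/2008/0626/pix/Professor_ramakant.jpg",
]

# re.match anchors at the start of the path, so this acts as a prefix rule.
blocked = [p for p in paths if RULE_REGEX.match(p)]
allowed = [p for p in paths if not RULE_REGEX.match(p)]

print("blocked:", blocked)
print("allowed:", allowed)
```

Under that assumption, the four dated `/mw/` directories are blocked and the image URL stays crawlable, without listing each date separately.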



 