Discussions > Crawling, indexing, and ranking > robots.txt file - What The Heck!?!?!?
 
From: Jay Is The Boss
Date: Fri, 25 Jul 2008 23:17:01 -0700 (PDT)
Local: Sat, Jul 26 2008 2:17 am
Subject: robots.txt file - What The Heck!?!?!?
Hi everyone,

Google hates my robots.txt file (http://www.siamese-dream.com/robots.txt).

Seriously. Maybe I am doing something wrong with the formatting; maybe
the gods are playing tricks on me; maybe it's the "phantom hacker"
everyone blames when they screw up their own site...

In my robots.txt file, I have some statements like this:

User-agent: *
Disallow: /page/siam1/PROD/Our-Favorites

Yet pages with a URL like this still get indexed:

/page/siam1/PROD/Our-Favorites/sing-bowl-tibet

Do I need to put a trailing slash (one of these / thingies) at the
end of /page/siam1/PROD/Our-Favorites?

Oddly enough, when I go into Webmaster Tools and open the Analyze
robots.txt area, it says everything is fine. Google downloads my
robots.txt file every day and gets a 200 response code.

And when I use the Test URLs against this robots.txt file, it comes
back that the URL SHOULD be blocked by my robots.txt file.

But it ain't blocked. It still gets indexed.

Anyway, the only thing I can think of is POSSIBLY somehow this is
messed up because they are virtual directories (I THINK that is the
name for them, but I could be wrong - maybe they are 'symbolic'
directories?). What I mean is that I have a rewrite in my .htaccess
file that changes long and inarticulate urls such as

http://www.siamese-dream.com/Merchant2/merchant.mvc?Screen=PROD&Store...

into shorter URLs, such as this:

http://www.siamese-dream.com/page/siam1/PROD/Wall-Calendars/Dalai-Lam...

The rewrite condition looks like this (I think):

#Options +FollowSymLinks

RewriteEngine On

RewriteCond "%{QUERY_STRING}" =""
RewriteCond %{REQUEST_METHOD} ^GET$
RewriteRule ^Merchant2/merchant.mvc$ http://www.siamese-dream.com/Merchant2/merchant.mvc?Store_Code=siam1 [R,L]

RewriteCond %{HTTP_HOST} !^www.siamese-dream.com$ [NC]
RewriteRule ^(.*) http://www.siamese-dream.com/$1 [L,R=301]

RewriteRule ^page/(.*) /Merchant2/merchant.mvc?page=$1 [L]

So is there possibly something in the .htaccess file that is messing
up google's handling of the robots.txt file?

Thanks in advance,

Mark


 
From: Phil Payne
Date: Fri, 25 Jul 2008 23:30:27 -0700 (PDT)
Local: Sat, Jul 26 2008 2:30 am
Subject: Re: robots.txt file - What The Heck!?!?!?
On Jul 26, 7:17 am, Jay Is The Boss wrote:

> Hi everyone,
> Google hates my robots.txt file (http://www.siamese-dream.com/robots.txt
> )

Yup, it probably does.  It starts with a byte-order-mark:

(EF,BB,BF)User-agent:·*(LF)
Disallow:·/cgi-bin/(LF)
Disallow:·/newsadmin/(LF)
Disallow:·/php_stuff/(LF)
Disallow:·/phpMyAdmin/(LF)
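
One way to confirm what Phil describes is to look at the raw bytes of the file rather than the rendered text, since most browsers and editors hide the mark. The following is a minimal Python sketch and was not posted in the thread; the function name `starts_with_utf8_bom` and the sample bytes are made up for illustration.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the raw bytes begin with the UTF-8 byte order mark."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical samples standing in for a downloaded robots.txt.
with_bom = b"\xef\xbb\xbfUser-agent: *\nDisallow: /cgi-bin/\n"
without_bom = b"User-agent: *\nDisallow: /cgi-bin/\n"

print(starts_with_utf8_bom(with_bom))
print(starts_with_utf8_bom(without_bom))

# Show the first three bytes in hex, similar to the (EF,BB,BF) notation above.
print(with_bom[:3].hex(" ").upper())
```

To check a real file, the bytes would have to be read in binary mode (for example with `open(path, "rb")`), because a text-mode read may already have decoded or hidden the mark.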


 
From: cristina
Date: Sat, 26 Jul 2008 05:52:33 -0700 (PDT)
Local: Sat, Jul 26 2008 8:52 am
Subject: Re: robots.txt file - What The Heck!?!?!?
Another thing is that the robots.txt file
is quite large, about 26kB.

Do you have web crawl messages for
URLs blocked by robots.txt in
Google Webmaster Tools?

Cristina.

From: Jay Is The Boss
Date: Sat, 26 Jul 2008 08:27:23 -0700 (PDT)
Local: Sat, Jul 26 2008 11:27 am
Subject: Re: robots.txt file - What The Heck!?!?!?
Hi there, Phil:

Thanks for taking the time to look at it and respond.

You Said:

> It starts with a byte-order-mark:
> (EF,BB,BF)User-agent:·*(LF)
> Disallow:·/cgi-bin/(LF)
> Disallow:·/newsadmin/(LF)

I am a little confused. Do you mean that the (EF,BB,BF) and the (LF)
are IN my robots.txt file currently? I don't see them when I view the
robots.txt file in Firefox, nor when I download it and view it in a
program like Notepad...

Or do you mean that they SHOULD be in the file but that I don't have
them in there?

Thanks in advance.

Mark

From: Jay Is The Boss
Date: Sat, 26 Jul 2008 08:32:00 -0700 (PDT)
Subject: Re: robots.txt file - What The Heck!?!?!?
Cristina,

Thanks for writing. You asked:

> Do you have web crawl messages for
> URLs blocked by robots.txt in
> Google Webmaster Tools?

Yes, actually I get lots of duplicate Meta Description warnings for
the same product (I have an e-commerce site) when it appears in two
categories, and I only want it to be indexed in one category.

The product might be in one category (such as Bowls) and then in
another category such as Best-Sellers, Our-Favorites, or
Clearance-Items, and I would like Google to NOT index those categories
so as to avoid duplicate meta descriptions and duplicate titles.

Thanks.

From: webado
Date: Sat, 26 Jul 2008 08:59:41 -0700 (PDT)
Local: Sat, Jul 26 2008 11:59 am
Subject: Re: robots.txt file - What The Heck!?!?!?
Don't bother to disallow URLs which contain a session ID - such rules
will never match, since the session ID changes all the time.
Work with the URI prefix.

For instance, instead of:
Disallow: /Merchant2/merchant.mvc?Session_ID=ccc2479af93cbb60c6605662a2bde2af&Screen=CTGY&Store_Code=siam1&Category_Code=Chinese-Shirts

and others like it, use only a single:

Disallow: /Merchant2/merchant.mvc?Session_ID

Work on your robots.txt file to see whether you can group directives by
URI prefix rather than list each one separately.
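
The effect of a single prefix rule can be sketched in a few lines of Python. This is only an illustration of plain prefix matching, not the logic of any particular crawler; the helper `is_disallowed` and the session IDs in the sample URLs are invented for the example.

```python
def is_disallowed(url_path: str, disallow_prefixes: list[str]) -> bool:
    """Plain prefix matching: blocked if the path starts with any non-empty rule."""
    return any(prefix and url_path.startswith(prefix) for prefix in disallow_prefixes)


# The single rule suggested above.
rules = ["/Merchant2/merchant.mvc?Session_ID"]

# Hypothetical URLs; real session IDs differ on every visit.
urls = [
    "/Merchant2/merchant.mvc?Session_ID=aaaa1111&Screen=CTGY&Store_Code=siam1",
    "/Merchant2/merchant.mvc?Session_ID=bbbb2222&Screen=PROD&Store_Code=siam1",
    "/Merchant2/merchant.mvc?Screen=PROD&Store_Code=siam1",
]

for url in urls:
    print(is_disallowed(url, rules), url)
```

Note that under plain prefix matching the rule only catches URLs where Session_ID is the first query parameter, which is why the third sample URL is not matched.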

From: webado
Date: Sat, 26 Jul 2008 09:04:56 -0700 (PDT)
Local: Sat, Jul 26 2008 12:04 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
I see pages indexed as of July 8 which ought to be disallowed.
When did you change your robots.txt file to disallow, for instance,
/page/siam1/PROD/Our-Favorites?

If it's after that date, then you need to wait until that part of the
site gets scheduled to be recrawled.

Of course if the robots.txt file shows the BOM then this needs to get
fixed first.

From: Jay Is The Boss
Date: Sat, 26 Jul 2008 10:04:26 -0700 (PDT)
Local: Sat, Jul 26 2008 1:04 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
Hi there, Webado;

Thanks for your responses. You mentioned:

> When did you change your robots.txt file to disallow for instance
> /page/siam1/PROD/Our-Favorites

I am not 100% certain. It might not have been until July 15th,
although I suspected it was earlier.

When I look at my diagnostics and crawl stats in Webmaster Tools, it
says last updated July 25th (today's date, by the way), but it might
ALWAYS say today's date, even if it hasn't crawled in a while...

Also, you had said this:

> Of course if the robots.txt file shows the BOM then this needs to get
> fixed first.

I am sorry, but what do you mean by "BOM" ? It is probably something
totally obvious but I am having a brain cramp...

Thanks in advance,

Mark

From: webado
Date: Sat, 26 Jul 2008 10:54:48 -0700 (PDT)
Local: Sat, Jul 26 2008 1:54 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
BOM = byte order mark - the rogue invisible characters present in the
file before the first actual visible character of the file.

It's what Phil mentioned he saw using the HTTP viewer.

From: Jay Is The Boss
Date: Sat, 26 Jul 2008 11:23:25 -0700 (PDT)
Local: Sat, Jul 26 2008 2:23 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
Thanks again, Webado!

Do you recommend a particular text editor for Windows so that I can
spot / remove the byte order marks?

I usually edit the robots.txt file in notepad and I can't see them.

Thanks in advance,

Mark

From: JohnMu (Google employee)
Date: Sat, 26 Jul 2008 11:26:58 -0700 (PDT)
Local: Sat, Jul 26 2008 2:26 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
Hi Mark and welcome to the groups!

Phil was right on target there: it seems the BOM at the beginning of
the file might be throwing us off. The easiest way to get around this
issue is to have an empty line (or a comment) at the top of your
robots.txt file -- that way it'll work even if you have a BOM in your
file.

Hope it helps!
John

PS Webado's right in that it makes sense to use URL fragments which
are as general as possible in the robots.txt. That way, you do not
need that many disallow directives, which makes it easier for you to
keep track of what is disallowed (and helps you to maintain it over
time).
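
Why an empty line or a comment at the top helps can be illustrated with a toy parser. The sketch below is a deliberately naive, line-based reader written for this example; it is not Google's parser, and `read_records` and `KNOWN_FIELDS` are invented names. It assumes the BOM ends up glued to whatever comes first in the file.

```python
BOM = "\ufeff"
KNOWN_FIELDS = {"user-agent", "disallow", "allow"}


def read_records(text: str) -> list[tuple[str, str]]:
    """Return the (field, value) pairs a naive line-based reader would recognise."""
    records = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and outer whitespace
        if ":" not in line:
            continue  # blank, comment-only, or junk line
        field, value = line.split(":", 1)
        field = field.strip().lower()
        if field in KNOWN_FIELDS:  # "\ufeffuser-agent" is not in the set
            records.append((field, value.strip()))
    return records


body = "User-agent: *\nDisallow: /cgi-bin/\n"

print(read_records(BOM + body))                     # BOM glued to User-agent
print(read_records(BOM + "# robots.txt\n" + body))  # BOM absorbed by a comment
print(read_records(BOM + "\n" + body))              # BOM absorbed by an empty line
```

In the first case this toy reader loses the User-agent line and is left with a Disallow rule that belongs to no group; in the other two the BOM lands on a line that would have been ignored anyway.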


 
From: Jay Is The Boss
Date: Sat, 26 Jul 2008 11:43:44 -0700 (PDT)
Local: Sat, Jul 26 2008 2:43 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
Thank you, John!

I used Crimson Editor to view and edit the file, so I am hoping that I
was able to remove the byte order marks.

(They seemed to me to be things like an upside down question mark,
upside down exclamation point, and some other symbol I had never seen
before, as opposed to the (EF,BB,BF) that Mr. Payne had mentioned.)

Again, if anyone has a recommendation for a particularly good text
editor to use with robots.txt files and .htaccess files, I would
greatly appreciate it.

Thanks for all the help,

Mark
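
For anyone without a suitable editor, the mark can also be removed programmatically. The snippet below is a small Python sketch under that assumption; `strip_utf8_bom` is a made-up helper name and the sample bytes are hypothetical. The last line also shows how the same three bytes look when a program reads them as Windows-1252 text, which may be roughly what showed up as odd symbols in the editor.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = b"\xef\xbb\xbfUser-agent: *\n"  # hypothetical file contents
print(strip_utf8_bom(raw))

# EF BB BF interpreted as Windows-1252 becomes three visible characters,
# the last of which is an upside down question mark.
print(codecs.BOM_UTF8.decode("cp1252"))
```

A file would need to be read and written back in binary mode for this to apply to it unchanged otherwise.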

From: Phil Payne
Date: Sat, 26 Jul 2008 11:47:32 -0700 (PDT)
Local: Sat, Jul 26 2008 2:47 pm
Subject: Re: robots.txt file - What The Heck!?!?!?

> Phil was right on target there, it seems the BOM at the beginning of
> the file might be throwing us off. The easiest way to get around this
> issue is to have an empty line (or a comment) in the top of your
> robots.txt file -- that way it'll work even if you have a BOM in your
> file.

It _should_ throw you off, since robots.txt is implicitly UTF-8 and
the BOM isn't.

It shouldn't make any difference whether the BOM hits a comment line
or anything else - its presence invalidates UTF-8 and the crawler
should discard it.

I viewed it with http://www.rexswain.com/httpview.html


 
From: webado
Date: Sat, 26 Jul 2008 11:52:48 -0700 (PDT)
Local: Sat, Jul 26 2008 2:52 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
Seems to be fine now when viewed in
http://www.rexswain.com/httpview.html

I wonder what might  have caused it in the first place.

From: Jay Is The Boss
Date: Sat, 26 Jul 2008 12:30:29 -0700 (PDT)
Local: Sat, Jul 26 2008 3:30 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
Hi again, Phil and webado;

Thanks for the link to the viewer.

After I read webado's previous post, I just downloaded the robots.txt
file, edited it in Crimson Editor, and uploaded it again, so I guess
that has fixed it.

I usually edit in something simple like Notepad, so maybe that added
it?

Thanks again for all your help,

Mark
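
Whether Notepad was the culprit is not confirmed anywhere in this thread. As general background, some programs write a signature at the start of files they save as UTF-8, and Python's `utf-8-sig` codec does the same, so it can serve as a stand-in to show how the same text ends up three bytes longer. The sample text is hypothetical.

```python
import codecs

text = "User-agent: *\nDisallow: /cgi-bin/\n"

plain = text.encode("utf-8")       # no signature
signed = text.encode("utf-8-sig")  # same text with the BOM prepended

print(plain.startswith(codecs.BOM_UTF8))
print(signed.startswith(codecs.BOM_UTF8))
print(len(signed) - len(plain))
```

Saving the file as plain ASCII or as UTF-8 without a signature, where the editor offers that choice, avoids the extra bytes.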

From: Jay Is The Boss
Date: Sat, 26 Jul 2008 12:39:05 -0700 (PDT)
Local: Sat, Jul 26 2008 3:39 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
Hey webado:

One more thing you might be able to help with, since you touched upon
it before:

> Work on your robots.txt file to see if you cannot group directives by
> uri prefix rather than list each one separately.

Since I have many URIs in the robots.txt file that are somewhat
similar, such as the ones below:

Disallow: /page/siam1/PROD/angie-tops-22Y30-PU13
Disallow: /page/siam1/PROD/angie-tops-22Z43-FC65
Disallow: /page/siam1/PROD/angie-dress-L4021-WJ74a
Disallow: /page/siam1/PROD/angie-dress-Q4D19-EJ80

Can I instead have JUST ONE LINE that would be:

Disallow: /page/siam1/PROD/angie-

And that would block all of those four entries?

Thanks in advance,

Mark

From: Phil Payne
Date: Sat, 26 Jul 2008 16:09:40 -0700 (PDT)
Local: Sat, Jul 26 2008 7:09 pm
Subject: Re: robots.txt file - What The Heck!?!?!?

> Disallow: /page/siam1/PROD/angie-tops-22Y30-PU13
> Disallow: /page/siam1/PROD/angie-tops-22Z43-FC65
> Disallow: /page/siam1/PROD/angie-dress-L4021-WJ74a
> Disallow: /page/siam1/PROD/angie-dress-Q4D19-EJ80

> Can I instead have JUST ONE LINE that would be:

> Disallow: /page/siam1/PROD/angie-

> And that would block all of those four entries?

Absolutely.

In the de facto standard, the relative URI you provide is described as
a "URI prefix" - anything matching those characters up to the end of
what you specify will be excluded by a well-behaved bot.  It's a kind
of implicit wildcard - you can, if you wish, imagine a * at the end of
what you specify.
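
The prefix behaviour can be tried out with the robots.txt parser in Python's standard library. This is only a sketch: `urllib.robotparser` is Python's own implementation and not Googlebot, and the `/page/siam1/PROD/Bowls` path is a hypothetical URL added to show a non-matching case.

```python
from urllib.robotparser import RobotFileParser

robots_lines = [
    "User-agent: *",
    "Disallow: /page/siam1/PROD/angie-",
    "Disallow: /page/siam1/PROD/Our-Favorites",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules from a list of lines, no network access

base = "http://www.siamese-dream.com"
paths = [
    "/page/siam1/PROD/angie-tops-22Y30-PU13",
    "/page/siam1/PROD/angie-dress-Q4D19-EJ80",
    "/page/siam1/PROD/Our-Favorites/sing-bowl-tibet",
    "/page/siam1/PROD/Bowls",  # hypothetical path, not covered by any rule
]

for path in paths:
    # can_fetch returns False when a Disallow prefix matches the URL path.
    print(parser.can_fetch("*", base + path), path)
```

With this parser the Our-Favorites rule also covers the deeper sing-bowl-tibet URL without a trailing slash on the rule, which is consistent with the prefix description above.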


 
From: Jay Is The Boss
Date: Sat, 26 Jul 2008 17:53:45 -0700 (PDT)
Local: Sat, Jul 26 2008 8:53 pm
Subject: Re: robots.txt file - What The Heck!?!?!?
Thank you, Phil!

Mark
