Seriously. Maybe I am doing something wrong with the formatting; maybe
the gods are playing tricks on me; maybe it's the "phantom hacker"
everyone blames when they screw up their own site...
In my robots.txt file, I have some statements like this:
Do I need to put a trailing slash (one of these / thingies) at the
end of /page/siam1/PROD/Our-Favorites ?
Oddly enough, when I go into Webmaster Tools and the "Analyze
robots.txt" area, it says everything is fine. Google downloads my
robots.txt file every day, and gets a 200 code.
And when I use the Test URLs feature against this robots.txt file, it
comes back that the URL SHOULD be blocked by my robots.txt file.
But it ain't blocked. It still gets indexed.
Anyway, the only thing I can think of is POSSIBLY somehow this is
messed up because they are virtual directories (I THINK that is the
name for them, but I could be wrong - maybe they are 'symbolic'
directories?). What I mean is that I have a rewrite in my .htaccess
file that changes long and inarticulate urls such as
Thanks for taking the time to look at it and respond.
You Said:
> It starts with a byte-order-mark:
> (EF,BB,BF)User-agent:·*(LF)
> Disallow:·/cgi-bin/(LF)
> Disallow:·/newsadmin/(LF)
I am a little confused. Do you mean that the (EF,BB,BF) and the (LF)
are IN my robots.txt file currently? I don't see them when I view the
robots.txt file in firefox, nor when I download it and view it in a
program like notepad...
Or do you mean that they SHOULD be in the file but that I don't have
them in there?
> Do you have web crawl messages for
> URLs blocked by robots.txt in
> Google Webmaster Tools?
Yes, actually I get lots of duplicate Meta Description warnings for
the same product (I have an e-commerce site) when it appears in two
categories, and I only want it to be indexed in one category.
The product might be in one category (such as Bowls) and then in
another category such as Best-Sellers or Our-Favorites, or Clearance-
Items and I would like google to NOT index those categories so as to
avoid duplicate meta descriptions and duplicate titles.
> On Jul 26, 5:52 am, cristina wrote:
> > Another thing is that the robots.txt file
> > is quite large, about 26kB.
> > Do you have web crawl messages for
> > URLs blocked by robots.txt in
> > Google Webmaster Tools?
> > Cristina.
> > On Jul 26, 7:30 am, Phil Payne wrote:
> > > It starts with a byte-order-mark
I see pages indexed which ought to be disallowed as of July 8.
When did you change your robots.txt file to disallow, for instance,
/page/siam1/PROD/Our-Favorites ?
If it's after that date, then you need to wait until that part of the
site gets scheduled to be recrawled.
Of course if the robots.txt file shows the BOM then this needs to get
fixed first.
> Don't bother to disallow URLs which contain a session id - the block
> will never take effect, since the session id changes all the time.
> Work with the URI prefix.
> Work on your robots.txt file to see if you can group directives by
> URI prefix rather than list each one separately.
> When did you change your robots.txt file to disallow, for instance,
> /page/siam1/PROD/Our-Favorites ?
I am not 100% certain. It might not have been until July 15th,
although I suspected it was earlier.
When I look at my diagnostics and crawl stats in Web Master tools, it
says last updated July 25th (today's date, by the way) but it might
ALWAYS say today's date, even if it hasn't crawled in a while...
Also, you had said this:
> Of course if the robots.txt file shows the BOM then this needs to get
> fixed first.
I am sorry, but what do you mean by "BOM"? It is probably something
totally obvious but I am having a brain cramp...
> BOM = byte order mark - the rogue invisible bytes present in the
> file before the first actual visible character of the file.
> That is what Phil mentioned he saw using the HTTP viewer.
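For anyone who wants to check their own file: the BOM only shows up
when you look at the raw bytes, not at what a browser or Notepad
displays. Here is a minimal Python sketch, assuming the file has
already been read as bytes. The helper name strip_utf8_bom and the
sample content are made up for illustration.

```python
import codecs


def strip_utf8_bom(data):
    """Return (bytes without a leading UTF-8 BOM, True if a BOM was found)."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False


# Illustrative content only, modelled on the lines Phil quoted.
sample = b"\xef\xbb\xbfUser-agent: *\nDisallow: /cgi-bin/\n"
cleaned, had_bom = strip_utf8_bom(sample)
print("BOM found:", had_bom)
print(cleaned.decode("ascii"))
```

To run this against a real file, read it with open(path, "rb").read()
so that no decoding happens before the check.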
Phil was right on target there, it seems the BOM at the beginning of
the file might be throwing us off. The easiest way to get around this
issue is to have an empty line (or a comment) in the top of your
robots.txt file -- that way it'll work even if you have a BOM in your
file.
Hope it helps!
John
PS Webado's right in that it makes sense to use URL fragments which
are as general as possible in the robots.txt. That way, you do not
need that many disallow directives, which makes it easier for you to
keep track of what is disallowed (and helps you to maintain it over
time).
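To illustrate why a leading comment or blank line can help, here is a
deliberately naive, hypothetical line-based reader. The name
parse_rules is invented, and this is NOT a description of how Googlebot
or any real crawler parses the file. It is only a sketch of one way a
BOM glued to the first directive could cause that directive to be
missed, while a BOM glued to a comment line does no harm.

```python
def parse_rules(text):
    """Hypothetical naive robots.txt reader: returns (field, value) pairs."""
    rules = []
    for line in text.splitlines():
        # Drop comments and surrounding whitespace. Note that strip()
        # does not remove U+FEFF, so a BOM stays attached to the line.
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()
        if field in ("user-agent", "disallow", "allow"):
            rules.append((field, value.strip()))
    return rules


BOM = "\ufeff"  # what the bytes EF BB BF become once decoded as UTF-8

# BOM directly in front of the first directive: that line is not recognised.
broken = BOM + "User-agent: *\nDisallow: /cgi-bin/\n"
# BOM in front of a comment line: the comment absorbs it.
padded = BOM + "# robots.txt\nUser-agent: *\nDisallow: /cgi-bin/\n"

print(parse_rules(broken))
print(parse_rules(padded))
```

In this sketch the first call loses the User-agent line, which leaves
the Disallow rule without a group to belong to. Removing the BOM from
the file is still the cleaner fix.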
I used Crimson Editor to view and edit the file, so I am hoping that I
was able to remove the byte order marks.
(They seemed to me to be things like an upside down question mark,
upside down exclamation point, and some other symbol I had never seen
before, as opposed to the (EF,BB,BF) that Mr. Payne had mentioned.)
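A possible explanation for the odd symbols, offered as a guess rather
than a diagnosis: EF, BB, BF are the hexadecimal values of the three
BOM bytes, and an editor that shows the file in a single-byte encoding
will draw each byte as its own character. The small Python sketch
below assumes Windows-1252 as that encoding, which may not be what
Crimson Editor actually used.

```python
bom = b"\xef\xbb\xbf"

# Hexadecimal values of the three bytes, as Phil listed them.
print([format(b, "02X") for b in bom])

# The same bytes drawn one character per byte (assumed encoding: cp1252).
# The last of the three characters is an inverted question mark.
print(bom.decode("cp1252"))

# Decoded as UTF-8 the three bytes form one invisible character, U+FEFF.
print(repr(bom.decode("utf-8")))

# Python's "utf-8-sig" codec drops a leading BOM while decoding.
print(repr(bom.decode("utf-8-sig")))
```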
Again, if anyone has a recommendation for a particularly good text
editor to use with robots.txt files and .htaccess files, I would
greatly appreciate it.
> Phil was right on target there, it seems the BOM at the beginning of
> the file might be throwing us off. The easiest way to get around this
> issue is to have an empty line (or a comment) in the top of your
> robots.txt file -- that way it'll work even if you have a BOM in your
> file.
It _should_ throw you off, since robots.txt is implicitly UTF-8 and
the BOM isn't.
It shouldn't make any difference whether the BOM hits a comment line
or anything else - its presence invalidates UTF-8 and the crawler
should discard it.
> I wonder what might have caused it in the first place.
In the de facto standard, the relative URI you provide is described as
a "URI prefix" - anything matching those characters up to the end of
what you specify will be excluded by a well-behaved bot. It's a kind
of implicit wildcard - you can, if you wish, imagine a * at the end of
what you specify.
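The prefix rule described above can be sketched in a few lines of
Python. This is only an illustration of plain prefix matching, not any
particular crawler's code. The function name is_disallowed and the
example paths ending in "angie-bowls" and "item-1" are invented for the
demonstration.

```python
def is_disallowed(path, disallow_prefixes):
    """True if the URL path starts with any non-empty Disallow prefix."""
    # An empty Disallow value conventionally blocks nothing, so skip it.
    return any(path.startswith(prefix) for prefix in disallow_prefixes if prefix)


prefixes = [
    "/page/siam1/PROD/angie-",         # one short prefix covers many URLs
    "/page/siam1/PROD/Our-Favorites",  # no trailing slash needed
]

for path in (
    "/page/siam1/PROD/angie-bowls",           # hypothetical URL
    "/page/siam1/PROD/Our-Favorites/item-1",  # hypothetical URL
    "/page/siam1/PROD/Bowls",
):
    print(path, "->", "blocked" if is_disallowed(path, prefixes) else "allowed")
```

Because matching is by prefix, a rule without a trailing slash also
covers everything underneath that path, plus any sibling path that
merely starts with the same characters.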
> > Can I instead have JUST ONE LINE that would be:
> > Disallow:·/page/siam1/PROD/angie-
> > And that would block all of those four entries?
> Absolutely.