Last week, when my server was affected by the explosion at The Planet,
Google's indexer found a directory that wasn't really there and, although
I have that in my robots file, it continues to try to index the
non-existent documents. They all return 404, and my webmaster tools shows
thousands of them. Google is reading robots.txt, so there are a few
possibilities:
- I don't know how to write a robots file
- Google is ignoring robots.txt
- some crawlers have a cached version
Hi newsblaze (what a name considering what happened to your
hoster :-)) and welcome to the groups!
Looking at your site, I'm not really sure which URLs you don't want
indexed. If you could post a sample URL, it would help to make things
clearer.
When I look at your robots.txt, I suspect I know what is happening (but
I'd need to know the URLs that should be blocked to be sure):
You have a fairly large generic ("User-agent: *") section and
relatively small detailed sections (like for "User-agent: googlebot").
Keep in mind that search engine crawlers will only follow the most
specific section and ignore all other sections (including the generic
one). In your case, the Googlebot would ONLY follow the directives
listed in that section. If you want the Googlebot to also follow all
of the directives in your generic section, you will have to copy them
into the "googlebot" section. The same is true for the other search
engine crawlers, so you may need to copy & paste a bit :-).
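To make the "most specific section wins" behaviour concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the paths and the "SomeOtherBot" name are made up for illustration and are not taken from the actual site; the standard-library parser is only a stand-in and may differ from Googlebot in other details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a generic group plus a small Googlebot group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: googlebot
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, so the generic rules are not applied to it.
googlebot_private = parser.can_fetch("Googlebot", "/private/page.html")
googlebot_cgi = parser.can_fetch("Googlebot", "/cgi-bin/script")

# A crawler without its own group falls back to the "*" group.
otherbot_private = parser.can_fetch("SomeOtherBot", "/private/page.html")

print(googlebot_private, googlebot_cgi, otherbot_private)
```

In this sketch /private/ stays open to Googlebot until the Disallow line is copied into the "googlebot" group as well.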
Are you sure that you really need to treat different search engines
differently?
It would be much simpler to have a set of rules that apply to all
crawlers, without trying to name the various crawlers.
As JohnMu has explained, if you have different sections for each
crawler, each individual crawler will obey ONLY the lines that are in
its own section AND IGNORE the general lines (the first ~12 lines in your
robots.txt), which I do not think was your intention.
I suspect that by making the robots.txt more complex than necessary,
you have made unnecessary mistakes and granted access to folders/
documents that you do not want ANY crawler to have.
Thanks Robbo.
That could be true, but only google found the folders that don't
exist.
All I wanted to do was to tell the crawler to stop trying to crawl
what doesn't exist.
> - although there are some images down there that I don't want to
> block.
> so that's why I don't have it done at the top level.
> I cleaned up the robots file to see if that will help.
> Alan
> On Jun 10, 3:05 pm, JohnMu wrote:
> > Hope it helps!
> > John
We generally only process one wildcard in each robots.txt directive.
In your case, I would recommend changing that to:
Disallow: /pix/*/mw/
Keep in mind that this will also block URLs such as
/pix/something/mw/otherthings. If you want to only block those that end
in /mw/, you could use:
Disallow: /pix/*/mw/$
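The difference between the two patterns can be checked with a small Python sketch. It assumes "*" matches any run of characters and a trailing "$" anchors the end of the URL path; the helper names pattern_to_regex and is_blocked are made up for this example, and this is only an approximation of the matching, not Googlebot's actual implementation.

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a Disallow pattern into a regex (assumed semantics, see above)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(pattern: str, path: str) -> bool:
    """Return True if the path matches the Disallow pattern from its start."""
    return pattern_to_regex(pattern).match(path) is not None

# Without "$", anything below a matching /mw/ folder is blocked as well.
print(is_blocked("/pix/*/mw/", "/pix/something/mw/otherthings"))   # True
# With "$", only URLs that end in /mw/ are blocked.
print(is_blocked("/pix/*/mw/$", "/pix/something/mw/otherthings"))  # False
print(is_blocked("/pix/*/mw/$", "/pix/something/mw/"))             # True
```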
and other names that replace the mw
- except of course the pix
Of course, all that is historical.
If I were starting now, knowing what I know, I'd have created a different
structure, but I don't want to bounce thousands of picture URLs out.