Hi.
We have implemented a URL validation step when we process a response
to make sure that when people call a page they use the correct URL.
If they use an incorrect URL, then they are sent a 301 redirect with
the correct URL.
The problem with Googlebot is that even though that is the URL we put
in the sitemap, it doesn't use that URL to make the request - it
contracts it down to:
http://www.domain.com/?whatever=value
So our server sees this 'incorrect' URL, issues a 301 with the
'correct' URL (that has the index.html bit in it), but then Googlebot
doesn't follow that URL faithfully and again tries to request the URL
without index.html in the path. So our server again issues a 301
redirect, with the correct URL and here we go off on our infinite
loop.
So no wonder we get the error message:
URLs not followed.... [sitemap] contained too many redirects.
I think this is a bug: the 301 redirect clearly sends the redirect
URL, and if Googlebot followed this redirect URL faithfully we
wouldn't see this issue.
I'd appreciate a response from the Google team, Vanessa?
Googlebot doesn't invent urls. It found that url somewhere. It doesn't
only crawl urls you have in the sitemap, but also all urls found
anywhere on the web (including on your site) that appear to be on
your domain.
That said, you may be having other navigation problems.
Googlebot has stripped out the index.html - presumably because it
thinks it can get away with not using the default page in the path.
But this is wrong...
Because this is in the context of the sitemap, it isn't going to
report on errors or problems with urls discovered elsewhere on the
web.
> What's the url?
> On Jul 21, 12:46 pm, edralph888 wrote:
No. It finds them invented elsewhere, I'm 100% sure of that.
Most of the time it is due to broken navigation on the same site,
coupled with improper server responses.
Other times it's broken links found elsewhere coupled with improper
server response.
> On Jul 21, 6:06 pm, webado wrote:
"Sitemaps are particularly beneficial when users can't reach all areas
of a website through a browseable interface. (Generally, this is when
users are unable to reach certain pages or regions of a site by
following links). For example, any site where certain pages are only
accessible via a search form would benefit from creating a Sitemap and
submitting it to search engines."
Are you forgetting the fly-by-night mock-directories of spam with
invented links to other sites? Now you see it, now you don't? They
existed just at the time they were crawled by Googlebot?
How about hacked sites with tons of links added? Or extra pages which
eventually get cleaned out?
Thanks for the posts guys, but neither of you is addressing the point
of my original post which can be summed up thus:
When following a non-redirected url in the format www.domain.com/index.html?blah=whatever,
Googlebot makes the request to this page by contracting the url to
this: www.domain.com/?blah=whatever. I know that this isn't due to
Googlebot picking up this url from elsewhere (which would have
explained it fine) because the specific error occurs in the sitemap
section of webmaster tools, reporting specifically on what Googlebot
found when crawling sitemap urls.
Now, just assume for a moment that what I said is true, and Googlebot
does omit index.html when crawling sitemap urls.
My webserver is configured to return a 301 with the official URL when
a page is called with the incorrect URL. As Googlebot is requesting
these pages with the index.html omitted, it is an incorrect URL and it
issues a 301 with the official URL containing 'index.html'. But -
this is why the sitemap reports an error - Googlebot proceeds to
follow the 301 redirect by *again* omitting 'index.html' and therefore
we end up in an infinite loop.
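The loop described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: `strip_index`, `server`, `crawl` and the `MAX_REDIRECTS` limit are hypothetical names standing in for the crawler-side stripping and the 301 rule on my server, not anything Google has published.

```python
from urllib.parse import urlsplit, urlunsplit

MAX_REDIRECTS = 5  # hypothetical limit; the real crawler's limit is not known


def strip_index(url: str) -> str:
    """Crawler side (assumed): drop a trailing 'index.html' from the path."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def server(url: str):
    """Server side: 301 to the official URL when 'index.html' is missing."""
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        official = urlunsplit(
            (parts.scheme, parts.netloc, parts.path + "index.html", parts.query, "")
        )
        return 301, official
    return 200, None


def crawl(url: str) -> str:
    """Follow redirects, stripping 'index.html' before every request."""
    for _ in range(MAX_REDIRECTS):
        status, location = server(strip_index(url))
        if status == 200:
            return "ok"
        url = location  # the redirect target is stripped again on the next pass
    return "too many redirects"


print(crawl("http://www.domain.com/index.html?blah=whatever"))
```

Every pass sends the stripped URL, gets the same 301 back, and strips it again, so the request never reaches a 200.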
> On Jul 23, 7:13 am, Phil Payne wrote:
> > On Jul 23, 12:06 pm, webado wrote:
> > > No. It finds them invented elsewhere, I'm 100% sure of that.
> > And once again, Christina - if that were the case they would be found
> > by others elsewhere.
Further to this, there has been a reply from a Googler in another
thread - basically he says that it is 'usually fine' to strip
index.html from requests - but clearly it isn't. I haven't checked
but I'm pretty sure it isn't RFC compliant and it is causing many
people a headache. Please stop stripping index.html!!
> Does this make sense?
> On Jul 23, 12:19 pm, webado wrote:
> > We can argue on this until the cows come home.
It's not Googlebot doing any stripping of anything.
It's your sitemap that has them, and/or your server that does it, and/or
such links are found elsewhere on the web.
I wonder if your implementation of the redirection to root based on
the presence of index.html in the url isn't faulty and it ends up
applying also to cases where you have a query string.
> Please give the url.
> On Aug 26, 5:36 am, edralph wrote:
URLs not followed
When we tested a sample of the URLs from your Sitemap, we found that
some URLs were not accessible to Googlebot because they contained too
many redirects. Please change the URLs in your Sitemap that redirect
and replace them with the destination URL (the redirect target). All
valid URLs will still be submitted.
Well guess what: my sitemap already has all the "destination URLs (the
redirect target)". It is Google's scheduler that changes the URLs and
passes the crippled URLs to the crawler.
webado: please accept it as it is, it IS a google bug, not a
feature :)
Google seems to do a "search and replace" for "index.html" with "".
So I decided to encode the "index.html" substring of my sitemap urls
in the form of: & #105;& #110;& #100;& #101;& #120;& #046;& #104;& #116;& #109;& #108;
But no luck. I guess that Google first decodes the URLs into UTF-8 and
then does the buggy string replacement.
Is there a way to avoid the string replacement? Some info in the xml
file maybe?
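A quick way to see why the encoding trick cannot help: numeric character references are resolved by any XML parser before the application sees the URL. The sketch below assumes the references were written without the spaces shown in the post, and uses www.example.com as a placeholder host.

```python
import xml.etree.ElementTree as ET

# "index.html" written as decimal character references (no spaces assumed)
encoded = "&#105;&#110;&#100;&#101;&#120;&#046;&#104;&#116;&#109;&#108;"
entry = f"<loc>http://www.example.com/{encoded}?blah=whatever</loc>"

# The parser decodes the references while reading the element text
loc = ET.fromstring(entry).text
print(loc)
```

After parsing, the value is indistinguishable from a plain "index.html", so whatever processes the sitemap afterwards sees the same string either way.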
and which, according to the robots.txt directive
Disallow: /HEAD/doc/api/
is disallowed.
You got to agree this is confusing in the extreme.
I haven't checked other situations.
You also have to keep in mind that the sitemap is only an accessory, it
does not define the set of urls belonging to the website. Crawling the
website starting at the root (and/or any other links found around the
web) and following all the links which are not disallowed in
robots.txt or robots meta tags results in accumulating the set of urls
that make up the site. Just because such a url discovered through
crawling is not in the sitemap does not disqualify it. Any url that is
only in the sitemap yet is not found during crawling will only be
added to the official set as long as it's not disallowed anywhere.
With a confusing robots.txt file like yours, with contradictions like
those above, it is to be expected things just won't work as you intend
them to work.
I'm afraid the site structure is rather a total mess and it's not
possible to untangle it using only the robots.txt file; in this case
it tangles it further.
I found a sitemap index at http://www.kdevelop.org/sitemap.xml .
Downloading each individual sitemap from there (sitemap-translated.xml.gz
and sitemap-mediawiki.xml.gz), I find that unzipping
the sitemap from each (using IZArc) results in garbled content. So
I'm pretty sure that that cannot be the sitemap index you submitted,
or you'd have had errors pertaining to inability to process the
ingredient sitemaps themselves.
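For anyone who wants to check a gzipped sitemap without relying on a desktop archiver, here is a minimal sketch using Python's standard library. The helper name `sitemap_locs` is hypothetical, the sample data is made up, and it assumes the file uses the sitemaps.org 0.9 namespace.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace assumed to be the sitemaps.org 0.9 schema
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(gz_bytes: bytes) -> list:
    """Decompress a .xml.gz sitemap and return the text of every <loc>."""
    xml_bytes = gzip.decompress(gz_bytes)
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]


# Made-up sample standing in for a downloaded sitemap file
sample = gzip.compress(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/index.html?blah=whatever</loc></url>
</urlset>""")

print(sitemap_locs(sample))
```

If `gzip.decompress` raises an error or the XML fails to parse, the file itself is damaged; if it parses cleanly, the garbled output is more likely a problem with the unzip tool.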
> Hi Amilcar,
> Can you give the URL of the sitemap that gives
> those redirect messages, or
> add it to your profile?
> Cristina.
> On Sep 21, 1:08 pm, amilcar wrote:
If the sitemap with the redirect messages is
http://www.kdevelop.org/ sitemap.xml
(I included an empty space not to be a clickable link)
it is a sitemap index file.
I gunzipped the sitemap files OK,
and the URLs with problems are in
sitemap-translated.xml
with the index.html string in the URLs encoded, as Amilcar wrote.
Amilcar,
You do not need to encode in the sitemap
the characters in index.html,
leave them ASCII, index.html
One suggestion is
rename your sitemap,
remove the current sitemap from Google Webmaster Tools
and resubmit the sitemap with the new name,
just in case the redirect errors refer
to a cached older version of your sitemap.
Hmm, I was unable to unzip to a readable file. Maybe my IZArc needs
updating.
Anyway, I ran GsiteCrawler on the site with the given robots.txt file
and it found 1487 urls, one broken.
The xml sitemap it created is available for a short time at:
(I added blanks as I don't want robots to keep downloading it
inadvertently, it's rather big at 230K).
What's in there is all that a robot that crawls the site and obeys
robots.txt will find and crawl - unless GSC is as confused about your
robots.txt file as I am.
If Google Webmaster Tools accepted the sitemap OK
without errors like format error, etc.
then it means that it could gunzip it OK.
I still think that it would be a good idea,
just in case, to re-submit the sitemap
that gives redirect errors
with a new name.
As far as I know Google Webmaster Tools does not
change by itself the name of the URLs listed in the sitemap
(unless the sitemap causes somehow
an unrecognizable parsing error ?)
Hi everyone
In this case it actually is something that we're doing -- we strip
"/index.html" from URLs because that's generally irrelevant and only
makes the URL longer and look more complicated to the user. We do this
when processing the URLs in your Sitemap file so if you *need* to have
"/index.html" in the URLs, they generally won't work like that. At the
moment, there is no solution for using these URLs in Sitemap files if
you need to have "/index.html" in them. I would generally recommend
dropping the "/index.html" part, but I realize that this is sometimes
not easily done.
That said, we will still crawl the website normally, so if those URLs
are reachable through a normal web crawl, we'll still find and index
them normally.
Hope it helps & sorry for not spotting this thread earlier!
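One way to follow the recommendation to drop "/index.html" is to make the short form the official URL on the server, so that a stripped request is answered directly instead of being redirected. The sketch below is an assumption about how such a rule might look; `canonical` and `respond` are hypothetical names, not part of any real server or of Google's tooling.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical(url: str) -> str:
    """Hypothetical official form: no trailing 'index.html', query string kept."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def respond(url: str):
    """Return (status, location): 301 only when the request is not canonical."""
    target = canonical(url)
    if target != url:
        return 301, target
    return 200, None


print(respond("http://www.domain.com/index.html?blah=whatever"))
print(respond("http://www.domain.com/?blah=whatever"))
```

With the redirect pointing in this direction, a client that strips "/index.html" lands on a 200 at the first request, and a client that keeps it is redirected once and then stops.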
I've followed your advice and changed the robots.txt a bit.
The
Allow: /HEAD/doc/api/index.html
was wrong; what I meant was
Allow: /HEAD/doc/api/html/index.html
This way no redirects occur.
I use allow directives because I only want to allow one or two files
in a directory.
If the crawlers do not support it, then it's ok, as long as they do
not ignore the disallow directives.
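The Allow/Disallow combination can be checked locally with Python's `urllib.robotparser`. This is a sketch only: the three-line rule set is a reduced stand-in for the real robots.txt, and "classes.html" is a made-up file name. Python's parser applies the first rule that matches, so the Allow line is placed before the Disallow line; other crawlers may resolve the two rules differently.

```python
from urllib.robotparser import RobotFileParser

# Reduced stand-in for the real robots.txt (assumption)
rules = """\
User-agent: *
Allow: /HEAD/doc/api/html/index.html
Disallow: /HEAD/doc/api/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "http://www.kdevelop.org"
# The single allowed file inside the otherwise disallowed directory
allowed = parser.can_fetch("*", base + "/HEAD/doc/api/html/index.html")
# A hypothetical sibling file that should stay blocked
blocked = parser.can_fetch("*", base + "/HEAD/doc/api/html/classes.html")

print(allowed, blocked)
```

A crawler that ignores Allow entirely would simply treat both URLs as disallowed, which matches the fallback behaviour described above.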