Discussions > Sitemap Protocol > Sitemap URLs not followed faithfully - bug
From: edralph888
Date: Mon, 21 Jul 2008 09:46:07 -0700 (PDT)
Local: Mon, Jul 21 2008 12:46 pm
Subject: Sitemap URLs not followed faithfully - bug
Hi.
We have implemented a URL validation step when we process a response
to make sure that when people call a page they use the correct URL.
If they use an incorrect URL, then they are sent a 301 redirect with
the correct URL.

The URL in our sitemap is in the format:
http://www.domain.com/index.html?whatever=value

The problem with Googlebot is that even though that is the URL we put
in the sitemap, it doesn't use that URL to make the request - it
contracts it down to:
http://www.domain.com/?whatever=value

So our server sees this 'incorrect' URL, issues a 301 with the
'correct' URL (that has the index.html bit in it), but then Googlebot
doesn't follow that URL faithfully and again tries to request the URL
without index.html in the path.  So our server again issues a 301
redirect, with the correct URL and here we go off on our infinite
loop.

So no wonder we get the error message:
URLs not followed.... [sitemap] contained too many redirects.

I think this is a bug: the 301 redirect clearly sends the redirect
URL, and if Googlebot followed this redirect URL faithfully then we
wouldn't see this issue.

I'd appreciate a response from the Google team, Vanessa?

Thanks
Ed
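
The loop described in this post can be reproduced with a small simulation. This is only a sketch under assumptions: `server_response` stands in for the URL-validating server, and `strip_index` is a hypothetical model of a crawler that drops a trailing index.html before every request. Neither is taken from real server or Googlebot code.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_PATH = "/index.html"  # assumption: the only path the server accepts


def server_response(url):
    """Return (status, location) the way the validating server is described."""
    parts = urlsplit(url)
    if parts.path == CANONICAL_PATH:
        return 200, None
    # Any other spelling of the page gets a 301 to the canonical URL.
    return 301, urlunsplit(parts._replace(path=CANONICAL_PATH))


def strip_index(url):
    """Hypothetical crawler-side rewrite: drop a trailing index.html."""
    parts = urlsplit(url)
    if parts.path.endswith("/index.html"):
        parts = parts._replace(path=parts.path[: -len("index.html")])
    return urlunsplit(parts)


def crawl(url, rewrite, max_redirects=5):
    """Follow redirects, applying `rewrite` to each URL before requesting it."""
    for hops in range(max_redirects + 1):
        status, location = server_response(rewrite(url))
        if status == 200:
            return "ok", hops
        url = location
    return "too many redirects", max_redirects


sitemap_url = "http://www.domain.com/index.html?whatever=value"
print(crawl(sitemap_url, strip_index))       # rewriting crawler never gets a 200
print(crawl(sitemap_url, lambda u: u))       # faithful crawler succeeds at once
```

With the rewrite applied on every hop, the 301 target is altered again before it is requested, so the redirect budget runs out. A crawler that requests the Location value unchanged reaches the page.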


 
From: webado
Date: Mon, 21 Jul 2008 10:06:34 -0700 (PDT)
Local: Mon, Jul 21 2008 1:06 pm
Subject: Re: Sitemap URLs not followed faithfully - bug
Googlebot doesn't invent urls. It found that url somewhere. It doesn't
only crawl urls you have in the sitemap, but also all urls found
anywhere on the web (including on your site) that appear to be on
your domain.

That said, you may be having other navigation problems.

What's the url?

On Jul 21, 12:46 pm, edralph888 wrote:


 
From: edralph888
Date: Tue, 22 Jul 2008 06:35:15 -0700 (PDT)
Local: Tues, Jul 22 2008 9:35 am
Subject: Re: Sitemap URLs not followed faithfully - bug
I have to disagree.

Our sitemap has all the urls programmatically created in the format:

http://www.domain.com/index.html?param=whatever

However, the sitemap 'errors and warnings' shows this:

HTTP Error:
Found: 301 (Moved permanently)

http://www.domain.com/?param=whatever1
http://www.domain.com/?param=whatever2
http://www.domain.com/?param=whatever3
http://www.domain.com/?param=whatever4
http://www.domain.com/?param=whatever5
Jul 20, 2008

Googlebot has stripped out the index.html - presumably because it
thinks it can get away with not using the default page in the path.
But this is wrong...
Because this is in the context of the sitemap, it isn't going to
report on errors or problems with urls discovered elsewhere on the
web.

On Jul 21, 6:06 pm, webado wrote:


 
From: Phil Payne
Date: Wed, 23 Jul 2008 03:14:51 -0700 (PDT)
Local: Wed, Jul 23 2008 6:14 am
Subject: Re: Sitemap URLs not followed faithfully - bug
On Jul 21, 6:06 pm, webado wrote:

> Googlebot doesn't invent urls.

Oh, yes - it bloody does.  Not in this case, however.

The key thing is that you cannot redirect a URL that appears in a
sitemap.


 
From: webado
Date: Wed, 23 Jul 2008 04:06:21 -0700 (PDT)
Local: Wed, Jul 23 2008 7:06 am
Subject: Re: Sitemap URLs not followed faithfully - bug
No. It finds them invented elsewhere, I'm 100% sure of that.
Most of the time it is due to broken navigation on the same site,
coupled with improper server responses.
Other times it's broken links found elsewhere coupled with improper
server response.

On Jul 23, 6:14 am, Phil Payne wrote:


 
From: webado
Date: Wed, 23 Jul 2008 04:07:39 -0700 (PDT)
Local: Wed, Jul 23 2008 7:07 am
Subject: Re: Sitemap URLs not followed faithfully - bug
It's NOT the sitemap that decides what gets crawled and indexed. It's
the site itself, its navigation and robots directives.

On Jul 22, 9:35 am, edralph888 wrote:


 
From: Phil Payne
Date: Wed, 23 Jul 2008 04:13:04 -0700 (PDT)
Local: Wed, Jul 23 2008 7:13 am
Subject: Re: Sitemap URLs not followed faithfully - bug
On Jul 23, 12:06 pm, webado wrote:

> No. It finds them invented elsewhere, I'm 100% sure of that.

And once again, Christina - if that were the case they would be found
by others elsewhere.

But they NEVER are.


 
From: Phil Payne
Date: Wed, 23 Jul 2008 04:16:25 -0700 (PDT)
Local: Wed, Jul 23 2008 7:16 am
Subject: Re: Sitemap URLs not followed faithfully - bug
On Jul 23, 12:07 pm, webado wrote:

> It's NOT the sitemap that decides what gets crawled and indexed. It's
> the site itself, its navigation and robots directives.

https://www.google.com/webmasters/tools/docs/en/protocol.html

"Sitemaps are particularly beneficial when users can't reach all areas
of a website through a browseable interface. (Generally, this is when
users are unable to reach certain pages or regions of a site by
following links). For example, any site where certain pages are only
accessible via a search form would benefit from creating a Sitemap and
submitting it to search engines."

That's what Google says.


 
From: webado
Date: Wed, 23 Jul 2008 04:19:47 -0700 (PDT)
Local: Wed, Jul 23 2008 7:19 am
Subject: Re: Sitemap URLs not followed faithfully - bug
We can argue on this until the cows come home.

Are you forgetting the fly-by-night mock-directories of spam with
invented links to other sites? Now you see it, now you don't? They
existed just at the time they were crawled by Googlebot?

How about hacked sites with tons of links added? Or extra pages which
eventually get cleaned out?

On Jul 23, 7:13 am, Phil Payne wrote:


 
From: edralph
Date: Fri, 25 Jul 2008 02:12:16 -0700 (PDT)
Local: Fri, Jul 25 2008 5:12 am
Subject: Re: Sitemap URLs not followed faithfully - bug
Thanks for the posts guys, but neither of you are addressing the point
of my original post which can be summed up thus:

When following a non-redirected url in the format www.domain.com/index.html?blah=whatever,
Googlebot makes the request to this page by contracting the url to
this: www.domain.com/?blah=whatever.  I know that this isn't due to
Googlebot picking up this url from elsewhere (which would have
explained it fine) because the specific error occurs in the sitemap
section of webmaster tools, reporting specifically on what Googlebot
found when crawling sitemap urls.

Now, just assume for a moment that what I said is true, and Googlebot
does omit index.html when crawling sitemap urls.

My webserver is configured to return a 301 with the official URL when
a page is called with the incorrect URL.  As Googlebot is requesting
these pages with the index.html omitted, it is an incorrect URL and it
issues a 301 with the official URL containing 'index.html'.  But -
this is why the sitemap reports an error - Googlebot proceeds to
follow the 301 redirect by *again* omitting 'index.html' and therefore
we end up in an infinite loop.

Does this make sense?

On Jul 23, 12:19 pm, webado wrote:


 
From: edralph
Date: Tue, 26 Aug 2008 02:36:27 -0700 (PDT)
Local: Tues, Aug 26 2008 5:36 am
Subject: Re: Sitemap URLs not followed faithfully - bug
Further to this, there has been a reply from a Googler in another
thread - basically he says that it is 'usually fine' to strip
index.html from requests - but clearly it isn't.  I haven't checked
but I'm pretty sure it isn't RFC compliant and it is causing many
people a headache.  Please stop stripping index.html!!

http://groups.google.com/group/Google_Webmaster_Help-Sitemap/browse_t...

On Jul 25, 10:12 am, edralph wrote:


 
From: webado
Date: Tue, 26 Aug 2008 04:21:25 -0700 (PDT)
Local: Tues, Aug 26 2008 7:21 am
Subject: Re: Sitemap URLs not followed faithfully - bug
It's not Googlebot doing any stripping of anything.
It's your sitemap that has them, and/or your server that does it, and/or
such links are found elsewhere on the web.

Please give the url.

On Aug 26, 5:36 am, edralph wrote:


 
From: webado
Date: Tue, 26 Aug 2008 04:28:25 -0700 (PDT)
Local: Tues, Aug 26 2008 7:28 am
Subject: Re: Sitemap URLs not followed faithfully - bug
On a properly built website it is not only fine but recommended to
strip index.html (or index.php or any other index) from the url of
the homepage ONLY, in other words when there are no query string
parameters involved. That's when http://www.example.com/ has the same
content as http://www.example.com/index.html or http://www.example.com/index.htm
or http://www.example.com/index.php or http://www.example.com/index.asp
or whatever. Of course you may also do it when there are query strings
attached, but in that case you need to be consistent and always do
it.
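
As a rough illustration of the "homepage only" rule described above, the check below redirects an index page to the root only when no query string is present. The function name and the list of index file names are assumptions made for this sketch, not part of any server's configuration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of default documents, taken from the examples in the post.
INDEX_NAMES = ("index.html", "index.htm", "index.php", "index.asp")


def homepage_redirect_target(url):
    """Return the URL to 301 to, or None if the request should be served as-is."""
    parts = urlsplit(url)
    if parts.query:
        # A query string is present: leave the URL alone so that
        # /index.html?x=y keeps a single, stable form.
        return None
    if parts.path.lstrip("/") in INDEX_NAMES:
        return urlunsplit(parts._replace(path="/"))
    return None


print(homepage_redirect_target("http://www.example.com/index.html"))
print(homepage_redirect_target("http://www.example.com/index.html?filename=a"))
```

The first call yields the root URL and the second yields None, so the rule cannot collide with a scheme that requires index.html on query-string pages.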

I wonder if your implementation of the redirection to root based on
the presence of index.html in the url isn't faulty and it ends up
applying also to cases where you have a query string.

You need to provide the url in any case.

On Aug 26, 7:21 am, webado wrote:


 
From: amilcar
Date: Sun, 31 Aug 2008 17:35:25 -0700 (PDT)
Local: Sun, Aug 31 2008 8:35 pm
Subject: Re: Sitemap URLs not followed faithfully - bug
This is clearly a buggy implementation on Google's part.

I do follow all their rules:

Unique homepage entry point without index.html
http://www.kdevelop.org/

Content pages with unique URLs :
http://www.kdevelop.org/index.html?filename=.....

Redirects (301) to enforce the URL scheme described above.

The sitemap also uses all these rules.

Nevertheless the google scheduler strips the "index.html" part and the
crawler then errors out with messages about "too many redirects".

I reported this back in March 2008. But the issue is that
nobody understands that this is a bug.

All those clever heads working at google, and none recognizes this
bug... it's a pity.


 
From: amilcar
Date: Sun, 31 Aug 2008 17:38:05 -0700 (PDT)
Local: Sun, Aug 31 2008 8:38 pm
Subject: Re: Sitemap URLs not followed faithfully - bug
The coolest thing is that my site-ranking also dropped after
submitting my CORRECT sitemap.
Thanks :)

 
From: amilcar
Date: Sun, 31 Aug 2008 17:45:39 -0700 (PDT)
Local: Sun, Aug 31 2008 8:45 pm
Subject: Re: Sitemap URLs not followed faithfully - bug
This is the warning I get:

URLs not followed
When we tested a sample of the URLs from your Sitemap, we found that
some URLs were not accessible to Googlebot because they contained too
many redirects. Please change the URLs in your Sitemap that redirect
and replace them with the destination URL (the redirect target). All
valid URLs will still be submitted

Well guess what: My sitemap already has all the "destination URLs (the
redirect target)"; it is the google scheduler that changes the URLs and
passes the crippled URLs to the crawler.

webado: please accept it as it is, it IS a google bug, not a
feature :)


 
From: amilcar
Date: Tue, 2 Sep 2008 09:11:28 -0700 (PDT)
Local: Tues, Sep 2 2008 12:11 pm
Subject: Re: Sitemap URLs not followed faithfully - bug
Bumping this thread in the hope of getting a response.

 
From: amilcar
Date: Sun, 21 Sep 2008 05:08:32 -0700 (PDT)
Local: Sun, Sep 21 2008 8:08 am
Subject: Re: Sitemap URLs not followed faithfully - bug
Google seems to do a "search and replace" for "index.html" with "".
So I decided to encode the "index.html" substring of my sitemap urls
in the form of: & #105;& #110;& #100;& #101;& #120;& #046;& #104;& #116;& #109;& #108;

But no luck, I guess that google first decodes the URLs into utf-8 and
then does the buggy string replacement.

Is there a way to avoid the string replacement? Some info in the xml
file maybe?

Thanks
Amilcar
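
The reason the encoding attempt cannot help is visible with any XML parser: numeric character references are resolved during parsing, before the URL text is handed on. A minimal sketch, assuming the intended encoding was the references without the spaces shown above, and using a made-up example.com URL:

```python
import xml.etree.ElementTree as ET

# "index.html" written as decimal character references (assumed form,
# with the spaces from the post removed).
encoded = "&#105;&#110;&#100;&#101;&#120;&#046;&#104;&#116;&#109;&#108;"

sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    f"<url><loc>http://www.example.com/{encoded}?filename=a</loc></url>"
    "</urlset>"
)

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
loc = ET.fromstring(sitemap).find("sm:url/sm:loc", ns).text
print(loc)  # the parser hands back the plain-text URL
```

Any consumer that parses the sitemap as XML sees the literal string index.html, so a later string-level rewrite applies to it just as it would to the unencoded form.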


 
From: cristina
Date: Sun, 21 Sep 2008 08:20:43 -0700 (PDT)
Local: Sun, Sep 21 2008 11:20 am
Subject: Re: Sitemap URLs not followed faithfully - bug
Hi Amilcar,
Can you give the URL of the sitemap that gives
those redirect messages, or
add it to your profile?

Cristina.

On Sep 21, 1:08 pm, amilcar wrote:


 
From: webado
Date: Sun, 21 Sep 2008 09:06:21 -0700 (PDT)
Local: Sun, Sep 21 2008 12:06 pm
Subject: Re: Sitemap URLs not followed faithfully - bug
I see the robots.txt is invalid and, at least to me, confusing.

You have the sitemap line above the robots directives. It should be
at the very bottom, separated by at least one blank line from the last
directive.

Then you have things like:

Disallow: /phorum5/
Allow:    /phorum5/index.php
Allow:    /phorum5/read.php

and

Disallow: /HEAD/doc/api/
Allow:    /HEAD/doc/api/index.html

and other similar things.  I can believe that confuses the bejesus out
of robots.

You are aware that Allow is not part of the robots protocol; it is
only understood by Googlebot (maybe other robots as well, but not
definitely).

But both Disallow and Allow are prefix driven.

Look what happens here:
http://web-sniffer.net/?url=http%3A%2F%2Fwww.kdevelop.org%2FHEAD%2Fdo...

Trying to access
http://www.kdevelop.org/HEAD/doc/api/index.html

which is presumably allowed, results in a meta refresh redirection to
html/ which translates to
http://www.kdevelop.org/HEAD/doc/api/html/

and which, according to the robots.txt directive
Disallow: /HEAD/doc/api/

is disallowed.

You got to agree this is confusing in the extreme.
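
The prefix behaviour can be checked with a few lines of code. This is a sketch that assumes longest-matching-prefix precedence with Allow winning ties, which is how Google documents its own handling; other crawlers may resolve the conflict differently, and wildcards are not handled here. The rule list copies the two directives quoted above.

```python
# (kind, prefix) pairs copied from the robots.txt excerpt in this post.
RULES = [
    ("disallow", "/HEAD/doc/api/"),
    ("allow", "/HEAD/doc/api/index.html"),
]


def is_allowed(path, rules=RULES):
    """Longest matching prefix wins; on equal length, allow beats disallow."""
    best = None
    for kind, prefix in rules:
        if path.startswith(prefix):
            key = (len(prefix), kind == "allow")
            if best is None or key > best:
                best = key
    # No rule matched: crawling is permitted by default.
    return True if best is None else best[1]


print(is_allowed("/HEAD/doc/api/index.html"))  # the explicitly allowed file
print(is_allowed("/HEAD/doc/api/html/"))       # the meta refresh target
```

Under that assumption the index.html file itself is allowed, while the html/ directory it forwards to falls under the Disallow prefix.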

I haven't checked other situations.

You also have to keep in mind that the sitemap is only an accessory, it
does not define the set of urls belonging to the website. Crawling the
website starting at the root (and/or any other links found around the
web) and following all the links which are not disallowed in
robots.txt or robots meta tags results in accumulating the set of urls
that make up the site. Just because such a url discovered through
crawling is not in the sitemap does not disqualify it. Any url that is
only in the sitemap yet is not found during crawling will only be
added to the official set as long as it's not disallowed anywhere.

With a confusing robots.txt file like yours, with contradictions like
those above, it is to be expected things just won't work as you intend
them to work.

I'm afraid the site structure is rather a total mess and it's not
possible to untangle it using only the robots.txt file, in this case
it tangles it further.

I found a sitemap index at http://www.kdevelop.org/sitemap.xml .
Downloading each individual sitemap from there, sitemap-translated.xml.gz
and sitemap-mediawiki.xml.gz, I find that un-zipping
the sitemap from each (using IZArc) results in garbled content.  So
I'm pretty sure that that cannot be the sitemap index you submitted,
or you'd have had errors pertaining to inability to process the
ingredient sitemaps themselves.

On Sep 21, 11:20 am, cristina wrote:


 
From: cristina
Date: Sun, 21 Sep 2008 09:45:24 -0700 (PDT)
Local: Sun, Sep 21 2008 12:45 pm
Subject: Re: Sitemap URLs not followed faithfully - bug
If the sitemap with the redirect messages is
http://www.kdevelop.org/ sitemap.xml
(I included an empty space not to be a clickable link)
it is a sitemap index file.
I gunzipped the sitemap files OK,
and the URLs with problems are in
sitemap-translated.xml
with the index.html string in the URLs encoded, as Amilcar wrote.

Amilcar,
You do not need to encode in the sitemap
the characters in index.html,
leave them ASCII, index.html

One suggestion is
rename your sitemap,
remove the current sitemap from Google Webmaster Tools
and resubmit the sitemap with the new name,
just in case the redirect errors refer
to a cached older version of your sitemap.

Cristina.

On Sep 21, 1:08 pm, amilcar wrote:


 
From: webado
Date: Sun, 21 Sep 2008 10:23:00 -0700 (PDT)
Subject: Re: Sitemap URLs not followed faithfully - bug
Hmm, I was unable to unzip to a readable file. Maybe my IZArc needs
updating.

Anyway, I ran GsiteCrawler on the site with the given robots.txt file
and it found 1487 urls, one broken.
The xml sitemap it created is available for a short time at:

http://  widget.webado.com/kdevelop/kdevelop-sitemap.xml

More legible  than the .gz version, mind you.

(I added blanks as I don't want robots to keep downloading it
inadvertently, it's rather big at 230K).

What's in there is all that a robot that crawls the site and obeys
robots.txt will find and crawl - unless GSC is as confused about your
robots.txt file as I am.

On Sep 21, 12:45 pm, cristina wrote:


 
From: cristina
Date: Sun, 21 Sep 2008 10:54:36 -0700 (PDT)
Local: Sun, Sep 21 2008 1:54 pm
Subject: Re: Sitemap URLs not followed faithfully - bug
If Google Webmaster Tools accepted the sitemap OK
without errors like format error, etc.
then it means that it could gunzip it OK.

I still think that it would be a good idea,
just in case, to re-submit the sitemap
that gives redirect errors
with a new name.

As far as I know Google Webmaster Tools does not
change by itself the name of the URLs listed in the sitemap
(unless the sitemap causes somehow
an unrecognizable parsing error ?)

Cristina.


 
From: JohnMu (Google employee)
Date: Mon, 22 Sep 2008 03:17:51 -0700 (PDT)
Local: Mon, Sep 22 2008 6:17 am
Subject: Re: Sitemap URLs not followed faithfully - bug
Hi everyone
In this case it actually is something that we're doing -- we strip
"/index.html" from URLs because that's generally irrelevant and only
makes the URL longer and look more complicated to the user. We do this
when processing the URLs in your Sitemap file so if you *need* to have
"/index.html" in the URLs, they generally won't work like that. At the
moment, there is no solution for using these URLs in Sitemap files if
you need to have "/index.html" in them. I would generally recommend
dropping the "/index.html" part, but I realize that this is sometimes
not easily done.

That said, we will still crawl the website normally, so if those URLs
are reachable through a normal web crawl, we'll still find and index
them normally.

Hope it helps & sorry for not spotting this thread earlier!

John
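
Given that explanation, one practical step is to list the Sitemap entries that would be affected. The sketch below assumes that "affected" means a path ending in /index.html (the exact matching rule is not spelled out in this thread); the function name and sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_with_index_html(sitemap_xml):
    """Return the <loc> values whose path ends in /index.html."""
    root = ET.fromstring(sitemap_xml)
    flagged = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        if urlsplit(url).path.endswith("/index.html"):
            flagged.append(url)
    return flagged


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://www.example.com/</loc></url>"
    "<url><loc>http://www.example.com/index.html?filename=a</loc></url>"
    "</urlset>"
)
print(urls_with_index_html(sample))
```

Entries flagged this way are the ones to serve without a redirect at their shortened form, or to rely on normal crawling for, as described above.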


 
From: amilcar
Date: Fri, 26 Sep 2008 04:47:56 -0700 (PDT)
Local: Fri, Sep 26 2008 7:47 am
Subject: Re: Sitemap URLs not followed faithfully - bug
Thanks webado.

I've followed your advice and changed the robots.txt a bit.
The
Allow:    /HEAD/doc/api/index.html
was wrong what I meant was
Allow:    /HEAD/doc/api/html/index.html

This way no redirects occur

I use allow directives because I only want to allow one or two files
in a directory.
If the crawlers do not support it, then it's ok, as long as they do
not ignore the disallow directives.

Thanks,
Amilcar


 
Messages 1 - 25 of 28