Wrong <lastmod> date


Preben Nielsen

Feb 21, 2013, 11:35:19 AM
to gsitec...@googlegroups.com
I am using the latest version of GSiteCrawler. All pages are categorized as "daily" in <changefreq>. This is because the <lastmod> in the created sitemap is the date and time for crawling by GSiteCrawler, not the date and time on the server.
In Settings I have enabled "Include date last modified of the URL according to the server." What can be the problem then?

/Preben

webado

Feb 21, 2013, 11:58:12 AM
to gsitec...@googlegroups.com
This means your server does not communicate the last modified header for each url.
--
You received this message because you are subscribed to the Google Groups "SOFTplus GSiteCrawler" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gsitecrawler...@googlegroups.com.
To post to this group, send email to gsitec...@googlegroups.com.
Visit this group at http://groups.google.com/group/gsitecrawler?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 


--
www.webado.net
Webhosting and Design
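
One way to confirm whether the server sends a Last-Modified header is to look at the raw response headers. Below is a minimal sketch, assuming the header lines are available as a plain array such as the one `get_headers()` returns; the function name `findLastModified` and the example.com URL are made up for illustration.

```php
<?php
// Hypothetical helper: find the Last-Modified value in a list of raw
// response header lines, such as the array returned by get_headers($url).
function findLastModified(array $headerLines): ?int
{
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match with stripos().
        if (stripos($line, 'Last-Modified:') === 0) {
            $value = trim(substr($line, strlen('Last-Modified:')));
            $timestamp = strtotime($value);
            return $timestamp === false ? null : $timestamp;
        }
    }
    return null; // the server sent no Last-Modified header
}

// Usage sketch (needs network access; example.com is a placeholder):
// $headers = get_headers('http://example.com/page');
// var_dump($headers === false ? null : findLastModified($headers));
```

A `null` result for a page means a crawler has no server date to work with for that URL.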

Preben Nielsen

Feb 21, 2013, 5:09:12 PM
to gsitec...@googlegroups.com
Yes. And I found out that it is because my pages are now .php and not .htm. I had to add a piece of code to the pages to make it work:
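// Send the modification time of the main script as the Last-Modified header.
// header() must be called before any output is sent.
$mod = gmdate("D, d M Y H:i:s ", getlastmod()) . "GMT";
header("Last-Modified: $mod");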

The date/time is now correct. The <changefreq> still says daily for all pages. I suppose that will change as time goes by?

/Preben

webado

Feb 21, 2013, 8:08:29 PM
to gsitec...@googlegroups.com
You can change the frequency and priority yourself; they don't change on their own.

Preben Nielsen

Feb 22, 2013, 12:24:35 PM
to gsitec...@googlegroups.com
Do I remember wrongly that after a certain number of days "daily" changed automatically to "weekly", and the same with "monthly"? I don't remember ever having changed these things manually. Priority yes, of course, but not daily, weekly and monthly.

Preben Nielsen

Feb 22, 2013, 3:41:38 PM
to gsitec...@googlegroups.com
In the URL list in the programme a few of the Freq. values have actually changed without my interference. But these files are all old .pdf and .xls files. All URLs of .php files have the Freq. 1.
So Freq. changes automatically, but for some reason not for my .php files. I don't know whether the fact that the URLs of all .php files have no extension has any influence here.

webado

Feb 22, 2013, 4:18:37 PM
to gsitec...@googlegroups.com
The server has to respond with that lastmod. Natively this does not happen for php.



Preben Nielsen

Feb 23, 2013, 5:24:23 AM
to gsitec...@googlegroups.com
Could you go into a bit more detail on this? I do not understand.

After adding the above-mentioned piece of code, the server actually does communicate the <lastmod> date correctly, which can be seen in the generated sitemap.xml. But the date does not influence the <changefreq> (which is always 1 in the sitemap.xml) which I suppose is the same thing as one can see in the software's Freq. under URL list, where all URLs referring to php-files have the Freq. 1

webado

Feb 23, 2013, 9:24:49 AM
to gsitec...@googlegroups.com
Kind of what I said.
Unless you specifically add code to send a header that communicates the Last-Modified date, GSiteCrawler assumes today's date, as does Googlebot, because it receives no Last-Modified header in responses to requests for php files. That's just how the server responds for non-static files, because anything that executes php scripting has the potential of including a lot of content that is dynamically generated. So natively a php script does not respond with a Last-Modified date. You have to force it by adding that header with a date that you either set manually to a specific date, or compute as an aggregate of the last-modified dates of each of the major components that are included in that php script and of the script itself.

So now that you have added it, the lastmod will be set to whatever you tell it to be in the script that adds that header.

GSiteCrawler will keep the originally set frequency value unless you actually delete all the links from the previously discovered URL list and start over with a fresh crawl. Then, for urls that come through with a lastmod date set in the past, it will determine some frequency other than daily. But I'd not rely on that and would always set it manually for each url or group of urls.
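
The "aggregate" approach described above could look roughly like the sketch below. It assumes the page is assembled from files pulled in with include/require, so the newest modification time among those files can stand in for the page's date. The helper name `newestModification` is hypothetical, and content read from a database is not covered and would need its own timestamp.

```php
<?php
// Hypothetical helper: newest modification time among a set of files.
function newestModification(array $paths): ?int
{
    $newest = null;
    foreach ($paths as $path) {
        $mtime = @filemtime($path); // false if the file cannot be read
        if ($mtime !== false && ($newest === null || $mtime > $newest)) {
            $newest = $mtime;
        }
    }
    return $newest;
}

// get_included_files() lists the main script plus everything pulled in so
// far via include/require, so call this after the includes have run.
$lastModified = newestModification(get_included_files());

// header() only works before any output has been sent.
if ($lastModified !== null && !headers_sent()) {
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
}
```

Compared with `getlastmod()`, which only looks at the main script, this also picks up an edit to a shared include such as a menu or footer file.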

Preben Nielsen

Feb 24, 2013, 5:58:12 AM
to gsitec...@googlegroups.com
Thank you for your detailed answer.
You are right: after deleting the non-manual links in the URL list, frequencies turned up, apart from urls which are database-generated. They still show "1". That could probably be fixed too. But before trying to find out how to do that, I'm asking myself how useful it is in relation to how the pages are treated by Google. For the moment I feel tempted to simply leave out both <lastmod> and <changefreq>. For the last 3 1/2 months (after changing from .htm to .php pages) <lastmod> has been "1", apparently without any disadvantage, as pages have been indexed properly and quickly throughout this period.

webado

Feb 24, 2013, 8:06:26 AM
to gsitec...@googlegroups.com
In theory at least the frequency provides a hint to Google how often a
particular page is expected to change. Whether it makes any use of that
or not I couldn't tell you. Maybe not because it is so often misleading
- especially if you see a lastmod date way in the past with a frequency
of daily, that's not quite helpful.
An accurate lastmod can certainly help Googlebot "decide" what is worth
re-crawling. No point recrawling something that's not been modified since
the last time it was crawled. But always setting the lastmod to today
when it's not been actually modified doesn't help either because after a
couple of rounds of crawling Googlebot figures it out as being
unreliable and ignores it altogether.

If unable to provide accurate lastmod and frequency values you might
as well leave them out. If they are left out for all urls the benefit of
using an xml sitemap as opposed to a plain text list of urls disappears.

Christina
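
For anyone generating the sitemap by hand rather than with GSiteCrawler, here is a minimal sketch of leaving the optional tags out when no reliable value is available. The function `sitemapUrlEntry` and the example.com URLs are hypothetical; only `<loc>` is required in a `<url>` entry, while `<lastmod>`, `<changefreq>` and `<priority>` are optional.

```php
<?php
// Hypothetical helper (not part of GSiteCrawler): build one <url> entry and
// write each optional tag only when a value is actually known.
function sitemapUrlEntry(
    string $loc,
    ?int $lastmod = null,       // Unix timestamp, or null if unknown
    ?string $changefreq = null, // e.g. "monthly", or null to omit
    ?float $priority = null     // 0.0 to 1.0, or null to omit
): string {
    $xml  = "  <url>\n";
    $xml .= '    <loc>' . htmlspecialchars($loc, ENT_XML1 | ENT_QUOTES, 'UTF-8') . "</loc>\n";
    if ($lastmod !== null) {
        // W3C datetime in UTC, as used by the sitemap protocol.
        $xml .= '    <lastmod>' . gmdate('Y-m-d\TH:i:s\Z', $lastmod) . "</lastmod>\n";
    }
    if ($changefreq !== null) {
        $xml .= '    <changefreq>' . $changefreq . "</changefreq>\n";
    }
    if ($priority !== null) {
        // Force a decimal point regardless of locale.
        $xml .= '    <priority>' . number_format($priority, 1, '.', '') . "</priority>\n";
    }
    return $xml . "  </url>\n";
}

echo sitemapUrlEntry('http://example.com/');
echo sitemapUrlEntry('http://example.com/news', null, null, 0.8);
```

The first call writes a bare `<loc>` entry; the second adds only a `<priority>`.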

Preben Nielsen

Feb 24, 2013, 9:47:08 AM
to gsitec...@googlegroups.com
Well, in the xml file there's still the <priority>, which is not in the text list with just urls. Though who knows if it's useful...
As my pages stay pretty much the same once published, I presume that in my case the most important function of the sitemap is to tell Google that new pages have been added, hopefully with the effect of getting them indexed quicker. So I don't mind leaving it up to them when they want to recrawl. If I want a page recrawled after major changes, I can ask for this in Google Webmaster Tools by resubmitting the page to the index.

Thanks again.

/Preben

webado

Feb 24, 2013, 10:05:22 AM
to gsitec...@googlegroups.com
The priority parameter is yet another one that's not likely to be used,
also because it's not reliable. When people set everything to 1 or 0.9
or something like that it's safe to completely ignore it.

Christina

Preben Nielsen

Feb 24, 2013, 10:58:59 AM
to gsitec...@googlegroups.com
You're probably right.
I've set about 30 (out of 2250) urls to between 0.6 and 0.9, and just the base url to 1.0. The rest I've left at 0.5. I guess it can't do any harm, as it is a serious attempt to rank the pages by importance.

/Preben

webado

Feb 24, 2013, 11:27:56 AM
to gsitec...@googlegroups.com
That's fine.

I set mine to 1.0, .75, .5 and .25 more or less according to my
assessment of importance.
If I have pages that are almost as important as the homepage they will
be 0.9 .

Mind you nothing says the homepage is really the most important page, it
depends on the site.

Christina

Preben Nielsen

Feb 24, 2013, 5:38:54 PM
to gsitec...@googlegroups.com
Good point about the homepage. I'll consider whether I should lower the priority for that page.

I have just 2 pages set to 0.9 (pages linking to the rest of the content), 3 set to 0.8 and the rest below. If everything is important, nothing is! Setting all urls to 1 is what we in Danish would call shooting oneself in the foot.

/Preben

webado

Feb 24, 2013, 5:49:01 PM
to gsitec...@googlegroups.com
One page should still have a priority of 1. It's up to you which one you
consider your most important page.
For example, if the homepage is just a splash page, asking you to click
to enter the site, obviously that would not be the most important page.

And yes, all priorities set to 1 I believe is a good representation
of shooting oneself in the foot :) Luckily Google doesn't seem to take
the priority into consideration - perhaps when it doesn't appear to make
sense.

Christina

Preben Nielsen

Mar 1, 2013, 4:15:05 PM
to gsitec...@googlegroups.com
Right. Until I come up with something cleverer I'll stick with the homepage set to 1, as it always carries information about topical content and leads to it.
Thanks for the great help and the conversation about the issues connected to my question.

/Preben