Last Modified Meta Tag


MikeHarris

Aug 4, 2009, 9:19:33 AM
to Google Search Appliance/Google Mini
Hi all,


I've done a bit of searching but can't quite find a definitive answer
to my question so I'm hoping someone can help. I've set up my Mini to
look for a meta tag called last-modified and to pull a date out from
it. Initially I had this:

<meta http-equiv="last-modified" content="2009-08-04" />

which worked fine. I looked in the RES->R->FS["VALUE"] field and I get
my date back no problem. But unfortunately my client wanted to add a
time in to the mix. I checked the list of acceptable date formats in
the docs and modified my site code so that the meta tag would be
generated in the following way:

<meta http-equiv="last-modified" content="200908041203" />

However having recrawled the pages the RES->R->FS["VALUE"] field is
now empty. So my question is am I doing something wrong or does the
Mini only support a subset of the accepted formats when pulling them
from a meta tag?

Thanks in advance,


Mike

Joe D'Andrea

Aug 4, 2009, 9:40:13 AM
to Google-Search-...@googlegroups.com
Greetings!

On Tue, Aug 4, 2009 at 9:19 AM, MikeHarris<roundho...@googlemail.com> wrote:

> I checked the list of acceptable date formats in
> the docs and modified my site code so that the meta tag would be
> generated in the following way:
>
> <meta http-equiv="last-modified" content="200908041203" />

That matches "YYYYMMDDHHmm" in the accepted formats list on the latest
Mini revision.

This is an http-equiv, so you would think it is taken "as if" it were in
the HTTP headers. I don't know for sure, but perhaps the Mini isn't
catching this, and it's _only_ looking at the HTTP headers?

Another thought: I've never known http-equiv to be case sensitive, but
- just in case the Mini is doing something different - you might want
to try "Last-modified", "Last-Modified" ... or perhaps try a new meta
element altogether and add a new rule on the "Document Dates" page to
match.

> However having recrawled the pages the RES->R->FS["VALUE"] field is
> now empty.

Hmm. Unless the docs _on_ the Mini are wrong, I'd think this would work.

Q: When you remove the proxystylesheet parameter from your search URI,
can you find a date _anywhere_ in the XML?
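
A minimal sketch of that check, for anyone who would rather script it
than read the raw XML by eye. It assumes the result XML has the
RES->R->FS shape mentioned above, with NAME and VALUE attributes on FS;
the sample document, URLs, and the extract_dates helper are made up for
illustration and are not taken from a real Mini response.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down response; a real one carries many more elements.
SAMPLE_XML = """<GSP VER="3.2">
  <RES SN="1" EN="2">
    <R N="1"><U>http://www.example.com/a.html</U><FS NAME="date" VALUE="2009-08-04"/></R>
    <R N="2"><U>http://www.example.com/b.html</U><FS NAME="date" VALUE=""/></R>
  </RES>
</GSP>"""


def extract_dates(xml_text):
    """Return (url, date) pairs; date is None when FS is absent or empty."""
    root = ET.fromstring(xml_text)
    pairs = []
    for result in root.findall("./RES/R"):
        url = result.findtext("U")
        date = None
        for fs in result.findall("FS"):
            # Only the FS entry named "date" is of interest here.
            if fs.get("NAME") == "date" and fs.get("VALUE"):
                date = fs.get("VALUE")
        pairs.append((url, date))
    return pairs


for url, date in extract_dates(SAMPLE_XML):
    print(url, date or "(no date)")
```

Running something like this against the XML before and after the recrawl
would show whether the date is missing from the index or only from the
stylesheet output.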

Some notable notes from the Google Mini Help Center:

"For the date extracted from the title, text, URL, or meta tag, the
first instance of the most common date format encountered is
considered the date of the document."

"Use meta tags with dates in the ISO-8601 format (YYYY-MM-DD) to avoid
the confusion caused by multiple dates and multiple formats in the
title or text of the documents."
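
For what it's worth, here is a small sketch of generating the tag in
either format from one timestamp, so switching back to the ISO-8601
form is a one-flag change. The last_modified_meta helper and its
with_time flag are hypothetical names; only the two content formats
come from this thread.

```python
from datetime import datetime
from html import escape


def last_modified_meta(modified, with_time=False):
    """Build the last-modified meta tag for a page.

    with_time=False gives YYYY-MM-DD (the format reported as working);
    with_time=True gives YYYYMMDDHHmm (the format that came back empty).
    """
    fmt = "%Y%m%d%H%M" if with_time else "%Y-%m-%d"
    content = modified.strftime(fmt)
    return '<meta http-equiv="last-modified" content="%s" />' % escape(content, quote=True)


stamp = datetime(2009, 8, 4, 12, 3)
print(last_modified_meta(stamp))
print(last_modified_meta(stamp, with_time=True))
```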

--
Joe D'Andrea
Liquid Joe LLC | Google Enterprise Partner
www.liquidjoe.biz | skype:joedandrea | +1 (908) 781-0323

mrliamhennessy

Aug 4, 2009, 12:32:43 PM
to Google Search Appliance/Google Mini
I had a similar problem when a project I worked on needed to use "sort
by date" and needed accuracy down to a minute.
Apparently the GSA accepts dates in YYYYMMDDHHmm format but *ignores*
the hours and minutes.
We found that the GSA could not sort results within a single day, and
so we had to implement a more complex solution using a database.
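
A rough sketch of the idea, not our actual code: keep the full
YYYYMMDDHHmm stamp outside the appliance and re-sort the returned URLs
yourself. The LAST_MODIFIED dict stands in for the database, and all
names and paths here are hypothetical.

```python
from datetime import datetime


def parse_stamp(value):
    """Parse a YYYYMMDDHHmm string into a datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M")


def day_only(value):
    """What is left once hours and minutes are dropped."""
    return parse_stamp(value).date()


# Hypothetical stand-in for the external database of full timestamps.
LAST_MODIFIED = {
    "/news/a.html": "200908040915",
    "/news/b.html": "200908041203",
    "/news/c.html": "200908031730",
}


def sort_newest_first(urls, store=LAST_MODIFIED):
    """Re-sort search results by the full timestamp, newest first."""
    return sorted(urls, key=lambda u: parse_stamp(store[u]), reverse=True)


# a.html and b.html collapse to the same day, so day-level sorting cannot order them.
print(day_only(LAST_MODIFIED["/news/a.html"]) == day_only(LAST_MODIFIED["/news/b.html"]))
print(sort_newest_first(["/news/a.html", "/news/c.html", "/news/b.html"]))
```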

There is a related problem which you may encounter whenever you change
your chosen date format.
The following is taken from the docs and you need to read it very
carefully:
http://code.google.com/apis/searchappliance/documentation/60/admin_crawl/Introduction.html


To enable search results to be sorted and presented based on dates,
the Google Search Appliance extracts dates from documents according to
rules configured by the search appliance administrator.

In Google Search Appliance software version 4.4.68 and later, document
dates are extracted from Web pages when the document is indexed.

The search appliance extracts the first date for a document with a
matching URL pattern that fits the date format associated with the
rule. If a date is written in an ambiguous format, the search
appliance assumes that it matches the most common format among URLs
that match each rule for each domain that is crawled. For this
purpose, a domain is one level above the top level. For example,
mycompany.com is a domain, but intranet.mycompany.com is not a domain.

The search appliance periodically runs a process that calculates which
of the supported date formats is the most common for a rule and a
domain. After calculating the statistics for each rule and domain, the
process may modify the dates in the index. The process first runs 12
hours after the search appliance is installed, and thereafter, every
seven days. The process also runs each time you change the document
date rules.

The search appliance will not change which date is most common for a
rule until after the process has run. Regardless of how often the
process runs, the search appliance will not change the date format
more than once a day. The search appliance will not change the date
format unless 5,000 documents have been crawled since the process last
ran.

If you import a configuration file with new document dates after the
process has first run, then you may have to wait at least seven days
for the dates to be extracted correctly. The reason is that the date
formats associated with the new rules are not calculated until the
process runs. If no dates were found the first time the process ran,
then no dates are extracted until the process runs again.

If no date is found, the search appliance indexes the document without
a date.

Normally, document dates appear in search results about 30 minutes
after they are extracted. In larger indexes, the process can take several
hours to complete because the process may have to look at the contents
of every document.