I am working on a couple of library projects in which metadata,
including hyperlinks, is aggregated from a variety of sources and
republished via a portal. The websites which the hyperlinks are drawn
from are not all crawler-friendly, so most of these resources are not
known to Google at all. Unfortunately, the portal software is not
crawler-friendly either, so I've been tasked with investigating how to
expose the aggregated metadata to Google via OAI-PMH.
I have an OAI-PMH server which I have registered with Google Sitemaps,
and in particular I need to disclose URLs with different domain names
than the OAI-PMH server itself, because these links have been
aggregated from a variety of sources. The issue is that Google
Sitemaps rejects any URL whose domain is different from the OAI-PMH
server's domain.
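To make the observed behaviour concrete, here is a minimal Python sketch of the rule as I understand it from the rejections I saw. It is only my guess at the check, not Google's actual logic, and the host names and the partition_by_host helper are made up for illustration.

```python
from urllib.parse import urlsplit


def same_host(url, sitemap_url):
    """True when url has the same host name as the sitemap source
    (here, the OAI-PMH base URL). This is an assumed rule."""
    return urlsplit(url).hostname == urlsplit(sitemap_url).hostname


def partition_by_host(urls, sitemap_url):
    """Split urls into those that would pass the assumed same-host
    rule and those that would be rejected as "foreign"."""
    accepted = [u for u in urls if same_host(u, sitemap_url)]
    rejected = [u for u in urls if not same_host(u, sitemap_url)]
    return accepted, rejected


# Hypothetical example: one local link and one aggregated foreign link.
accepted, rejected = partition_by_host(
    [
        "http://oai.example.org/records/1",
        "http://library.example.net/catalogue/item/1",
    ],
    "http://oai.example.org/oai",
)
```

With these made-up URLs, the first link lands in accepted and the second in rejected, which matches what I saw for my real data.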
Why is this? This restriction does not apply to URLs harvested by
spidering the web itself, so why should sitemaps be any different?
Anyway ... since I'd been expecting this behaviour, I also tested a
work-around: I used my OAI-PMH server to disclose a URL with the same
domain as the OAI-PMH server, but which used an HTTP redirection to
point to a location on a different ("foreign") domain. This URL was
accepted by Google Sitemaps, and I'm now waiting to see whether the
"foreign" content itself is eventually accepted by Googlebot and
indexed.
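For anyone who wants to try the same work-around, this is roughly its shape as a Python sketch using only the standard library. The /goto/ path prefix, the record identifiers, the example URLs and the choice of a 301 (permanent) status are all assumptions made for illustration; my real set-up differs in the details.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from a local record identifier to the
# aggregated "foreign" URL it stands in for.
FOREIGN_LINKS = {
    "rec-001": "http://library.example.net/catalogue/item/1",
    "rec-002": "http://archive.example.com/objects/2",
}

LOCAL_PREFIX = "/goto/"  # assumed path prefix on the OAI-PMH host


def local_url(base, record_id):
    """Build the same-domain URL that is disclosed via OAI-PMH."""
    return base.rstrip("/") + LOCAL_PREFIX + record_id


def resolve(path):
    """Return (status, location) for a request path: a 301 to the
    foreign URL when the record is known, otherwise 404."""
    if path.startswith(LOCAL_PREFIX):
        target = FOREIGN_LINKS.get(path[len(LOCAL_PREFIX):])
        if target is not None:
            return 301, target
    return 404, None


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, location = resolve(self.path)
        self.send_response(status)
        if location is not None:
            self.send_header("Location", location)
        self.end_headers()

    # Answer HEAD requests the same way (no body is sent either way).
    do_HEAD = do_GET


def serve(port=8000):
    """Run the redirector until interrupted."""
    HTTPServer(("", port), RedirectHandler).serve_forever()
```

Here local_url("http://oai.example.org", "rec-001") gives the URL that would be disclosed, and a request for /goto/rec-001 is answered with a redirect to the foreign location.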
If this simple work-around succeeds, I wonder what the point is of
rejecting "foreign" links from OAI-PMH in the first place?
On the other hand, if the work-around fails, and the redirected URL is
rejected by Googlebot (because it uses a redirect?), then how is it
possible to use Google Sitemaps to disclose "foreign" links to Google?
Would I have to re-publish all my hyperlinks in the form of HTML? That
would seem silly to me, since all the technical efficiencies of
OAI-PMH would then be lost.
Regards
Con