Can GSA crawl XML sitemaps?


frankadelic

Dec 28, 2009, 1:46:30 PM
to Google Search Appliance/Google Mini
Can Google Search Appliance crawl and index XML sitemaps built using
the standard Sitemap format?

http://www.sitemaps.org/protocol.php

Or is it better to use HTML jump page(s) for GSA to index?

To give some context, I have a site with about 70,000 pages that need
to be indexed. Because of the use of Ajax, many of the URLs are not
directly reachable by GSA.

Also, I am running GSA v6.

Thanks

JMarkham

Dec 30, 2009, 10:02:36 AM
to Google Search Appliance/Google Mini
Greetings,

The GSA cannot crawl the contents of an XML file directly; it can only
index the XML documents themselves. Your two options are to build an
HTML jump page or to turn your XML sitemap into a feed. Information on
XML feeds is here:
http://code.google.com/apis/searchappliance/documentation/60/feedsguide.html#system

If you're on version 6.2, then replace the /60/ in the URL with /62/.
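
To illustrate the feed option, here is a minimal Python sketch that reads a standard sitemap and writes a feed listing the same URLs. The `gsafeed` layout, the `metadata-and-url` feed type, and the `web` datasource name are written from memory of the feeds guide linked above and should be treated as assumptions to check against the guide for your version; the function names are purely illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Namespace used by the standard Sitemap protocol (sitemaps.org).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def build_feed(urls, datasource="web"):
    """Build a feed document with one <record> per URL.

    Assumption: element names and the "metadata-and-url" feed type follow
    the GSA feeds guide; verify them before pushing to an appliance.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">',
        "<gsafeed>",
        "  <header>",
        "    <datasource>%s</datasource>" % escape(datasource),
        "    <feedtype>metadata-and-url</feedtype>",
        "  </header>",
        "  <group>",
    ]
    # quoteattr() adds the surrounding quotes and escapes &, <, > in the URL.
    lines += ['    <record url=%s mimetype="text/html"/>' % quoteattr(u) for u in urls]
    lines += ["  </group>", "</gsafeed>"]
    return "\n".join(lines) + "\n"
```

Usage would be along the lines of `build_feed(sitemap_urls(open("sitemap.xml", "rb").read()))`, with the result pushed to the appliance as the feeds guide describes.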

Hope that helps,

Jeff

angel

Jun 12, 2013, 2:37:06 AM
to Google-Search-...@googlegroups.com
Hi,

If we include a sitemap.xml in robots.txt, can the GSA crawler understand it?

Regards
Angel

Dave Watts

Jun 13, 2013, 1:44:01 PM
to Google-Search-...@googlegroups.com
> If we include a sitemap.xml in robots.txt, can the GSA crawler understand
> it?

I don't think the GSA can consume sitemaps very well - which is kind
of odd, since it can create them for you. And, I don't think it'll
follow anything you put in robots.txt.

You could point the GSA directly to the sitemap, but as a rule the GSA
doesn't handle XML files very well, because it doesn't have a standard
way to identify hyperlinks or other contextual information - XML files
are just raw data. And it's almost certainly not going to do anything
with the other information in your sitemap.

If you're using an automated process to build the sitemap, though, you
might be able to use this same process to build a simple HTML page
with all the links, and give that to the GSA.
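
As a rough sketch of that idea in Python: given the same URL list the sitemap is built from, the function below writes plain HTML link pages. Splitting the list into several pages and the `per_page` value of 1000 are assumptions made for illustration (the original poster mentioned about 70,000 URLs), not a documented GSA limit, and `build_jump_pages` is a hypothetical name.

```python
from html import escape


def build_jump_pages(urls, per_page=1000):
    """Return a list of HTML documents, each linking to up to per_page URLs."""
    pages = []
    for start in range(0, len(urls), per_page):
        chunk = urls[start:start + per_page]
        # Escape each URL so characters such as & stay valid inside the href.
        links = "\n".join(
            '<li><a href="%s">%s</a></li>' % (escape(u, quote=True), escape(u))
            for u in chunk
        )
        pages.append(
            "<!DOCTYPE html>\n"
            "<html>\n<head><title>URL list %d</title></head>\n"
            "<body>\n<ul>\n%s\n</ul>\n</body>\n</html>\n"
            % (len(pages) + 1, links)
        )
    return pages
```

Each returned string can be written to its own file and handed to the GSA as a URL to crawl.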

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/

Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on
GSA Schedule, and provides the highest caliber vendor-authorized
instruction at our training centers, online, or onsite.

angel

Jun 18, 2013, 12:55:14 AM
to Google-Search-...@googlegroups.com
Hi Dave,

Does the GSA crawler understand the crawl delay parameter in the robots.txt file?
Is there any other way to reduce crawler traffic?

Regards
Angel

angel

Jun 18, 2013, 1:02:41 AM
to Google-Search-...@googlegroups.com
Hi,

One more question: if the GSA supports the crawl delay parameter, which is the better option for reducing traffic,
crawl delay or Host Load Schedule?


Regards
Angel

Dave Watts

Jun 20, 2013, 9:55:21 AM
to Google-Search-...@googlegroups.com
> Does the gsa crawler understand the crawl delay parameter in the robots.txt
> file?

I don't know.

> Is there any other way to reduce crawler traffic ?

Yes, you can use Freshness Tuning or Host Load Schedule.

VFPT

Jun 28, 2013, 2:19:17 PM
to Google-Search-...@googlegroups.com
I would recommend using an HTML version of sitemap.xml, with a robots meta tag set to "noindex, follow" so that the GSA follows the URLs but does not index the HTML page itself. Just add that page to the "start crawling from this url list" in the GSA.
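
For reference, a small Python sketch of what such a page might look like, plus a check that the robots meta tag is really present before the page is given to the crawler. The sample page, the example.com URL, and the helper names are all hypothetical; the check only reads the tag and says nothing about how a particular GSA version treats it.

```python
from html.parser import HTMLParser

# Hypothetical jump page: the meta tag asks crawlers to follow the links
# but keep this page itself out of the index.
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta name="robots" content="noindex, follow">
<title>URL list for the crawler</title>
</head>
<body>
<a href="http://www.example.com/page1">page1</a>
</body>
</html>
"""


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html_text):
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives
```

A page is ready to hand over when `robots_directives(page)` contains both `noindex` and `follow`.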