REG: Crawling issue during feed XML submission

Rajesh

Feb 9, 2012, 2:27:59 AM
to Google Search Appliance/Google Mini
Hi All,
We are crawling a large number of multimedia documents (PDF,
Excel, Word) by submitting feed XML to the appliance. In the feed
XML, we define our own meta tags for each document and submit it
for crawling. The following is the feed XML we are submitting:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>testFeed</datasource>
<feedtype>metadata-and-url</feedtype>
</header>
<group>
<record url="http://multimedia.com/mws/mediawebserver?mwsId=SSSSSu7zK1fslxtUNxmBPx_9ev7qe17zHvTSevTSeSSSSSS--&amp;fn=70-2009-7555-8.pdf" action="add" mimetype="application/pdf">
<metadata>
<meta name="powTheme" content="en_US_Medical_WebSalesPortal"/>
<meta name="srty_lvl_code" content="1"/>
<meta name="Description" content="Essential Solutions: Hard to Dress Wounds, FAQs, Skin Health, pdf 70-2009-7555-8"/>
</metadata>
</record>
</group>
</gsafeed>
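
For a large number of documents, a feed like the one above could be generated rather than written by hand. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `build_feed` and the shape of its `records` argument are assumptions made for illustration, not part of any appliance tooling. Building the XML through a library also takes care of escaping `&` in the URL as `&amp;`.

```python
import xml.etree.ElementTree as ET

# DOCTYPE line copied from the feed shown in this post.
DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def build_feed(datasource, records):
    """Build a metadata-and-url feed as a string.

    `records` is a list of (url, mimetype, metadata_dict) tuples.
    This layout is a hypothetical convenience, not an appliance API.
    """
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(root, "group")
    for url, mimetype, metadata in records:
        record = ET.SubElement(
            group, "record",
            {"url": url, "action": "add", "mimetype": mimetype},
        )
        meta_parent = ET.SubElement(record, "metadata")
        for name, content in metadata.items():
            ET.SubElement(meta_parent, "meta", {"name": name, "content": content})
    # ElementTree escapes '&' in attribute values as '&amp;'.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + DOCTYPE + "\n" + body
```

The URL is passed in its plain form (with a bare `&`), and the metadata dictionary carries the custom meta tags such as `powTheme` or `srty_lvl_code`.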

Once the feed submission completed successfully, we checked the
crawl status of those documents, and they did not show up in Crawl
Diagnostics. They also did not show up in the search results.
But when we crawled those URLs manually (by adding them to the
followup crawl list and then to Freshness Tuning), they showed up in
Crawl Diagnostics and then in the search results as well.
We tried the same with a few more documents, and they were crawled
only after the manual crawl, not after the feed submission.
Why were those documents not crawled once the feed submission was
done? Do we need to manually trigger the crawl for all the documents
submitted through feed XML?
If so, how could I trigger it for such a large number of documents?
Is there an easy way to do that?

Please advise or help me with this.

Thanks,
Rajesh

Dave Watts

Feb 9, 2012, 10:23:37 AM
to google-search-...@googlegroups.com
>        Once the feed submission is done and completed successfully,
> we have checked the crawl status of those documents and it were not
> shown up in the crawl diagnostics. Those documents were also not shown
> up in the results.
>        But when we manually crawled those urls (by adding to
> followup  crawl list and then in Freshness tunning), then those urls
> were shown up in the crawl diagnostics and then it was also shown up
> in the results.
>        We have tried the same with few more documents, and then it
> were crawled only after the manual crawl not after the feed
> submission.

When you submit a metadata-and-URL feed, the appliance doesn't
immediately crawl the documents in the feed. It adds the URLs to the
crawl queue, and will get around to crawling them in due time. If the
appliance already knew about those URLs by discovering them during a
crawl, it won't recrawl them at all as a result of the feed, but will
instead recrawl them according to the previously determined crawl
schedule.
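
To make the distinction concrete, here is a rough sketch of the push step itself. It assumes the feed is sent as a `multipart/form-data` POST to the appliance's feed port (19900, path `/xmlfeed`) with the fields `feedtype`, `datasource`, and `data`, which is how the feeds protocol is generally described; the function names and the `appliance_host` parameter are hypothetical. A successful response here only means the appliance accepted the feed, not that the URLs have been crawled.

```python
import urllib.request
import uuid


def encode_multipart(fields):
    """Encode a dict of text fields as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def push_feed(appliance_host, datasource, feed_xml, feedtype="metadata-and-url"):
    """POST a feed to the appliance and return the response body as text.

    Port 19900 and the /xmlfeed path are assumptions based on the
    documented feeds protocol; adjust them for your environment.
    """
    body, content_type = encode_multipart(
        {"feedtype": feedtype, "datasource": datasource, "data": feed_xml}
    )
    request = urllib.request.Request(
        f"http://{appliance_host}:19900/xmlfeed",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

After a push like this, the URLs sit in the crawl queue until the appliance gets to them, so checking crawl status right away would be expected to show them as not yet crawled.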

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/

Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on
GSA Schedule, and provides the highest caliber vendor-authorized
instruction at our training centers, online, or onsite.

Rajesh

Feb 14, 2012, 11:01:26 AM
to Google Search Appliance/Google Mini
Hi Dave,
We checked Crawl Diagnostics only after the number of URLs in the
crawl queue had dropped to zero. Even then, the documents submitted
through the feed XML were not showing up in Crawl Diagnostics.
Will it take a few more days, or longer? Why is that so?

Thanks,
Rajesh

Dave Watts

Feb 14, 2012, 12:24:39 PM
to google-search-...@googlegroups.com

It's hard to say how long it should take, but generally it shouldn't
take more than a day or so, and usually it should take even less than
that. It all depends on how busy your appliance is, really.

What do you see in the Feeds screen?

When you look at Crawl Diagnostics, are you looking at the default collection?

Rajesh

Feb 17, 2012, 4:47:33 AM
to Google Search Appliance/Google Mini
Hi Dave,
We checked Crawl Diagnostics only after a few days, and we looked
at the correct collection. The Feeds screen says the feed completed
successfully. But when we check the crawl status of a URL from the
feed XML, it says the URL has not been crawled.
We have found one more thing about the feed XML submission. We are
submitting the same multimedia document under different URLs, with
different metadata, in different feeds. For example, we are submitting
the two URLs below in different feeds with different metadata:
"http://multimedia.com/mws/mediawebserver?mwsId=SSSSSu7zK1fslxtUNxmBPx_9ev7qe17zHvTSevTSeSSSSSS--&amp;fn=70-2009-7555-8.pdf"
"http://multimedia.com/mws/mediawebserver?mwsId=SSSSSu7zK1fslxtUNxmBPx_9ev7qe17zHvTSevTSeSSSSSS--&amp;fn=70-2009-7555-8.pdf&feed=dmr"

Both URLs above lead to the same document. After submitting the two
feeds, we saw that only the second URL was crawled (its feed was
submitted first); the other URL was not crawled at all, which is
the issue here.
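
To see how many submitted URLs are affected, one option is to group them by everything except the extra query parameter. The sketch below is only an illustration: it assumes that `feed` is the sole parameter distinguishing such aliases (as in the two URLs above) and that URLs are passed in plain form, with `&` rather than `&amp;`. Whether the appliance actually treats such pairs as duplicates is not confirmed here; the helper names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, ignored_params=("feed",)):
    """Return the URL with the ignored query parameters removed."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored_params
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
    )


def find_aliases(urls, ignored_params=("feed",)):
    """Group URLs that differ only by the ignored query parameters.

    Returns {canonical_url: [original_urls]} for groups with more
    than one member, i.e. URLs likely pointing at the same document.
    """
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_url(url, ignored_params)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Running `find_aliases` over the full list of URLs from all feeds would list every pair like the one above, which makes it easier to compare their crawl status side by side.
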
So is there any issue with crawling two different URLs that
point to the same content?

If that is the problem, what can we do?

Please help me with this.