MS SQL Database crawl - right number of rows, but no results returned

SBW

Dec 28, 2009, 2:26:24 PM

to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini

We're just starting to use our GSA's database crawling functionality,
but despite the documentation and a handful of posts here, I can't
seem to get things working correctly.

I've set up an MS SQL-based data source in Crawl and Index > Databases.
Data display/usage: Meta data with Document ID field.

If I run a sync, I get what I believe to be a good sign - no errors
and the right number of rows crawled.

Dec 28, 2009 9:19:00 AM com.google.enterprise.database.TableCrawler <init>
INFO: Current local time: 2009/12/28 09:19:00 PST
Dec 28, 2009 9:19:00 AM com.google.enterprise.database.Table getConnection
INFO: connecting to database XXXX on host xxx-yy-xxx-yyy via port zzzz
Dec 28, 2009 9:19:01 AM com.google.enterprise.database.AbstractTableReader parse
INFO: full crawl
Dec 28, 2009 9:19:01 AM com.google.enterprise.database.AbstractTableReader parse
INFO: metadata-and-url crawl
168 rows crawled.
0 rows failed.
Dec 28, 2009 9:19:01 AM com.google.enterprise.database.TableCrawler crawl
INFO: total time = 574

Under Crawl and Index > Follow and Crawl ... I have the following, as
well as the other URLs we're crawling:

^googledb://

I set up a new collection with Include Content Matching the Following
Patterns set to the following (for simplicity, since getting more specific
wasn't working):

^googledb://

I checked Crawl and Index > Feeds, and while the database shows up
with a status of completed, and again the right number of documents,
there's nothing when I View Feed Data Source Log.

Further, if I go into Status and Reports > Crawl Diagnostics, I see
nothing.

I've also tried the various http://<appliance_ip_address>/db/ URLs, but
keep getting 404s (I'm thinking because I don't know the final hash,
but I'm honestly clueless as to whether this will even help).

So I'm stumped. Any suggestions on how I can further debug this? The
connection seems to be working fine, and the query is returning what
fields I want the GSA to search through, so I'm thinking I've missed a
step in the GSA administrative interface.

Version 5.2.0 (we'll be upgrading in a month or so, after some
projects get completed, so 'early' upgrade is probably out, unless
required).

Thanks for any suggestions,

~James

SBW

Dec 30, 2009, 2:34:35 PM

In case someone is in the same boat in the future ...

The issue appears to be that I neglected to add the resulting page URL
(defined in the Base URL field) to "Include Content Matching the
Following Patterns" on the collection itself.

I'm still not sure what purpose (if any) there is in adding
^googledb://<...> in there as well, per the docs, but ...

Since we're already crawling this content as part of our normal Web
site crawl, I'm actually wondering if there will be overlap that I'm
trying to stay away from ...

~James

JMarkham

Dec 30, 2009, 4:56:41 PM

Hi James,

I believe your fear is correct. If you're already Web crawling
the same content that you're also feeding in via the Database feed, then
you may create two document entries.

Is there a reason that you're feeding content by both mechanisms?
Depending on what you're doing or need, there may be better solutions.

Jeff

SBW

Dec 31, 2009, 11:06:16 AM

Thanks for your response Jeff.

Ideally we want users to be able to search a sub-section of products.
Since our URLs are of the same format (/xxx?yyyy&zzz, with x being the
same across all, y being variable and non-unique, and z being the
product id), the only way to determine a sub-section is by checking a
field in the database.

Luckily, it looks like my steps above were basically correct, and the
googledb is still required. Here's the relevant part of the last
response I received from support about this:

===

In the case of a meta data and url crawl, the content will have to be
crawled. The feed is only used to associate the meta data with the
document. Hence you will need to add the final url pattern of the
documents in both of the following sections of the admin console:

- the Crawl & Index: Follow & Crawl patterns
- any include pattern of the collections you want to have the content in.

The googledb pattern is only for the feeding part, but it is still
necessary to add it to both.
Please keep in mind that for a meta data and url feed, the indexing will
happen in three parts:
- the database crawl, which happens when you click the sync button.
- the feed processing, which happens after the database crawl is
complete.
- once the feed is processed, the appliance will process the documents
through the regular crawl process: download the url content and add
them to the index.

===

The documentation makes it seem like the database crawl is where the
content comes from, not that it creates the links that are actually
crawled, and helps filter what content should be included. Since I was
only adding the googledb part, that's as far as it was able to go.
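
For anyone checking their own setup, here is a rough Python sketch of the
check I effectively skipped: given the patterns entered in the admin console,
which document URLs are not covered by any of them? It only handles patterns
written with a leading ^ (assumed here to mean "URL starts with"), which is
the only kind used in this thread. The function names and example URLs are
hypothetical, and this is not how the appliance itself evaluates patterns.

```python
from typing import Iterable, List


def matches_prefix_pattern(url: str, pattern: str) -> bool:
    """Simplified check: treat a pattern starting with '^' as a URL prefix.

    Assumption: only '^'-anchored patterns are supported in this sketch.
    """
    if not pattern.startswith("^"):
        raise ValueError("this sketch only handles '^' prefix patterns")
    return url.startswith(pattern[1:])


def uncovered_urls(urls: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the URLs that no pattern in `patterns` matches."""
    pattern_list = list(patterns)
    return [
        url
        for url in urls
        if not any(matches_prefix_pattern(url, p) for p in pattern_list)
    ]


if __name__ == "__main__":
    # Hypothetical URLs: one feed-style URL and one final page URL.
    urls = [
        "googledb://example-source/1",
        "http://www.example.com/xxx?yyyy&zzz=1",
    ]
    # With only the googledb pattern, the final page URL is left uncovered.
    print(uncovered_urls(urls, ["^googledb://"]))
    # Adding the final URL pattern covers both.
    print(uncovered_urls(urls, ["^googledb://", "^http://www.example.com/xxx"]))
```

The same list of patterns would need to pass this check for both Follow and
Crawl and the collection's include patterns.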

~James

JMarkham

Dec 31, 2009, 11:36:28 AM

In reality, the metadata and URL feed (which we use here extensively)
provides a feed to the crawler, with extra metadata provided by you.
The crawler then crawls the URL that your feed provided and merges the
indexed data with the database data to produce one record. To see or
interact with the metadata from the database, though, you have to use
special query terms like inmeta, requiredfields or partialfields, OR
you would modify your XSLT to directly display that metadata. One of
our implementations provides avatar photo information for a people
directory, for instance, so pictures are available (the use of
incremental updates also keeps our user population constantly
up-to-date).
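
As a rough illustration of the requiredfields approach, the Python sketch
below builds a search URL that restricts results to documents whose metadata
field has a given value. The host, collection, front end, and the field name
"subsection" are all hypothetical, and the parameter names (q, site, client,
output, requiredfields) are as I recall them from the search protocol, so
check them against the documentation for your appliance version.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode


def build_search_url(
    host: str,
    query: str,
    collection: str,
    frontend: str,
    required_field: Optional[Tuple[str, str]] = None,
) -> str:
    """Build a search request URL, optionally filtered on one metadata field.

    Assumption: `requiredfields` takes a "name:value" pair; more complex
    combinations of fields are not covered by this sketch.
    """
    params = {
        "q": query,
        "site": collection,
        "client": frontend,
        "output": "xml_no_dtd",
    }
    if required_field is not None:
        name, value = required_field
        params["requiredfields"] = f"{name}:{value}"
    # urlencode percent-encodes reserved characters such as ':'.
    return f"http://{host}/search?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical values throughout.
    print(
        build_search_url(
            "gsa.example.com",
            "pump",
            "default_collection",
            "default_frontend",
            required_field=("subsection", "widgets"),
        )
    )
```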

Jeff
