GSA Indexing File Types It Should NOT?

digitalQ

Apr 6, 2010, 10:15:19 AM
to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini
Hello -

I was under the distinct impression that the GSA will not even attempt
to index files with types NOT listed in this document "Indexable File
Formats".
http://code.google.com/apis/searchappliance/documentation/62/reference/formats.html

However, the GSA not only indexed all sorts of other files (ini, sym,
mdb, and many more), it did so successfully.

Any advice would be helpful.

Thank you,
Richard

GB 1001 Software Version: 6.2.0.G.14 (Active)

We are crawling smb file shares. In case it matters, my pattern is as
follows:

Follow and Crawl:
regexpIgnoreCase:smb://myComputer\\.myDomain\\.com/KOLFiles/(UserResumes|Attachments)/$
regexp:smb://myComputer\\.myDomain\\.com/KOLFiles/(UserResumes|Attachments)/bin[0-9]{3}/$
regexp:smb://myComputer\\.myDomain\\.com/KOLFiles/(UserResumes|Attachments)/bin[0-9]{3}/[0-9A-F]{8}\\-[0-9A-F]{4}\\-[0-9A-F]{4}\\-[0-9A-F]{4}\\-[0-9A-F]{12}\\.[0-9A-Za-z]+$
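
As a rough check of what the file-level pattern above admits, here is a minimal Python sketch. It assumes Python's `re` module treats this pattern the same way the appliance's regexp matching does (an assumption, not something confirmed in this thread), writes the escapes with single backslashes, and uses a made-up GUID file name.

```python
import re

# File-level Follow and Crawl pattern from the post, rewritten for Python's
# `re` (single backslashes). Assumption: the appliance's regexp dialect
# behaves like Python's for this pattern.
FILE_PATTERN = re.compile(
    r"smb://myComputer\.myDomain\.com/KOLFiles/(UserResumes|Attachments)/"
    r"bin[0-9]{3}/[0-9A-F]{8}\-[0-9A-F]{4}\-[0-9A-F]{4}\-"
    r"[0-9A-F]{4}\-[0-9A-F]{12}\.[0-9A-Za-z]+$"
)

BASE = "smb://myComputer.myDomain.com/KOLFiles/UserResumes/bin001/"
GUID = "0123ABCD-4567-89EF-0123-456789ABCDEF"  # hypothetical file name


def is_followed(url: str) -> bool:
    """Return True if the URL matches the file-level pattern."""
    return FILE_PATTERN.search(url) is not None


# Try a few extensions, including the unwanted ones from the post.
for ext in ("doc", "pdf", "ini", "sym", "mdb"):
    print(ext, is_followed(f"{BASE}{GUID}.{ext}"))
```

Under that assumption, every extension in the loop matches: the trailing `\.[0-9A-Za-z]+$` accepts any alphanumeric extension, so the pattern by itself does not restrict file types.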

onixterry

Apr 6, 2010, 10:22:40 AM
The appliance indexes files based upon the MIME type of the content.
If it recognizes the file's type, it will include it. For
example, an "ini" file is likely a text file. This is more of an
issue for file shares than for web sites, since these types of
files are not typically linked to from a web site.

You can use exclusion rules to filter out specific types of files.

Terry Chambers
Sr. Systems Engineer
Onix Networking Corp.
http://www.onixnet.com/


digitalQ

Apr 6, 2010, 10:36:30 AM
Terry,

Thank you very much for the quick reply.

I didn't realize that the MIME type overrides the file extension. I
don't remember any of the documentation noting that, or maybe I
missed it. The Google docs specifically state, "The Google Search
Appliance cannot crawl, index, or search any file formats that are not
listed." and I couldn't find ANYTHING about the MIME type.

How sure are you of this?

You are also correct, I am crawling about 160,000 documents in smb
file shares. When the GSA successfully crawls unwanted files, they
count against our document license count, and even though the crawl was
"successful", the result is not valuable or desirable to have
in our index.

Is there any way to "white list" file extensions? I understand I can
black list extensions by including line items such as
regexpIgnoreCase:\\.ini$ in the Do Not Crawl settings, but this is not
ideal because I will never know every file type in these dynamic
folders. In our case, a white list of known file types would be best.

Is there any way to do this without creating a big, messy regexp in the
follow patterns?
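
One way to keep a white list manageable is to generate the pattern from a plain list of extensions instead of maintaining the regexp by hand. The following is a minimal Python sketch of that idea, not appliance configuration: the extension list is hypothetical, the pattern is written for Python's `re` with single backslashes, and `re.IGNORECASE` stands in for the `regexpIgnoreCase:` prefix (an assumption about equivalent behavior).

```python
import re

# Hypothetical white list; substitute the extensions you actually want indexed.
ALLOWED_EXTENSIONS = ["doc", "docx", "pdf", "rtf", "txt"]


def build_whitelist_pattern(extensions):
    """Build the file-level pattern with a fixed extension group in place of
    the catch-all `[0-9A-Za-z]+`."""
    ext_group = "|".join(re.escape(ext) for ext in extensions)
    return (
        r"smb://myComputer\.myDomain\.com/KOLFiles/(UserResumes|Attachments)/"
        r"bin[0-9]{3}/[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-"
        r"[0-9A-F]{4}-[0-9A-F]{12}\.(" + ext_group + r")$"
    )


# Case-insensitive so that .PDF and .pdf are treated alike.
WHITELIST = re.compile(build_whitelist_pattern(ALLOWED_EXTENSIONS), re.IGNORECASE)


def is_whitelisted(url):
    """Return True if the URL ends in one of the allowed extensions."""
    return WHITELIST.search(url) is not None


base = "smb://myComputer.myDomain.com/KOLFiles/Attachments/bin042/"
name = "0123ABCD-4567-89EF-0123-456789ABCDEF"  # hypothetical file name
print(is_whitelisted(base + name + ".pdf"))
print(is_whitelisted(base + name + ".ini"))
```

The generated string would still need to be converted to the appliance's own syntax (a `regexp:` or `regexpIgnoreCase:` prefix and the doubled backslashes used in the patterns earlier in this thread) and verified on the appliance before relying on it.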

Thanks again for your reply.

~Richard


onixterry

Apr 6, 2010, 10:43:14 AM
I am not 100% sure; this is based upon my observations. From what I
have seen, the crawler seems to analyze the files and include them.

If you crawl a "file share" site that contains things such as user
backups or contains images of installed software, you will pick up a
lot of unwanted content.

Unless your content is organized in a very structured way, I cannot
think of any way to handle this other than through the black list of
file extensions.

Terry

digitalQ

Apr 6, 2010, 10:53:12 AM
Okay - thanks Terry.

I think I'm going to run this by Enterprise Support before I start
creating a messy black list of extensions.

Thanks for the help.

Dave Watts

Apr 6, 2010, 3:36:31 PM
to google-search-...@googlegroups.com
> I was under the distinct impression that the GSA will not even attempt
> to index files with types NOT listed in this document "Indexable File
> Formats".
> http://code.google.com/apis/searchappliance/documentation/62/reference/formats.html
>
> However, not only did the GSA index all sorts of files, it did it
> successfully? (ini, sym, mdb etc etc) and many many more.

Your impression is incorrect. The GSA has specific converters for the
listed file formats. But it will index any file not specifically
excluded in the "Do Not Crawl" box.
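
The rule described here can be modeled as two pattern lists: a URL is crawled when it matches a follow pattern and matches nothing in Do Not Crawl, regardless of whether a converter exists for its format. Below is a minimal Python sketch of that model. The pattern lists and URLs are hypothetical, the patterns are written for Python's `re`, and the appliance's actual evaluation may differ in detail.

```python
import re

# Hypothetical pattern lists, written for Python's `re`.
FOLLOW_PATTERNS = [r"smb://myComputer\.myDomain\.com/KOLFiles/"]
DO_NOT_CRAWL_PATTERNS = [r"\.ini$", r"\.sym$", r"\.mdb$"]


def should_crawl(url, follow=FOLLOW_PATTERNS, exclude=DO_NOT_CRAWL_PATTERNS):
    """Crawl a URL if it matches a follow pattern and no exclusion pattern."""
    followed = any(re.search(p, url, re.IGNORECASE) for p in follow)
    excluded = any(re.search(p, url, re.IGNORECASE) for p in exclude)
    return followed and not excluded


share = "smb://myComputer.myDomain.com/KOLFiles/Attachments/bin001/"
for filename in ("resume.doc", "settings.ini", "backup.bak"):
    print(filename, should_crawl(share + filename))
```

In this model the `.bak` file is still crawled because nothing excludes it, which is why a black list has to enumerate every unwanted extension.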

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/

Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on
GSA Schedule, and provides the highest caliber vendor-authorized
instruction at our training centers, online, or onsite.

digitalQ

Apr 6, 2010, 4:04:08 PM
Hi Dave - Yes, you and Terry are both correct; Enterprise Support
confirms this as well.

Thanks,
~Richard

JMarkham

Apr 6, 2010, 4:39:21 PM
Hi,

Just a $.02 addition... It would seem a better solution to white list
the file types you want in the Follow and Crawl patterns, rather than
take on the rather involved effort of black listing what you don't
want. You won't be able to use content statistics as a reference for
file types, either, since those report MIME types, not extensions (as
noted, a .ini file has a text MIME type, for instance).
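
Since content statistics report MIME types, one alternative is to count extensions directly from a list of file URLs, however you obtain it (for example, a directory listing of the share). The following is a minimal Python sketch under that assumption; the function name and sample URLs are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def extension_counts(urls):
    """Count lowercase file extensions across a list of file URLs."""
    counts = Counter()
    for url in urls:
        # Take the path part of the URL and read its final suffix, if any.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts


sample = [
    "smb://myComputer.myDomain.com/KOLFiles/Attachments/bin001/a.INI",
    "smb://myComputer.myDomain.com/KOLFiles/Attachments/bin001/b.ini",
    "smb://myComputer.myDomain.com/KOLFiles/Attachments/bin001/c.pdf",
]
for ext, n in extension_counts(sample).most_common():
    print(ext, n)
```

The resulting tally gives a starting point for deciding which extensions belong on the white list.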

Jeff
