I was under the distinct impression that the GSA will not even attempt
to index files with types NOT listed in this document "Indexable File
Formats".
http://code.google.com/apis/searchappliance/documentation/62/reference/formats.html
However, the GSA not only crawled all sorts of unlisted file types (ini,
sym, mdb, and many more), it indexed them successfully.
Any advice would be helpful.
Thank you,
Richard
GB 1001 Software Version: 6.2.0.G.14 (Active)
We are crawling smb file shares. In case it matters, my pattern is as
follows:
Follow and Crawl:
regexpIgnoreCase:smb://myComputer\\.myDomain\\.com/KOLFiles/(UserResumes|Attachments)/$
regexp:smb://myComputer\\.myDomain\\.com/KOLFiles/(UserResumes|Attachments)/bin[0-9]{3}/$
regexp:smb://myComputer\\.myDomain\\.com/KOLFiles/(UserResumes|Attachments)/bin[0-9]{3}/[0-9A-F]{8}\\-[0-9A-F]{4}\\-[0-9A-F]{4}\\-[0-9A-F]{4}\\-[0-9A-F]{12}\\.[0-9A-Za-z]+$
You can use exclusion rules to filter out specific types of files.
Terry Chambers
Sr. Systems Engineer
Onix Networking Corp.
http://www.onixnet.com/
On Apr 6, 10:15 am, digitalQ <rich...@digitalq.com> wrote:
> Hello -
>
> I was under the distinct impression that the GSA will not even attempt
> to index files with types NOT listed in this document "Indexable File
> Formats".
> http://code.google.com/apis/searchappliance/documentation/62/referenc...
Thank you very much for the quick reply.
I didn't realize that the mime-type overrides the file extension. I
don't remember any of the documentation noting that, or maybe I
missed it. The Google docs specifically state, "The Google Search
Appliance cannot crawl, index, or search any file formats that are not
listed." I couldn't find ANYTHING about the mime-type.
How sure are you of this?
You are also correct: I am crawling about 160,000 documents in smb
file shares. When the GSA successfully crawls unwanted files, they
count against our document license count, and even though the crawl was
"successful", the results are not valuable or desirable to have
in our index.
Is there any way to "white list" file extensions? I understand I can
black list extensions by including line items such as
regexpIgnoreCase:\\.ini$ in the Do Not Crawl settings, but this is not
ideal because I will never know every file type in these dynamic
folders. In our case, a white list of known file types would be best.
Is there any way to do this without creating a big, messy regexp in the
Follow and Crawl patterns?
Thanks again for your reply.
~Richard
On Apr 6, 7:22 am, onixterry <te...@onixnet.com> wrote:
> The appliance indexes files based upon the mime-type of the content.
> If it recognizes the type of file it is, it will include it. For
> example, an "ini" file is likely a text file. This is more of an
> issue for file shares as opposed to web sites since these types of
> files are not typically linked to from a web site.
>
> You can use exclusion rules to filter out specific types of files.
>
> Terry Chambers
> Sr. Systems Engineer
> Onix Networking Corp.
> http://www.onixnet.com/
If you crawl a "file share" site that contains things such as user
backups or contains images of installed software, you will pick up a
lot of unwanted content.
Unless your content is organized in a very structured way, I cannot
think of any way to handle this other than through the black list of
file extensions.
Terry
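If the black list route is taken, the Do Not Crawl entries can be generated rather than typed by hand. This is a minimal sketch, assuming one line per extension in the regexpIgnoreCase:\\.ini$ style quoted earlier in the thread; the function name and the extension list are hypothetical.

```python
def do_not_crawl_lines(extensions):
    """Build one Do Not Crawl line per extension.

    The doubled backslash mirrors how patterns are written in this thread.
    """
    lines = []
    for ext in sorted({e.lower().lstrip(".") for e in extensions}):
        if not ext.isalnum():
            # Refuse anything that would need regexp escaping.
            raise ValueError(f"unexpected extension: {ext!r}")
        lines.append(r"regexpIgnoreCase:\\." + ext + "$")
    return lines


# Extensions mentioned in the thread; extend the list as new types turn up.
print("\n".join(do_not_crawl_lines(["ini", "sym", "mdb"])))
```

The output can be pasted into the Do Not Crawl box, but the list still has to be maintained as new file types appear in the share.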
I think I'm going to run this by Enterprise Support before I start
creating a messy black list of extensions.
Thanks for the help.
Your impression is incorrect. The GSA has specific converters for the
listed file formats. But it will index any file not specifically
excluded in the "Do Not Crawl" box.
Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/
Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on
GSA Schedule, and provides the highest caliber vendor-authorized
instruction at our training centers, online, or onsite.
Thanks,
~Richard
On Apr 6, 12:36 pm, Dave Watts <dwa...@figleaf.com> wrote:
> Your impression is incorrect. The GSA has specific converters for the
> listed file formats. But it will index any file not specifically
> excluded in the "Do Not Crawl" box.
>
> Dave Watts, CTO, Fig Leaf Software
> http://www.figleaf.com/
> http://training.figleaf.com/
Just a $.02 addition... It would seem a better solution to white list
the file types you want in the Follow and Crawl patterns, rather than
take on what would seem to be a rather involved effort to black list
what you don't want. You won't be able to use content statistics as a
reference for file types, either, since those report MIME types, not
extensions (as noted, a .ini file has a text MIME type, for instance).
Jeff
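One way to keep a white list from turning into a hand-written regexp is to generate the Follow and Crawl line from a short list of extensions. The sketch below makes several assumptions: the helper names are hypothetical, the extension list is only a placeholder, and switching the file-level pattern to regexpIgnoreCase also makes the GUID part case-insensitive. The small Python conversion is there only to sanity-check the result locally; it does not claim to reproduce the GSA's matching exactly.

```python
import re


def whitelist_follow_pattern(prefix, extensions, kind="regexpIgnoreCase:"):
    """Return a Follow and Crawl line that ends in an extension white list."""
    exts = sorted({e.lower().lstrip(".") for e in extensions})
    if not exts or not all(e.isalnum() for e in exts):
        raise ValueError("extensions must be non-empty and alphanumeric")
    return kind + prefix + r"\\.(" + "|".join(exts) + ")$"


def to_python_regex(gsa_pattern):
    """Approximate a thread-style pattern with Python's re for local testing."""
    flags = re.IGNORECASE if gsa_pattern.startswith("regexpIgnoreCase:") else 0
    body = gsa_pattern.split(":", 1)[1]
    # Collapse the doubled backslashes used in the thread to single ones.
    return re.compile(body.replace("\\\\", "\\"), flags)


# File-level pattern from the original post, minus the trailing extension part.
PREFIX = (
    r"smb://myComputer\\.myDomain\\.com/KOLFiles/(UserResumes|Attachments)/"
    r"bin[0-9]{3}/[0-9A-F]{8}\\-[0-9A-F]{4}\\-[0-9A-F]{4}\\-[0-9A-F]{4}\\-"
    r"[0-9A-F]{12}"
)

# Placeholder white list; substitute the file types you actually want indexed.
pattern = whitelist_follow_pattern(PREFIX, ["doc", "docx", "pdf", "rtf", "txt"])
print(pattern)
```

The generated line would replace the third Follow and Crawl pattern, so adding a file type later means editing the list rather than the regexp.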