application/octet-stream


damalefer

Dec 1, 2009, 11:43:39 PM
to Google Search Appliance/Google Mini
Hi everybody!!

I have a question: what does the "application/octet-stream" term
under "Content Statistics" mean? What kinds of file types fall into
that group?

I want to exclude or minimize the items included in that group.

Any ideas?

Best regards

damalefer

Dec 2, 2009, 3:41:57 PM
to Google Search Appliance/Google Mini
Any idea for this topic? :(

Joe D'Andrea

Dec 2, 2009, 3:52:37 PM
to google-search-...@googlegroups.com
Greetings!

On Wed, Dec 2, 2009 at 3:41 PM, damalefer <dama...@gmail.com> wrote:

> Any idea for this topic? :(

application/octet-stream usually refers to binary files (like
executables and so on).

On one hand, you could just check to see if any exe files (and the
like) are being crawled. They shouldn't be, but you never know. Add
any extensions you don't want crawled to the Do Not Crawl patterns in
your config.

However, this notion of "octet-stream" is ultimately indicated by the
web server. Your web server already has these associations set up
(like text/plain for text files, text/html for web pages, etc.).
There's also a _default_ association, though. So, if your web server
config has "application/octet-stream" as its default file type (MIME
type), there might be other file extensions that are just being
declared octet-stream by default.

Sometimes this is a good thing. Sometimes … not so much. It all
depends on context.

I'd take inventory of your file extensions first and see if anything
jumps out that way. Then decide to block or not. Also check your web
server settings for any default MIME/content types and see if
application/octet-stream is set. (Again, you may want to keep it.
Depends on why it was set up that way. Just something to look out
for.)
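
To make the default-type idea concrete, here is a minimal Python sketch. It is not how any particular web server or the GSA works internally; the KNOWN_TYPES table, DEFAULT_TYPE, and both function names are made up for illustration. It mimics a server that maps a few extensions to MIME types and declares everything else application/octet-stream, then tallies which extensions land in that default bucket.

```python
import posixpath

# Hypothetical extension-to-type table, standing in for a web server's
# MIME configuration. A real server's table is much longer.
KNOWN_TYPES = {
    ".html": "text/html",
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".doc": "application/msword",
}

# Assumed server default for anything not listed above.
DEFAULT_TYPE = "application/octet-stream"


def served_type(path):
    """Return the MIME type this pretend server would declare for a path."""
    ext = posixpath.splitext(path)[1].lower()
    return KNOWN_TYPES.get(ext, DEFAULT_TYPE)


def extensions_falling_back(paths):
    """Count, per extension, the paths that get the default type."""
    counts = {}
    for path in paths:
        if served_type(path) == DEFAULT_TYPE:
            ext = posixpath.splitext(path)[1].lower() or "(none)"
            counts[ext] = counts.get(ext, 0) + 1
    return counts
```

Running extensions_falling_back over a listing of your own file paths is one way to do the "take inventory" step: whatever shows up in the result is a candidate for either a Do Not Crawl pattern or an explicit MIME mapping on the server.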

--
Joe D'Andrea | Liquid Joe LLC
Google Enterprise Consultant | iPhone Application Developer
www.liquidjoe.biz | skype:joedandrea | +1 (908) 781-0323

damalefer

Dec 2, 2009, 4:23:16 PM
to Google Search Appliance/Google Mini
Thanks Joe, thanks a lot!

This is the idea: I want to crawl only .DOC and .PDF files in my file
server.

If I add up application/octet-stream, text/plain and text/html, the
total is 8667 items. I would prefer that this capacity be used to
crawl Word and Acrobat files (NO OTHER FILE TYPE).

Thanks a lot again!


Joe D'Andrea

Dec 3, 2009, 8:19:44 AM
to google-search-...@googlegroups.com
On Wed, Dec 2, 2009 at 4:23 PM, damalefer <dama...@gmail.com> wrote:

> This is the idea: I want to crawl only .DOC and .PDF files in my file
> server.

First, a URL patterns primer:
http://code.google.com/apis/searchappliance/documentation/60/admin/URL_patterns.html

Next, some thoughts:

If your "Follow and Crawl" rules are really simple/minimal, like this:

^http://www.mysite.com/

You could change that to something like this:

regexp:http://www\\.mysite\\.com/.+\\.(pdf|doc)$

This matches on your site (http scheme), then one or more characters
(.+), all ending in .pdf or .doc. (Adjust/strengthen to taste.)

This follows the "that which is not expressly permitted is denied"
school of thought. So now you know you're ONLY going to get pdfs or
docs in your crawl/index.
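
If you want to sanity-check a pattern like that before putting it in the config, a quick Python sketch can help. This only approximates the GSA's matching, and the names ALLOW and is_allowed are made up here. Note that Python's raw-string spelling uses single backslashes where the GSA form above uses doubled ones, and www.mysite.com is the same placeholder host as above.

```python
import re

# Python spelling of the allow pattern. The GSA config form doubles
# each backslash; in a Python raw string a single backslash is enough.
ALLOW = re.compile(r"http://www\.mysite\.com/.+\.(pdf|doc)$")


def is_allowed(url):
    """Return True if the URL would pass the pdf/doc-only allow pattern."""
    return ALLOW.search(url) is not None


# A few sample URLs (all hypothetical) to try the pattern against.
samples = [
    "http://www.mysite.com/files/report.pdf",
    "http://www.mysite.com/a/b/memo.doc",
    "http://www.mysite.com/index.html",
    "http://www.mysite.com/files/memo.docx",
    "http://www.mysite.com/files/REPORT.PDF",
]

for url in samples:
    print(url, is_allowed(url))
```

In this sketch the match is case-sensitive, so REPORT.PDF is rejected, and .docx is rejected because the pattern is anchored right after "doc". Whether you want either of those depends on what is on your file server, so adjust the alternation to taste.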

The other option is to exclude anything that _isn't_ a PDF or DOC in
the "Do Not Crawl" list (a 'negative assertion') … if there were a way
to do that on the GSA. I don't think the metacharacters for that are
available in the GSA implementation. In a regexp flavor that does
support them, the pattern would look something like this:

regexp:^(?!.*\\.(pdf|doc)$)

The "(?!...)" construct is a "negative lookahead" assertion in regexp
parlance.
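
For the curious, here is roughly what a negative lookahead does in a regexp engine that supports it, using Python's re module and the standard (?!...) syntax. This is only an illustration of the concept, not something to paste into the GSA; the names NOT_PDF_OR_DOC and would_exclude are made up for this sketch.

```python
import re

# Matches at the start of any URL that does NOT end in .pdf or .doc.
# (?!...) is a negative lookahead: it succeeds only when the enclosed
# pattern cannot be matched from the current position.
NOT_PDF_OR_DOC = re.compile(r"^(?!.*\.(pdf|doc)$)")


def would_exclude(url):
    """Return True if a 'do not crawl' rule built this way would hit the URL."""
    return NOT_PDF_OR_DOC.search(url) is not None


print(would_exclude("http://www.mysite.com/index.html"))
print(would_exclude("http://www.mysite.com/files/report.pdf"))
```

The first call reports a hit (the page would be excluded) and the second does not, which is the mirror image of the "only permit pdf and doc" rule.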

Then again, the aforementioned permitted/denied rule is a good
practice to follow in general, so hopefully that does the trick!
