how did you decide which mini to buy?


mtraney

Dec 29, 2009, 2:24:59 PM
to Google Search Appliance/Google Mini - Google Search Appliance/Google Mini
Our university has multiple servers and, frankly, tons of junk stored
on the web servers that is not actually part of the site. For those of
you who bought a Mini, how did you figure out how many documents would
actually be crawled? And I assume the count includes all the linked
HTML pages as well as other text document types (PDF, DOC, etc.) but
NOT images? Thanks.

brianb

Jan 5, 2010, 12:45:48 AM
Yeah, a lot of people end up with far more pages than they expect, but
you can limit the crawl to specific hosts and so on if you would
otherwise go over your license limit. The best approach would be to
get a rough estimate from each webmaster and start from there. The
Minis can handle up to 300K documents each, and the licenses are
upgradeable from 50K to 300K documents, so you can always adjust as
you go. You can also limit the crawl to only HTML, only PDF, or
whichever types you please.

Not sure if this is the exact answer you are looking for but hope it
helps.

Brian

Edward

Jan 5, 2010, 9:45:11 AM
Assuming you have someone with root access, you could script a
recursive find and write all the results to a log file. You can then
cat your log file and pipe it through grep -v to remove unwanted file
types. After that, export chunks from the log file and work on
removing unwanted directories. Your site uses an asp approach
sporadically, so you may have to combine the asp pages with the html
documents to make this approach work.
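
If a scripting language is easier to hand to a webmaster than a find/grep pipeline, the same idea can be sketched in Python. This is only a rough sketch: the function name count_documents, the skipped extensions and the skipped directory names are all assumptions for illustration, not anything specific to the Mini. It counts files on disk, which is only an estimate of what a crawler following links would actually reach.

```python
import os
from collections import Counter

# Assumed lists; adjust them to what your crawl should actually index.
SKIP_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".bmp", ".ico"}
SKIP_DIRS = {"backup", "old", "tmp"}  # hypothetical directory names


def count_documents(docroot):
    """Count files under docroot by extension, skipping unwanted types and directories."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(docroot):
        # Prune unwanted directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d.lower() not in SKIP_DIRS]
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext in SKIP_EXTENSIONS:
                continue  # same effect as filtering the log with grep -v
            counts[ext or "(none)"] += 1
    return counts
```

Calling count_documents on a web server's document root and summing the values gives a per-server total; the per-extension breakdown makes it easy to see how many asp pages sit alongside the html documents.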

I'm not sure if this feature is available on the Mini, but I use a
regexp to retrieve content only from the site's root folder down to
three folders deep. A similar approach may work?
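
For estimating purposes, a depth limit like that can be tried against a plain list of URLs before touching any appliance settings. The pattern below is a sketch in Python's regexp syntax, not the Mini's own pattern syntax, and the name within_three_folders is made up for illustration. It treats anything after the last slash as a file name and does not account for query strings that contain slashes.

```python
import re

# Scheme and host, then at most three folder segments, then an optional file name.
MAX_THREE_DEEP = re.compile(r"^https?://[^/]+(?:/(?:[^/]+/){0,3}[^/]*)?$")


def within_three_folders(url):
    """Return True if the URL's path is at most three folders below the site root."""
    return MAX_THREE_DEEP.match(url) is not None
```

Filtering a URL list through within_three_folders and counting what is left gives a feel for how much a depth limit would shrink the document count.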

There are also free web crawler applications on the web. I won't name
them because they are not all Google products, but one of them would
hopefully help you get a document count. Besides, it's that Google
algorithm that everyone seems to want...

Just ideas that may be useful.
