First, that link says nothing about file sizes. I think you mean this:
http://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/admin_crawl/advanced_topics.html#1076612
Now, to understand these numbers, you have to understand what the
appliance does when it fetches files. The GSA can fetch many types of
files: HTML, PDF, MS Office documents, etc. But it can only index
HTML. The GSA has to download these files, then convert them to HTML,
then index that HTML.
So, there are two important numbers. First, what is the maximum size
of a file that will be downloaded? Second, after that file has been
converted to HTML, how much of that HTML will be indexed?
The Maximum File Sizes to Download setting under Host Load Schedule
controls the first number. There are actually two values here: one for
documents that are already HTML or text/plain, and another for
everything else, such as PDFs.
The Index Limit under Index Settings controls the second number.
So, on a GSA 7.0 box, the default values for Maximum File Sizes are
20MB for HTML and text, and 100MB for other files, if I recall
correctly. Let's say you have a 50MB PDF. The GSA will download that
file, then convert it to HTML. As part of the conversion, images
will be removed, which will probably reduce the size significantly.
If the HTML that results from the conversion is smaller than the
default index limit of 2.5MB, all the text will be indexed. If it's
larger than 2.5MB, only the first 2.5MB will be indexed. You can raise
the index limit to as much as 10MB to index more text, but that will
reduce the number of documents that can be indexed if you're anywhere
near the maximum storage capacity of your GSA (you probably aren't).
It's very rare to have more than 2.5MB of text anyway, though. That's
a lot of text!
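
To make the two-step logic concrete, here is a rough sketch in Python.
It is only a model of the behavior described above, not anything the
GSA exposes: the function name, the constants, and the choice of
1MB = 1024 * 1024 bytes are all my own assumptions, and the default
values are the ones I recalled above.

```python
# Hypothetical model of the two limits; not a GSA API.
MB = 1024 * 1024  # assumption: binary megabytes

MAX_DOWNLOAD_HTML_TEXT = 20 * MB  # Maximum File Sizes: HTML and text/plain
MAX_DOWNLOAD_OTHER = 100 * MB     # Maximum File Sizes: PDFs, Office, etc.
INDEX_LIMIT = int(2.5 * MB)       # Index Limit default; can be raised to 10MB


def indexed_bytes(file_size, is_html_or_text, converted_html_size,
                  index_limit=INDEX_LIMIT):
    """Return how many bytes of converted HTML would be indexed.

    Returns 0 when the file is too large to be downloaded at all.
    """
    if is_html_or_text:
        download_limit = MAX_DOWNLOAD_HTML_TEXT
    else:
        download_limit = MAX_DOWNLOAD_OTHER

    # Step 1: the download limit decides whether the file is fetched.
    if file_size > download_limit:
        return 0

    # Step 2: the index limit caps how much of the HTML is indexed.
    return min(converted_html_size, index_limit)
```

For the 50MB PDF, indexed_bytes(50 * MB, False, 1 * MB) gives back the
full 1MB of converted text, while a conversion that produced 4MB of
HTML would be cut off at the 2.5MB index limit.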
Now let's say you have another file: a 110MB PDF. The GSA will ask for
the file, and the web server will reply with an HTTP response
containing a Content-Length response header along with the actual file
in the response body. The GSA will see that the number in
Content-Length exceeds the 100MB maximum for non-HTML files, and it
will simply ignore the response. The file will not be indexed at all.
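
If it helps, the Content-Length check can be sketched like this in
Python. Again, this is just an illustration: should_skip is a made-up
name, and I don't know what the GSA actually does when the header is
missing or malformed, so treating those cases as "don't skip" is purely
an assumption on my part.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def should_skip(headers, max_download_bytes):
    """Return True if the declared Content-Length exceeds the limit.

    headers is a plain dict of response header names to values.
    """
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-length":
            try:
                return int(value) > max_download_bytes
            except ValueError:
                # Malformed value: assumed not skipped (a guess).
                return False
    # Header absent: assumed not skipped (a guess).
    return False
```

With the 110MB PDF, should_skip({"Content-Length": str(110 * MB)},
100 * MB) is True, so the response would be ignored.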
Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/