Indexing file size limit

toul...@gmail.com

Feb 7, 2014, 8:23:00 AM
to Google-Search-...@googlegroups.com
Could anyone please provide information on the file size limit for full-text indexing? Is it 30 MB or 10 MB? (At http://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/admin_crawl/advanced_topics.html#1076167 I found that it's 10 MB, and in some other documentation it is given as 30 MB.)

Also, please clarify how both these settings are different:
Crawl and Index > Index Settings
Crawl and Index > Host Load Schedule 

Thanks!

Mathias Bierl

Feb 7, 2014, 8:32:40 AM
to Google-Search-...@googlegroups.com
The setting under Host Load Schedule is the maximum size of a file that will be downloaded.
The setting under Index Settings is the maximum amount of text that will be indexed.

Dave Watts

Feb 7, 2014, 9:41:37 AM
to Google-Search-...@googlegroups.com
First, that link says nothing about file sizes. I think you mean this:

http://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/admin_crawl/advanced_topics.html#1076612

Now, to understand these numbers, you have to understand what the
appliance does when it fetches files. The GSA can fetch many types of
files: HTML, PDF, MS Office documents, etc. But it can only index
HTML. The GSA has to download these files, then convert them to HTML,
then index that HTML.

So, there are two important numbers. First, what is the maximum size
of a file that will be downloaded? Second, after that file has been
converted to HTML, how much of that HTML will be indexed?

The Maximum File Sizes to Download under Host Load Schedule controls
the first number. There are actually two numbers here - one for
documents that are already either HTML or text/plain, and another for
other documents like PDFs etc.

The Index Limit under Index Settings controls the second number.

So, on a GSA 7.0 box, the default values for Maximum File Sizes are
20MB for HTML and text, and 100MB for other files if I recall
correctly. Let's say you have a 50MB PDF. The GSA will download that
file, then convert it to HTML. As a part of the conversion, images
will be removed - that will probably significantly reduce the size of
the remainder. If the HTML that results from the conversion is less
than 2.5 MB, all the text will be indexed. If it's larger than 2.5 MB,
only the first 2.5MB will be indexed. You can change the index limit
to up to 10MB to index more text, but that will reduce the number of
documents that can be indexed if you're anywhere near the maximum
storage capacity of your GSA (you probably aren't). It's very rare to
have more than 2.5MB of text anyway, though - that's a lot of text!

Now let's say you have another file: a 110MB PDF. The GSA will ask for
the file, and the web server will reply with an HTTP response
containing a Content-Length response header along with the actual file
in the response body. The GSA will see that the number in
Content-Length exceeds what it will allow, and it will simply ignore
the response. The file will not be indexed at all.
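
The two checks described above can be sketched in a few lines of Python. This is only an illustration of the logic as explained in this post, not how the appliance is actually implemented. The function name and the constants are hypothetical; the default values are the ones recalled above for GSA 7.0, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Defaults as recalled in the post above for GSA 7.0 (not verified here).
MAX_DOWNLOAD_HTML_TEXT = 20 * MB   # HTML and text/plain documents
MAX_DOWNLOAD_OTHER = 100 * MB      # PDFs, Office documents, etc.
INDEX_LIMIT = int(2.5 * MB)        # how much converted HTML gets indexed


def indexed_bytes(content_length, converted_html_size, is_html_or_text):
    """Return how many bytes of converted HTML would be indexed (0 if skipped)."""
    if is_html_or_text:
        max_download = MAX_DOWNLOAD_HTML_TEXT
    else:
        max_download = MAX_DOWNLOAD_OTHER

    # First number: a file larger than the download limit is ignored entirely.
    if content_length > max_download:
        return 0

    # Second number: only the first INDEX_LIMIT bytes of the HTML are indexed.
    return min(converted_html_size, INDEX_LIMIT)


# A 50MB PDF that converts to 1MB of HTML is indexed in full.
print(indexed_bytes(50 * MB, 1 * MB, is_html_or_text=False))
# A 110MB PDF exceeds the download limit and is not indexed at all.
print(indexed_bytes(110 * MB, 1 * MB, is_html_or_text=False))
```

Note that the size of the original file only matters for the first check; once the file has been downloaded, only the size of the converted HTML counts.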

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/

Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on
GSA Schedule, and provides the highest caliber vendor-authorized
instruction at our training centers, online, or onsite.

toul...@gmail.com

Feb 13, 2014, 5:14:33 AM
to Google-Search-...@googlegroups.com
Thanks a lot, Dave, for the valuable information.
Is there any option to split large files into chunks for full-text indexing? We have a requirement to index a large number of larger files.
Also, for the files which are skipped, will the GSA index the title and metadata? Please advise.

Jeremy Garreau

Feb 13, 2014, 7:51:14 AM
to Google-Search-...@googlegroups.com
There is some nice documentation now; you can find all the limitations here:

Dave Watts

Feb 13, 2014, 11:16:25 AM
to Google-Search-...@googlegroups.com
> Is there any option to split the large files into chunks for full text
> indexing? We have the requirement to index a large number of files with
> greater size.

The GSA isn't going to split large files for you. There's nothing
stopping you from doing that yourself. But those numbers can be
changed, so you might not have to split files at all. If you did have
any files which exceeded either of the two limits' maximum values, you
could build a content feed to put them into the GSA.

And, as I mentioned previously, since the GSA only indexes text, the
actual size of the text is all that matters. Very large books fall
well within the limits of the GSA's default settings. If you take all
the text out of your 110 MB PDF, it's likely to be less than 1MB.
Everything else is just graphics and formatting.
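
If you did want to split the text yourself, here is a minimal sketch in Python, assuming you have already extracted the plain text from the document with some other tool. The function name split_text is hypothetical, and this is not a GSA feature; it only shows one way to cut text into pieces that each stay under a byte limit, breaking on whitespace.

```python
def split_text(text, limit_bytes):
    """Split text into chunks of at most limit_bytes (UTF-8), breaking on whitespace.

    Runs of whitespace are collapsed to single spaces in the output.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        if word_size > limit_bytes:
            raise ValueError("a single word exceeds the limit")
        # The extra byte accounts for the joining space in a non-empty chunk.
        extra = word_size + (1 if current else 0)
        if size + extra > limit_bytes:
            # Close the current chunk and start a new one with this word.
            chunks.append(" ".join(current))
            current, size = [word], word_size
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


# Example with a tiny limit so the split is easy to see.
for chunk in split_text("aa bb cc", 5):
    print(chunk)
```

Each chunk would then have to be published as its own document (with its own URL) for the GSA to index it, which is up to you to arrange.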

> Also, for the files which are skipped, will it index the title and metadata?

No, it will not.

Dave Watts, CTO, Fig Leaf Software
1-202-527-9569

Pablo Solera

Feb 14, 2014, 3:33:41 AM
to Google-Search-...@googlegroups.com
Just FYI, as it is only slightly related to the question.
A long time ago I came across this additional limitation:
Phrase searches work only for the first 300 KB of an indexed document.
It is in the documentation but not in the new Specifications and Usage Limits.


Ivar Ekman

Feb 15, 2014, 2:32:09 AM
to Google-Search-...@googlegroups.com
The latest GSA documentation has a specific place for documenting the limitations. See https://support.google.com/gsa/answer/4411411

-Ivar

toul...@gmail.com

Feb 16, 2014, 11:40:54 PM
to Google-Search-...@googlegroups.com
Thanks a lot, Dave, for the clarification.

Nagaraj Jayaraman

Feb 24, 2014, 2:22:22 PM
to Google-Search-...@googlegroups.com
Thanks a lot, Dave, for explaining this clearly.