Malformed HTTP header: empty content while crawling MediaWiki

James Cook

Mar 11, 2009, 3:16:59 PM
to Google Search Appliance/Google Mini
I just upgraded a GSA 1001 to software 5.2 and recrawled our document
set. We use MediaWiki internally for documentation. I notice that
approximately 1/3 of the URLs the GSA attempts to crawl show up in
crawl diagnostics as "Error: Malformed HTTP header: empty content".
The other 2/3 are crawled correctly.

We use HTTP auth for this site and I have the GSA providing my login
credentials for pattern https://wiki.lindenlab.com/

Typical web server logs show (for an affected page):
wiki.lindenlab.com 64.154.223.153 - - [11/Mar/2009:11:57:43 -0700]
"GET /wiki/Agent_Inventory_Service/test_plan HTTP/1.0" 401 540 "-"
"gsa-crawler (Enterprise; S5-MWT8F5K6T2NAA; ja...@lindenlab.com)" -
wiki.lindenlab.com 64.154.223.153 - james [11/Mar/2009:11:57:43 -0700]
"GET /wiki/Agent_Inventory_Service/test_plan HTTP/1.0" 200 5618 "-"
"gsa-crawler (Enterprise; S5-MWT8F5K6T2NAA; ja...@lindenlab.com)" -

Loading the page with curl shows normal results.

If I read the log correctly, the second request returns status 200
with only 5618 bytes. The document should be about 22 KB in size.
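
For anyone comparing their own logs, here is a minimal sketch of pulling the status and byte count out of entries like the two above. It assumes the field order shown in this post (request line in quotes, then status, then size); the regular expression and the `parse` helper are illustrative, not part of any GSA or web server tool.

```python
import re

# The two log entries from this post, joined back into single lines.
LOG_LINES = [
    'wiki.lindenlab.com 64.154.223.153 - - [11/Mar/2009:11:57:43 -0700] '
    '"GET /wiki/Agent_Inventory_Service/test_plan HTTP/1.0" 401 540 "-" '
    '"gsa-crawler (Enterprise; S5-MWT8F5K6T2NAA; ja...@lindenlab.com)" -',
    'wiki.lindenlab.com 64.154.223.153 - james [11/Mar/2009:11:57:43 -0700] '
    '"GET /wiki/Agent_Inventory_Service/test_plan HTTP/1.0" 200 5618 "-" '
    '"gsa-crawler (Enterprise; S5-MWT8F5K6T2NAA; ja...@lindenlab.com)" -',
]

# Assumed layout: "METHOD path HTTP/x.y" status size
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse(line):
    """Return (path, status, size) for one log line, or None if it does not match."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    size = None if match.group("size") == "-" else int(match.group("size"))
    return match.group("path"), int(match.group("status")), size


for entry in LOG_LINES:
    print(parse(entry))
```

The first entry is the unauthenticated request answered with 401, and the second is the retry with credentials answered with 200 and 5618 bytes.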

I do not know if this is a new behavior with software 5.2, or if I
just didn't notice it before with 5.0. Has anyone seen this issue?

James
ja...@lindenlab.com

James Cook

Mar 11, 2009, 5:26:50 PM
to Google Search Appliance/Google Mini
Note: This problem appears specific to software 5.2. I tried crawling
the same MediaWiki with 5.0 on another GSA and I do not see this
problem.

Unfortunately, my attempt to revert my 5.2 upgrade failed and I appear
to be stuck with it. Does anyone know if there's a way to downgrade
5.2 to 5.0 after the initial upgrade?

James

James Cook

Mar 12, 2009, 12:49:31 AM
to Google Search Appliance/Google Mini
The smaller-than-expected document size is probably because the
documents are gzipped. Still no clues on this end.
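
A quick way to see why a gzipped response can be several times smaller than the page itself: the sketch below compresses a made-up, repetitive HTML table (a hypothetical stand-in for a wiki page, not the actual document) and prints both sizes. The exact ratio depends entirely on the content.

```python
import gzip

# Hypothetical stand-in for a wiki page: a few hundred rows of repetitive markup.
ROW = '<tr><td class="step">Step {n}</td><td>Verify inventory item {n} loads</td></tr>\n'
raw = "".join(ROW.format(n=i) for i in range(300)).encode("utf-8")

# Compress the body the way a server applying gzip content encoding would.
compressed = gzip.compress(raw)

print("uncompressed bytes:", len(raw))
print("gzip-compressed bytes:", len(compressed))
```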

James

miguev

Mar 12, 2009, 6:00:58 AM
to Google Search Appliance/Google Mini

Hi James,

There is no way to downgrade from 5.2 to 5.0, but you can make it work
anyway. You have actually spotted the problem: gzip compression.
The appliance says it accepts gzip encoding but doesn't really handle
it, so you end up with empty pages. Try adding the header
"Accept-Encoding: foo" in the box under Crawl and Index > HTTP Headers:
http://code.google.com/apis/searchappliance/documentation/52/help_gsa/crawl_headers.html
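
Why an unknown value like "foo" helps can be sketched with a simplified model of server-side content negotiation. The `choose_encoding` function below is hypothetical and is not the code of any particular web server or of the appliance; it only illustrates that a server compresses when the request lists a coding it supports, and otherwise sends the body uncompressed.

```python
def choose_encoding(accept_encoding, supported=("gzip",)):
    """Pick a content coding for a response (simplified, illustrative model).

    Returns the first supported coding listed in the Accept-Encoding header
    with a non-zero q-value, or "identity" (no compression) otherwise.
    """
    if not accept_encoding:
        return "identity"
    for item in accept_encoding.split(","):
        parts = item.strip().split(";")
        coding = parts[0].strip().lower()
        q = 1.0  # default weight when no q-value is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not acceptable"
        if coding in supported and q > 0:
            return coding
    return "identity"


# A crawler advertising gzip gets a compressed body.
print(choose_encoding("gzip, deflate"))
# A crawler sending an unknown coding gets the plain body.
print(choose_encoding("foo"))
```

With "Accept-Encoding: foo" the server finds nothing it supports in the list, so the appliance receives the uncompressed page.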