I just upgraded a GSA 1001 to software 5.2 and recrawled our document
set. We use MediaWiki internally for documentation. I notice that
approximately 1/3 of the URLs the GSA attempts to crawl show up in
crawl diagnostics as "Error: Malformed HTTP header: empty content".
The other 2/3 are crawled correctly.
We use HTTP auth for this site and I have the GSA providing my login
credentials for pattern
https://wiki.lindenlab.com/
Typical web server logs show (for an affected page):
wiki.lindenlab.com 64.154.223.153 - - [11/Mar/2009:11:57:43 -0700]
"GET /wiki/Agent_Inventory_Service/test_plan HTTP/1.0" 401 540 "-"
"gsa-crawler (Enterprise; S5-MWT8F5K6T2NAA;
ja...@lindenlab.com)" -
wiki.lindenlab.com 64.154.223.153 - james [11/Mar/2009:11:57:43 -0700]
"GET /wiki/Agent_Inventory_Service/test_plan HTTP/1.0" 200 5618 "-"
"gsa-crawler (Enterprise; S5-MWT8F5K6T2NAA;
ja...@lindenlab.com)" -
Loading the page with curl shows normal results.
I believe the second status "200" is returning 5618 bytes? The
document should be 22KB in size.
I do not know if this is a new behavior with software 5.2, or if I
just didn't notice it before with 5.0. Has anyone seen this issue?
James
ja...@lindenlab.com