> "Retrying URL: Host unreachable while trying to fetch robots.txt" this
> error is a pain in the ass. I am suffering from the same error when
> trying to crawl through a web site.
One thing I've noticed with robots.txt: It's important that the web
server provide a definitive status in response - preferably a 200 or
404. (AuthN/AuthZ notwithstanding.)
I have encountered many a server where this is _not_ the case. Once
the server is adjusted to return one of those statuses, the "Host
unreachable while trying to fetch robots.txt" error goes away.
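For what it's worth, here's one quick way to see what status your
server actually returns for robots.txt - a minimal Python sketch (the
host name is a placeholder; point it at your own server):

```python
# Minimal check: does the server give a definitive status for robots.txt?
# (200 or 404 are the answers a crawler can act on decisively.)
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

DEFINITIVE = {200, 404}

def is_definitive(status):
    """True if the status settles the robots.txt question outright."""
    return status in DEFINITIVE

def robots_status(host):
    """Fetch http://<host>/robots.txt and return its HTTP status, or None."""
    req = Request("http://%s/robots.txt" % host)
    try:
        return urlopen(req, timeout=10).getcode()
    except HTTPError as e:
        return e.code  # 401/403/etc. - a status, but not a definitive one
    except URLError:
        return None    # timeout, DNS failure, connection refused, ...
```

e.g. is_definitive(robots_status("my.site")) - if that comes back
False, the GSA is likely to keep retrying.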
--
Joe D'Andrea
Liquid Joe LLC
www.liquidjoe.biz
+1 (908) 781-0323
> "Return Code 401, should be 200"
Ahh! Now, where are you spotting that status? (In your
browser/user-agent, in the GSA Diags, elsewhere?)
> So it's something with the authentication. I added permission for my
> crawling account to the robots.txt, which didn't make a difference.
To clarify - do you mean you added gsa-crawler to robots.txt? Or
something else ... ?
> Do you know where else I might need to add permission?
Normally I would direct you to "Crawl and Index > Crawler Access" but
... let's double-check the connector docs:
http://snurl.com/743rm [code_google_com]
If you want to go the quick-and-dirty route (or perhaps just as a
bonus reality check), can you force IIS to serve just robots.txt w/o
_any_ AuthN? This may not be a best practice, but in a pinch it might
do the trick.
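If you happen to be on IIS 7 or later, one way to carve out just
robots.txt might look like the web.config fragment below. (This is a
sketch, not gospel: on IIS 6 / plain WS2K3 this is a per-file setting
in IIS Manager instead, and the authentication section has to be
unlocked in applicationHost.config for a <location> override to work.)

```xml
<!-- Hypothetical web.config fragment: serve /robots.txt anonymously
     while the rest of the site stays behind AuthN. IIS 7+ only. -->
<location path="robots.txt">
  <system.webServer>
    <security>
      <authentication>
        <anonymousAuthentication enabled="true" />
      </authentication>
    </security>
  </system.webServer>
</location>
```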
Ah-ha!
"The Google Search Appliance cannot crawl the content unless a
robots.txt file is present in the SharePoint site's root directory.
Ensure that you create a robots.txt file and ensure that the file is
public."
So you _do_ need to make robots.txt public, at least if I understand
the above correctly.
> Something else just occurred to me. I made my crawler account a local
> administrator yesterday, and Administrators had permission to that
> file (and probably all files), which is why it didn't make a
> difference when I set permissions (again) on the file for my crawler
> account. Something else is up.
I think you're on to something. Also from the doc:
"The Microsoft SharePoint connector and the Google Search Appliance
require user credentials for traversal and indexing. Google recommends
that you use a single user account for both."
To your question about how to get a 200 or 404 response for robots.txt
... perhaps try this?
http://www.starznet.co.uk/sharepoint/blog/RobotstxtinSharePoint.htm
--
Joe D'Andrea
Good - that's where you want it - at the docroot (even if you start
your crawl at a lower level).
> How do I make robots.txt public?
See my previous msg (we just missed each other) - that might do the trick!
--
Joe D'Andrea
> I did not reset IIS last night, so I don't know if that would resolve
> this, but I just noticed under my Crawl Diagnostics for the Sharepoint
> URL, at 10:36 a.m. yesterday (I guess this was the next recrawl after
> enabling anonymous access), there's an "Info: Redirected URL" and then
> "Excluded: Authentication Failed."...
Progress! Sort of. :)
> ... well, at least the "Retrying URL:... robots.txt" message is gone, but it seems
> like a step in the wrong direction. :P
At least now you're arriving at a more definitive end result.
> Did I enable anonymous access incorrectly? Should I just assign
> 'Everyone' Read permission?
Looks like robots.txt is still being blocked by AuthN. So long as
you've set the robots.txt permissions to allow anonymous read, the next
step is to try an IISReset. THEN, try to get robots.txt using your web
browser - don't even bother with the GSA for this test. If you get the
file without a login (note: make sure you aren't already logged in),
then I think you're set!
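Here's that same browser test sketched in code, if it helps - an
anonymous request (no credentials attached), with the status translated
into a next step. The URL is a placeholder for your own site:

```python
# Anonymous fetch test: no credentials are sent, so a 200 means the
# file really is publicly readable.
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def anonymous_status(url):
    """GET the URL without credentials; return the HTTP status, or None."""
    try:
        return urlopen(url, timeout=10).getcode()
    except HTTPError as e:
        return e.code
    except URLError:
        return None

def verdict(status):
    """Translate an anonymous-request status into a next step."""
    if status == 200:
        return "anonymous read works - the GSA should be able to get it too"
    if status in (401, 403):
        return "still blocked by AuthN - recheck permissions, then IISReset"
    return "no definitive answer - check the server or network path"
```

e.g. print(verdict(anonymous_status("http://my.site/robots.txt")))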
- Joe
Arrrgh. OK. Reality check Q: Without being logged in from the browser,
you can reach robots.txt from the docroot - that is:
http://my.site/robots.txt
and not:
http://my.site/some/directory/robots.txt
What else ... does the GSA show any change in crawl diagnostics with
respect to robots.txt?
- Joe
> After the IISReset, I tried logging out of Sharepoint (since it
> automatically logs you in according to your computer/network login)
> and I was able to access it.
Ahh, we need to try it from a location that isn't logged in (or can't
get automatically logged in).
It sounds like you did that though:
> I tried it from the
> server, with the enhanced browser security, which asked me to log in.
> I hit cancel, (got the "You are not authorized to view this page" 401
> error), and changed the end of the URL to robots.txt.)
Ahh, good. At least you can get to it now, which means the GSA should
also be able to.
> Nothing. I see one retrieval error on <servername> until I drill down
> to servername\mainsite, at which point I see "Excluded: Authentication
> Failed"...
So we still have an authentication issue. :\
> I tried the URL test on Network Settings, and http://serverFQDN/robots.txt
> came through OK. The short URL came through invalid (not surprising,
> but I was just testing) ...
Auugh! I know - this is frustrating.
I'm crossing fingers that we find out it's something trivial we both
missed, and we're both going to virtually smack our foreheads. (From
the "it's easy when you know how" department!)
Hmm ... remind me, did we try this?
http://www.findabilityproject.org/?p=228
Do you need FQDN resolution for host names enabled? (It's disabled by default.)
http://snurl.com/95t1x [code.google.com]
Try this to force a full recrawl (slightly different from before):
http://snurl.com/95t30 [code.google.com]
- Joe
Hmm. I wonder if there's a permissions problem on that folder
(preventing the XML from being generated)?
>> Do you need FQDN resolution for host names enabled? (It's disabled by default.)
>> http://snurl.com/95t1x [code.google.com]
>
> Can't do this yet, since there's no such xml file. :(
Aye. :(
- Joe
Auugh! OK, that's frustrating. Thwarted by conflicting docs. :(
Bonus points for your stick-to-it-iveness on this. Keep us posted!
(Wow, I have a lot of catching up to do with the posts. hehe.)
--
Joe D'Andrea
> *sigh* No dice on the JDK.
Auuugh! Somehow I think there must be a hole in the wall nearby ...
about fist size. :(
OK. Broader RFC going out here (or RFH - request for help?). Anyone
else ever have to work through issues when getting Sharepoint and the
GSA to play nice?
- Joe
I feel like we should just start fresh. Maybe there's some silly thing
we missed. :-o
- Joe
> Ok, so I was reviewing the document from my original post, and I
> noticed the supported operating systems. I'm not very familiar with
> differences between WS2K3 and R2; I think it's mainly features, but
> might it include something, or is it enough that it could cause
> problems? The document specifies R2 in supported OSs, and we're
> running WS2K3 Enterprise (but not R2, to my knowledge).
I have access to WS2K3, but it's on an intranet - no GSA/Mini in the
vicinity. Otherwise I'd try that myself.
I'm still thinking we have a permissions problem somewhere along the
way, but I haven't been able to put my finger on it thus far.
- Joe