Retrying URL: "Host unreachable while trying to fetch robots.txt

MAK

Oct 8, 2007, 4:51:31 PM10/8/07
to Google Search Appliance
Many of the sites I am trying to crawl return the following error...
Retrying URL: "Host unreachable while trying to fetch robots.txt."

Documents located in the same folder will crawl while others will
return the error.


Net Diag Returns...

DNS Server xxx TCP connection timed out - ACL'ed out?
DNS Server xxx OK
DNS Server xxx OK
NTP Server xxx OK
NTP Server xxx OK
SMTP Server xxx Host not Responding
SMTP Server xxx OK
Test URL http://www.xxx.com/robots.txt Host not Responding
Test URL http://www.xxx.com/robots.txt OK - pingable
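As a cross-check (just a sketch, nothing GSA-specific; www.xxx.com stands in for the real host), the same DNS and reachability questions the Net Diag asks can be reproduced with plain Python from another machine on the appliance's network:

import socket

# Placeholder host; substitute the site the appliance reports as unreachable.
HOST = "www.xxx.com"

# Does the name resolve with the resolvers this machine is using?
try:
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(HOST, 80)})
    print("DNS OK:", ", ".join(addresses))
except socket.gaierror as exc:
    print("DNS lookup failed:", exc)

# Can we actually open a TCP connection to port 80?
try:
    with socket.create_connection((HOST, 80), timeout=10):
        print("TCP connect to port 80 OK")
except OSError as exc:
    print("TCP connect failed:", exc)

If this passes from another box but the Net Diag still fails, that points at something specific to the appliance's network path (DNS ACLs, switch port, cabling) rather than the web server itself.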

Michael Kamara

Oct 8, 2007, 6:29:27 PM10/8/07
to Google Search Appliance
Also, some documents will crawl, then receive the robots.txt error on a second crawl, then crawl again.

Primož Lah

Oct 9, 2007, 3:31:13 AM10/9/07
to Google-Sear...@googlegroups.com
Try pinging your Google appliance from the opposite direction if you can - maybe a bad Ethernet cable somewhere.
...just an idea.
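A small sketch along those lines, run from the web-server side (the appliance hostname below is only a placeholder), could exercise both ICMP and a plain TCP connect back toward the appliance:

import socket
import subprocess

# Placeholder: the appliance's hostname or IP as seen from the web server.
GSA_HOST = "gsa.example.internal"

# Standard Linux ping; adjust the flags for your OS if needed.
subprocess.run(["ping", "-c", "4", GSA_HOST], check=False)

# ICMP is sometimes filtered, so also try a plain TCP connect to port 80,
# where the appliance's search front end normally answers.
try:
    with socket.create_connection((GSA_HOST, 80), timeout=5):
        print("TCP connect to the appliance on port 80 OK")
except OSError as exc:
    print("TCP connect failed:", exc)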


From: Google-Sear...@googlegroups.com [mailto:Google-Sear...@googlegroups.com] On Behalf Of Michael Kamara
Sent: Tuesday, October 09, 2007 12:29 AM
To: Google Search Appliance
Subject: Re: Retrying URL: "Host unreachable while trying to fetch robots.txt

brian

Oct 9, 2007, 10:14:30 AM10/9/07
to Google Search Appliance
Hmmm,

This is generally not good:


>Test URL http://www.xxx.com/robots.txt Host not Responding

Do you get this exact same result every time, or is it intermittent?

The fact that it is intermittent during the crawl would also point to
some network issues. Some things to look at:

1. Do you see the requests coming through in your web server logs?
2. What happens when you request http://www.xxx.com/robots.txt from a
browser? Clear your cache and restart your browser first to make sure
the page is not behind any security you have already passed. The GSA
must receive a 200 or 404 to crawl the site (see the sketch after this
list).

http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Introduction.html#robots

If you need to log in, then you need to make sure that you add the
credentials for that URL pattern to Crawler Access.
3. Check your network speed settings on the back of the appliance
(orange port) and make sure they match your switch settings. You
could also try auto-negotiate or another setting to see if it helps.
4. Take a tcpdump between your web server and your GSA. Do you see any
packets getting dropped?
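A minimal sketch of the check in item 2, assuming nothing more than Python on any machine that can reach the web server (the URL is the same placeholder used above), would look something like this:

import urllib.error
import urllib.request

# Placeholder URL; substitute the site the appliance is failing on.
ROBOTS_URL = "http://www.xxx.com/robots.txt"

try:
    with urllib.request.urlopen(ROBOTS_URL, timeout=10) as resp:
        status = resp.getcode()
except urllib.error.HTTPError as exc:
    status = exc.code          # e.g. 404, which is still fine for crawling
except urllib.error.URLError as exc:
    raise SystemExit(f"Could not fetch {ROBOTS_URL}: {exc.reason}")

if status in (200, 404):
    print(f"Got {status} - the GSA should accept this and crawl the site")
else:
    print(f"Got {status} - the GSA needs a 200 or 404 before it will crawl")

Anything other than a 200 or 404 here (a 401/403 from a proxy, for instance) points at the kind of security that needs credentials added under Crawler Access.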

Let us know what you find.

Brian


On Oct 9, 7:29 am, "Michael Kamara" <mikekam...@gmail.com> wrote:
> Also, some documents will crawl, then receive the robots.txt error on a
> second crawl, then crawl again.
>
