[Setting Followup-To: news:comp.infosystems.www.misc.]
>> Any idea why this site (or, rather, http://mbed.org/ it redirects
>> to) reports 403 Forbidden when there's User-Agent: Lynx... in the
>> HTTP request header? Also strange is that
>> http://www.wescottdesign.com/ results in 406 Not Acceptable in such
>> a case...
> Both sites seem to dislike that Lynx has "libwww" in the User-Agent
> string. Seems to be a crude anti-robot measure.
ACK, thanks. I'd suspected something like that, but stopped
short of actually bisecting Lynx's User-Agent: string myself.
One more site to join the league is http://blog.blitz.io/.
What's really surprising in this case, however, is that such a
configuration doesn't prevent a /robot proper/ from accessing
these sites! Consider, e. g.:
$ wget -O /dev/full -- http://www.wescottdesign.com/ http://mbed.org/
...
Resolving www.wescottdesign.com (www.wescottdesign.com)... 137.118.32.70
Connecting to www.wescottdesign.com (www.wescottdesign.com)|137.118.32.70|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 27 Mar 2013 11:22:39 GMT
Server: Apache
...
Resolving mbed.org (mbed.org)... 217.140.101.20
Connecting to mbed.org (mbed.org)|217.140.101.20|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Server: nginx/1.1.19
Date: Wed, 27 Mar 2013 11:22:40 GMT
...
$
OTOH, I've indeed seen many sites which deny access to Wget
specifically, but not to Lynx. For instance:
http://arxiv.org/
http://www.datasheetcatalog.org/
http://www.gutenberg.org/
Another strange "crawling prevention" measure is to check
Referer:, which is done by, e. g.:
http://www.classicdosgames.com/
http://www.download-central.ws/
This is easy to overcome by giving the --header='Referer: ...'
option to Wget.
(Although I'm unsure whether that was the intended behavior for
download-central.ws, or just some kind of misconfiguration.)
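To illustrate the Referer: workaround, here is a minimal sketch. It
assumes (and this is only a guess on my part) that such a site is
satisfied when Referer: points at its own front page. The names
site_root and fetch_with_referer are made up for this example.

```shell
# Derive "scheme://host/" from a URL, to serve as the Referer: value.
# Uses only POSIX parameter expansion, so no external tools are needed.
site_root () {
    scheme=${1%%://*}          # everything before "://"
    rest=${1#*://}             # everything after "://"
    printf '%s://%s/\n' "$scheme" "${rest%%/*}"
}

# Hypothetical helper: fetch URL $1 while claiming to arrive from the
# site's own front page.  Assumption: that is what the check expects.
fetch_with_referer () {
    wget --header="Referer: $(site_root "$1")" -- "$1"
}
```

With these defined, fetch_with_referer followed by the URL of a page
on one of the sites above would send the request with the site root as
Referer:. Whether a given site accepts that value has to be tried.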
> I could duplicate the problem with
> lynx -useragent='libwww'
FWIW, $ wget -U libwww gives the same result.
> But not with either of:
> lynx -useragent='libww'
> lynx -useragent='ibwww'
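The same bisection can be repeated with Wget instead of Lynx. Below
is a rough sketch, not a polished tool: last_status and probe are
names invented here, and the sed pattern assumes the "HTTP/x.y NNN"
status lines that wget -S prints.

```shell
# Reduce "wget -S" output to the last HTTP status code seen.  Taking
# the last one means redirects are followed to the final answer.
last_status () {
    sed -n 's|^ *HTTP/[0-9.]* \([0-9][0-9][0-9]\).*|\1|p' | tail -n 1
}

# Hypothetical helper: request URL $2 with $1 as the User-Agent: and
# print the resulting status code.  The body is discarded; wget writes
# the server response to stderr, hence the 2>&1.
probe () {
    wget -S -U "$1" -O /dev/null -- "$2" 2>&1 | last_status
}
```

Running probe with libwww, libww and ibwww in turn against one of
the sites should then show which strings trigger the refusal. The
results quoted above are what one would expect from a plain substring
match on "libwww", though the actual server rules are unknown to me.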
PS. I think I may want to create a list of such "doing silly things"
Web sites...