>>>>> Ivan Shmakov <iv...@siamics.net> wrote:
[Cross-posting to news:comp.infosystems.www.misc, as the issue
at hand is arguably more related to WWW than to Unix Shell.]
>>> The remote appears to filter by User-Agent:.
>> And what is 'xnyL' ?
> 'Lynx' backwards. But I'm also interested in the rationale behind it.
The rationale behind filtering by User-Agent:, or how I found
it out?
Per my observations, sites attempt to filter by User-Agent:
to mitigate certain kinds of "abuse," such as unsanctioned
mirroring, or recursive retrieval in general (which is part of
the operation of, say, email harvesters.)  As such, disallowing
"Wget" -- a popular recursive downloading and mirroring tool --
is not uncommon; I've seen it done at such domains as arxiv.org,
classiccmp.org, and datasheetcatalog.org.  The proper solution
is, of course, to use the /robots.txt control file instead.
(Granted, GNU Wget can be configured to ignore that file -- but
it can just as easily be configured to send an arbitrary
User-Agent: string; my long-time preference there is, and I'm
not trying to surprise anyone, "tegW".)
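Both knobs are stock GNU Wget options; a sketch of such a
(mis)configured invocation, with example.org merely standing in
for the actual target:

    $ wget --mirror -e robots=off --user-agent='tegW' \
          http://example.org/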
Personally, I consider it a far worse issue when recursive
retrieval software misidentifies itself as a common Web user
agent.  In my experience, a number of such requests originate
from 202.46.48.0/20.  For example:
  202.46.54.133 - - 2016-10-15 21:27:23 +0000 "GET / HTTP/1.1"
      200 2546 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64)
      AppleWebKit/537.36 (KHTML, like Gecko)
      Chrome/45.0.2454.93 Safari/537.36"
Worse still, even the requests from that same network that
identify as "Baiduspider/2.0" in my logs never seem to request
/robots.txt.  I've therefore decided to deny access to certain
sections of my Web sites based on combinations of User-Agent:
and request source IP.
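A sketch of one such combined rule, assuming Apache httpd 2.4
with mod_authz_core (the directory path and the pattern below
are placeholders, not my actual configuration):

    <Directory "/srv/www/restricted">
        <RequireAll>
            Require all granted
            # Deny only when BOTH the User-Agent: matches
            # AND the request comes from the suspect network.
            Require not expr "%{HTTP_USER_AGENT} =~ /Baiduspider/ && -R '202.46.48.0/20'"
        </RequireAll>
    </Directory>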
... Another popular option for ad-hoc crawlers is the Net::HTTP
library for Perl, commonly identified by "libwww-perl" in the
User-Agent: header.  Incidentally, Lynx has the very same
"libwww" substring in its own default User-Agent: value, leading
to what I presume are occasional "false positives."
Which is one of the reasons I tend to use somewhat random
User-Agent: strings for my long-running Lynx sessions.  Thus,
when I could access the site in question perfectly well from one
such configured Lynx instance, yet was refused access when
running $ lynx --dump from the command line, User-Agent:
filtering was my guess right away.
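Overriding the default is a one-liner; -useragent= is a stock
Lynx option, and example.org again merely stands in for the site
in question:

    $ lynx -dump -useragent='xnyL' http://example.org/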