robot.txt


n...@netcom.com

Aug 6, 1995, 3:00:00 AM

Upon close examination of my log I can see there's some midnight marauder
(of the programmatic kind) going around looking for "robots.txt".
Just what exactly is it looking for in that robots.txt file? I'd like to
feed the little devils.

198.5.208.1 - - [01/Aug/1995:20:05:12 -0400] "GET /robots.txt HTTP/1.0" 404 -
204.62.245.32 - - [06/Aug/1995:10:24:27 -0400] "GET /robots.txt HTTP/1.0" 404 -

Name: corp-uu.infoseek.com
Address: 198.5.208.1

*** netcom7.netcom.com can't find 204.62.245.32: Non-existent host/domain

--
Nancy Milligan Milligan Consulting Services
n...@netcom.com System Administration, Perl/C/shell programming
619/260-1442 Internet Connectivity, Firewalls, Usenet, WWW
San Diego, CA http://nmcs.clever.net

Ed Costello

Aug 6, 1995, 3:00:00 AM
In <npmDCw...@netcom.com>, n...@netcom.com wrote:
> Upon close examination of my log I can see there's some midnight marauder
> (of the programmatic kind) going around looking for "robots.txt".
> Just what exactly is it looking for in that robots.txt file? I'd like to
> feed the little devils.

robots.txt is a de facto standard file that webwalkers/spiders/etc. look
for on web sites to determine what to index on a site (or whether to
index the site at all).
The file looks like:

#This is a file retrieved by webwalkers, a.k.a. spiders, that
#conform to a de facto standard.
#See <URL:http://web.nexor.co.uk/mak/doc/robots/norobots.html>
#The webmaster for this site is webm...@www.ibm.com
#Format is:
# User-agent: <name of spider>
# Disallow: <nothing> | <path>
#---------------------------------------------------------------------
# the following prevents access to /misc for all spiders
User-agent: *
Disallow: /misc

#EOF

See <URL:http://web.nexor.co.uk/mak/doc/robots/norobots.html> for more
information.
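For anyone curious how a crawler actually applies such a file, here's a
minimal sketch using Python's standard urllib.robotparser module (the
host name and paths are my own illustration, not taken from this thread):

```python
# Sketch: check whether a crawler may fetch a path, given a robots.txt
# ruleset like the sample file above. Uses Python's stdlib robot parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the file's lines directly; read() would fetch the
# file from a server instead.
rp.parse([
    "User-agent: *",
    "Disallow: /misc",
])

# Disallow matching is by path prefix, so anything under /misc is out.
print(rp.can_fetch("MySpider", "http://example.com/misc/old.html"))  # False
print(rp.can_fetch("MySpider", "http://example.com/index.html"))     # True
```

A polite spider would run this check before every request and skip any
URL for which can_fetch() returns False.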

--
//name dd -ed costello
//email dd cost...@netcom.com

Lynn Bry

Aug 7, 1995, 3:00:00 AM

n...@netcom.com (n...@netcom.com) writes:
> Upon close examination of my log I can see there's some midnight marauder
> (of the programmatic kind) going around looking for "robots.txt".
> Just what exactly is it looking for in that robots.txt file? I'd like to
> feed the little devils.

That's just what I thought when I started noticing the error messages
piling up in the error_log from requests for 'robots.txt'.

It is, however, a file that good-natured 'bots will check before sifting
through the contents of your server, so they know whether they are
welcome and, if so, whether there are any places they should steer clear
of (randomly generated pages, or things that have the potential for
infinite links).

The file should be of the format:

User-agent: <robot's name>
Disallow: <directories>

with one such record for each robot whose access to your system you wish
to limit.
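The matching behind those records is just a per-agent prefix comparison;
a toy sketch (the rule table, agent names, and paths below are made up
for illustration):

```python
# Toy sketch of robots.txt matching: a path is disallowed when it
# starts with any Disallow value in the record for that agent, falling
# back to the "*" wildcard record for agents without their own entry.
rules = {
    "*": ["/misc", "/cgi-bin"],   # hypothetical default disallow list
    "InfoSeek": [],               # hypothetical: this bot gets free rein
}

def allowed(agent, path):
    # rules.get(agent) returns [] for InfoSeek (no restrictions), and
    # falls through to the "*" record for unknown agents.
    disallows = rules.get(agent, rules.get("*", []))
    return not any(path.startswith(d) for d in disallows)

print(allowed("InfoSeek", "/misc/a.html"))  # True
print(allowed("SomeBot", "/misc/a.html"))   # False
print(allowed("SomeBot", "/index.html"))    # True
```

Note the prefix semantics: Disallow: /misc shuts out /misc/a.html and
anything else whose path begins with /misc.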

> 198.5.208.1 - - [01/Aug/1995:20:05:12 -0400] "GET /robots.txt HTTP/1.0" 404 -
> 204.62.245.32 - - [06/Aug/1995:10:24:27 -0400] "GET /robots.txt HTTP/1.0" 404 -
>
> Name: corp-uu.infoseek.com
> Address: 198.5.208.1
>
> *** netcom7.netcom.com can't find 204.62.245.32: Non-existent host/domain
>

-ln

