
Guy Macon on the new Google/Yahoo/Microsoft extended ROBOTS.TXT standard


Guy Macon

Jun 16, 2008, 9:31:54 AM


Col Steve Austin Ret wrote:
>
>kenneth wrote:
>
>>Disallow: /*?*
>>Disallow: /*?
>
>and here I thought robots.txt didn't include wildcards

You thought wrong.

Google, Yahoo!, and Microsoft have agreed upon a standard Robots
Exclusion Protocol with wildcard support. See references below.

From [ http://www.google.com/support/webmasters/bin/answer.py?answer=40367 ]:

|
| I don't want to list every file that I want to block. Can
| I use pattern matching?
|
| Yes, Googlebot interprets some pattern matching. This is an
| extension of the standard, so not all bots may follow it.
|
| Matching a sequence of characters using *
|
| You can use an asterisk (*) to match a sequence of
| characters. For instance, to block access to all
| subdirectories that begin with private, you could use the
| following entry:
|
| User-agent: Googlebot
| Disallow: /private*/
|
| To block access to all URLs that include a question mark
| (?), you could use the following entry:
|
| User-agent: *
| Disallow: /*?
|
| Matching the end characters of the URL using $
|
| You can use the $ character to specify matching the end of
| the URL. For instance, to block any URLs that end with .asp,
| you could use the following entry:
|
| User-agent: Googlebot
| Disallow: /*.asp$
|
| You can use this pattern matching in combination with the
| Allow directive. For instance, if a ? indicates a session
| ID, you may want to exclude all URLs that contain them to
| ensure Googlebot doesn't crawl duplicate pages. But URLs
| that end with a ? may be the version of the page that you do
| want included. For this situation, you can set your
| robots.txt file as follows:
|
| User-agent: *
| Allow: /*?$
| Disallow: /*?
|
| The Disallow: /*? line will block any URL that includes a ?
| (more specifically, it will block any URL that begins with
| your domain name, followed by any string, followed by a
| question mark, followed by any string).
|
| The Allow: /*?$ line will allow any URL that ends in a ?
| (more specifically, it will allow any URL that begins with
| your domain name, followed by a string, followed by a ?,
| with no characters after the ?).
|

From the Google Webmaster Central Blog: Improving
on Robots Exclusion Protocol
[ http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html ]

From the Official Google Blog: Controlling how
search engines access and index your website
[ http://googleblog.blogspot.com/2007/01/controlling-how-search-engines-access.html ]
[ http://googleblog.blogspot.com/2007/02/robots-exclusion-protocol.html ]

From the Yahoo search blog: One Standard Fits All: Robots Exclusion
Protocol for Yahoo!, Google and Microsoft
[ http://www.ysearchblog.com/archives/000587.html ]

From the Microsoft Live Search Webmaster Center Blog:
Robots Exclusion Protocol: Joining Together to Provide
Better Documentation
[ http://blogs.msdn.com/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx ]

From Google: How do I create a robots.txt file?
[ http://www.google.com/support/webmasters/bin/answer.py?answer=40362 ]

SearchTools.com: About Robots.txt and Search Indexing Robots
[ http://www.searchtools.com/robots/robots-txt.html ]

Wikipedia: Robots.txt
[ http://en.wikipedia.org/wiki/Robots.txt ]

Who invented robots.txt and why is it so brain-dead?
[ http://yro.slashdot.org/comments.pl?sid=377285&cid=21554125 ]

Checklist for Search Robot Crawling and Indexing
[ http://www.searchtools.com/robots/robot-checklist.html ]

Web robots and dynamic content issues
[ http://www.ghita.ro/article/23/web_robots_and_dynamic_content_issues.html ]

Appendix B, section B.4.1 of the HTML 4.01 Specification
[ http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1 ].

A Standard for Robot Exclusion
[ http://www.robotstxt.org/orig.html ]
[ http://www.robotstxt.net/ ]
[ http://www.hirschle.ch/html-kurs/robots/robots.html ]

A Method for Web Robots Control
[ http://www.robotstxt.org/norobots-rfc.txt ]

Proposal: An Extended Standard for Robot Exclusion
[ http://www.conman.org/people/spc/robots2.html ]

Parasites.txt: Addressing The Need for Parasite Inclusion
[ http://www.parasitestxt.org/index.php?page=3 ]

Using Apache to stop bad robots
[ http://evolt.org/article/Using_Apache_to_stop_bad_robots/18/15126/index.html ]

BotSeer: a search engine of robots.txt files
[ http://botseer.ist.psu.edu/about.html ]
[ http://botseer.ist.psu.edu/ ]
[ http://botseer.ist.psu.edu/stat.jsp ]
[ http://botseer.ist.psu.edu/help.jsp ]

Robotcop: block robots that ignore your robots.txt
[ http://www.robotcop.org/ ]
[ http://www.robotcop.org/details.html ]

Guy Macon <http://www.guymacon.com/>

Nikita the Spider

Jun 16, 2008, 9:53:01 PM
In article <O6mdnZUKW4s...@giganews.com>,
Guy Macon <"http://www.guymacon.com/"@-.-> wrote:

> Col Steve Austin Ret wrote:
> >
> >kenneth wrote:
> >
> >>Disallow: /*?*
> >>Disallow: /*?
> >
> >and here I thought robots.txt didn't include wildcards
>
> You thought wrong.
>
> Google, Yahoo!, and Microsoft have agreed upon a standard Robots
> Exclusion Protocol with wildcard support. See references below.

Not so fast...just because the biggest search engines have agreed on
something doesn't make it a universal standard. What's described on
robotstxt.org is the closest thing there is to a universal standard, and
that standard does *not* allow for wildcards. That is, * and ? will be
interpreted literally.

Popular libraries like those for Python and Perl still make no allowance
for wildcard extensions:
http://docs.python.org/lib/module-robotparser.html
http://perl.active-venture.com/lib/WWW/RobotRules.html

Anything written using those libraries won't respect wildcards in
robots.txt.
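
Here's a rough sketch of what I mean, using the standard-library
robotparser module ("SomeBot" and the URLs are made up for
illustration). The parser treats the wildcard rule as a literal path
prefix rather than a pattern:

  # Sketch only: shows that the stock parser ignores wildcard rules.
  try:
      import robotparser                        # module name circa Python 2
  except ImportError:
      import urllib.robotparser as robotparser  # renamed in Python 3

  rules = [
      "User-agent: *",
      "Disallow: /*?",   # intended to block every URL with a query string
  ]

  rp = robotparser.RobotFileParser()
  rp.parse(rules)
  rp.modified()  # record that rules were loaded; newer versions of the
                 # parser otherwise assume robots.txt was never fetched

  # A wildcard-aware crawler such as Googlebot would refuse this URL,
  # but robotparser reports it as fetchable because "/page?id=1" does
  # not start with the literal prefix "/*?".
  print(rp.can_fetch("SomeBot", "http://example.com/page?id=1"))  # True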

The practical upshot is that one can use wildcards in robots.txt;
some bots will respect them and some will not. I'd argue that
neither is terribly wrong.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more

Guy Macon

Jun 17, 2008, 1:55:13 PM


Nikita the Spider wrote:

>> Google, Yahoo!, and Microsoft have agreed upon a standard Robots
>> Exclusion Protocol with wildcard support. See references below.
>

>...just because the biggest search engines have agreed on
>something doesn't make it a universal standard. What's described on
>robotstxt.org is the closest thing there is to a universal standard, and
>that standard does *not* allow for wildcards. That is, * and ? will be
>interpreted literally.
>
>Popular libraries like those for Python and Perl still make no allowance
>for wildcard extensions:
>http://docs.python.org/lib/module-robotparser.html
>http://perl.active-venture.com/lib/WWW/RobotRules.html
>
>Anything written using those libraries won't respect wildcards in
>robots.txt.
>
>The practical upshot is that one can use wildcards in robots.txt,
>some bots will respect them and some will not. I'd argue that
>neither is terribly wrong.

The good news is that every version of robots.txt lets you specify
which exclusion rules apply to which robots. You can therefore write
one set of rules with wildcards for Google, Yahoo, and Microsoft,
and a second set without wildcards (one that conforms to Appendix B,
section B.4.1 of the HTML 4.01 Spec) for all other web-crawling
robots. I would put the sections with the wildcards at the end so
as to minimize the chances of confusing the other robots.
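
For example (a sketch only; the paths are placeholders, and "Slurp"
and "msnbot" are the Yahoo and Microsoft crawlers):

  # Plain prefix rules for everyone else, per the original standard.
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /private/

  # Wildcard rules, placed last, for the engines that support them.
  User-agent: Googlebot
  User-agent: Slurp
  User-agent: msnbot
  Disallow: /*?
  Disallow: /private*/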

What we really need is an updated version of robotcop that works
with the latest version of Apache and the latest robot exclusion
standards. I would argue that stopping spambots and other rude
robots is more important than accommodating older robots that do
not follow the new Google/Yahoo/Microsoft standard. Alas, the
robotcop project at www.robotcop.org appears to have been
abandoned by the developers back in 2002. :(
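
In the meantime, Apache's mod_rewrite can do a crude version of the
same job, along the lines of the "Using Apache to stop bad robots"
article I linked earlier. A sketch (the user-agent strings below are
placeholders, not a vetted blacklist):

  # In httpd.conf or .htaccess, with mod_rewrite enabled.
  RewriteEngine On
  # Return 403 Forbidden to requests from the named user agents.
  RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
  RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC]
  RewriteRule .* - [F,L]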


Guy Macon
<http://www.guymacon.com/>
