
Warning: robots.txt unreliable in Apache servers


Philip Ronan

Oct 29, 2005, 7:07:46 PM
Hi,

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.
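To make the mismatch concrete, here is a rough Python sketch (illustrative only; `robots_allows` and `apache_resolves` are made-up names, not code from any crawler or from Apache):

```python
import re

# Rough sketch of the mismatch, NOT any particular crawler's code:
# robots.txt matching is a literal prefix test, while Apache collapses
# runs of slashes before resolving the path.

def robots_allows(path, disallowed_prefixes):
    """Naive robots.txt check: literal prefix match, no normalization."""
    return not any(path.startswith(p) for p in disallowed_prefixes)

def apache_resolves(path):
    """What the server effectively serves: runs of slashes become one."""
    return re.sub(r"/+", "/", path)

rules = ["/foundation/"]
assert robots_allows("/foundation/", rules) is False       # blocked, as intended
assert robots_allows("//////foundation/", rules) is True   # slips past the rule
assert apache_resolves("//////foundation/") == "/foundation/"  # same page served
```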

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.

Phil

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/


Borek

Oct 29, 2005, 7:33:36 PM
On Sun, 30 Oct 2005 01:07:46 +0200, Philip Ronan <inv...@invalid.invalid>
wrote:

> Disallow: /foundation/
>
> But if I posted a link to //////foundation/ somewhere, the search engines
> will be quite happy to index it because it isn't covered by this rule.

Disallow: ////foundation

doesn't work? Just wild guess.

Best,
Borek
--
http://www.chembuddy.com - chemical calculators for labs and education
BATE - program for pH calculations
CASC - Concentration and Solution Calculator
pH lectures - guide to hand pH calculation with examples

David Dyer-Bennet

Oct 29, 2005, 8:01:22 PM
Borek <bo...@parts.bpp.to.com.remove.pl> writes:

> On Sun, 30 Oct 2005 01:07:46 +0200, Philip Ronan
> <inv...@invalid.invalid> wrote:
>
> > Disallow: /foundation/
> >
> > But if I posted a link to //////foundation/ somewhere, the search engines
> > will be quite happy to index it because it isn't covered by this rule.
>
> Disallow: ////foundation
>
> doesn't work? Just wild guess.

It may work, but somebody actively attempting to cause google to index
a page you don't want indexed (but which is visible to people) can do
all sorts of variants on that, and you'd have to notice and block them
one by one.
--
David Dyer-Bennet, <mailto:dd...@dd-b.net>, <http://www.dd-b.net/dd-b/>
RKBA: <http://noguns-nomoney.com/> <http://www.dd-b.net/carry/>
Pics: <http://dd-b.lighthunters.net/> <http://www.dd-b.net/dd-b/SnapshotAlbum/>
Dragaera/Steven Brust: <http://dragaera.info/>

Borek

Oct 30, 2005, 3:40:47 AM
On Sun, 30 Oct 2005 02:01:22 +0200, David Dyer-Bennet <dd...@dd-b.net>
wrote:

>>> Disallow: /foundation/
>>>
>>> But if I posted a link to //////foundation/ somewhere, the search
>>> engines
>>> will be quite happy to index it because it isn't covered by this rule.
>>
>> Disallow: ////foundation
>>
>> doesn't work? Just wild guess.
>
> It may work, but somebody actively attempting to cause google to index
> a page you don't want indexed (but which is visible to people) can do
> all sorts of variants on that, and you'd have to notice and block them
> one by one.

I am aware of the problem; in many cases it is just a stupid
typo or a buggy script generating such a link on some other page.
If so, //// in robots.txt can be enough.

Philip Ronan

Oct 30, 2005, 4:32:02 AM
"Borek" wrote:

> I am aware of the problem; in many cases it is just a stupid
> typo or a buggy script generating such a link on some other page.
> If so, //// in robots.txt can be enough.

Multiple slashes are ignored *anywhere* in a URL.

So if you have a rule like "Disallow: /path/to/some-file.html", then it
could be bypassed by any of the following URLs:

//path/to/some-file.html
/path//to/some-file.html
/path/to//some-file.html
//path//to//some-file.html

Simply disallowing paths that start with a double forward slash isn't good
enough. And since the robots.txt protocol doesn't allow pattern matching
(e.g. regular expressions), it simply isn't possible to trap every
variant.

If there's a bug anywhere, it's in the Apache software.

Benjamin Niemann

Oct 30, 2005, 5:05:02 AM
Philip Ronan wrote:

> "Borek" wrote:
>
>> I am aware of the problem; in many cases it is just a stupid
>> typo or a buggy script generating such a link on some other page.
>> If so, //// in robots.txt can be enough.
>
> Multiple slashes are ignored *anywhere* in a URL.
>

> If there's a bug anywhere, it's in the Apache software.

It's completely up to the server how it maps URLs to local files - at least I
could not find any normative rules that say anything else. Paths
like /path/to//some-file.html are often generated by badly written client
software, and Apache decided to tolerate this (it's not a bug, it's a
feature ;). IIS seems to work similarly (no extensive testing, but
http://www.microsoft.com///robots.txt works).
Robots should adapt to this behavior and remove empty segments before
matching a path against the robots.txt rules.
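A sketch of that crawler-side fix in Python (hypothetical names; this is not quoting any real crawler's code):

```python
import re

def is_disallowed(path, rules):
    """Normalize away empty path segments (collapse runs of slashes)
    before doing the usual robots.txt prefix match."""
    normalized = re.sub(r"/{2,}", "/", path)
    return any(normalized.startswith(rule) for rule in rules)

rules = ["/foundation/"]
assert is_disallowed("//////foundation/", rules)          # caught after normalizing
assert is_disallowed("/foundation//history.html", rules)  # mid-path slashes too
assert not is_disallowed("/docs/", rules)                 # unrelated paths untouched
```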

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/

Philip Ronan

Oct 30, 2005, 6:15:03 AM
"Nick Kew" wrote:

> If you have links to things like "////" and dumb robots, put the
> paths in your robots.txt. Don't forget that robots.txt is only
> advisory and is commonly ignored by evil and/or broken robots.

But retroactively adding to the robots.txt file every time someone posts a
bad link to your site just isn't a practical solution. I realize not all
robots bother with the robots.txt protocol, but if even the legitimate
spiders can be misdirected then the whole point of having a robots.txt file
goes out the window.

Philip Ronan

Oct 30, 2005, 6:12:09 AM
in alt.internet.search-engines, "Benjamin Niemann" wrote:

> It's completely up to the server how it maps URLs to local files - at least I
> could not find any normative rules that say anything else. Paths
> like /path/to//some-file.html are often generated by badly written client
> software, and Apache decided to tolerate this (it's not a bug, it's a
> feature ;).

In c.i.w.a.h, "Benjamin Niemann" wrote:

> I would tend to blame googlebot (and any other affected robot). Unless a
> different behaviour ('...foo//bar...' and '...foo/bar...' resolve to
> different resources on the server) is common practice, the robot should
> normalize such paths (removing empty segments) before matching them against
> the rules from the robots.txt file.

(Now crossposting to both groups)

I did contact Google about this, and all I got was a standard reply to the
effect that there's no reason why the robots.txt rule "Disallow: /path"
should apply to both "/path" and "//path". They *do* have a point.

And until/unless Google and all the other search engines start behaving as
you suggest (and let's face it, they never will), it's going to be the
responsibility of site owners to make sure this problem is dealt with at
source.

Philip Ronan

Oct 30, 2005, 6:12:14 AM
OK, here's the fix I'm using (Apache/PHP). Feel free to comment.

1. Put this in your /.htaccess file:
====================================
RewriteEngine On
RewriteCond %{REQUEST_URI} //+
RewriteRule ^(.*)$ /slashfix.php [L]

2. Create a file called /slashfix.php, containing the following:
================================================================
<?php

$prot = $_SERVER["SERVER_PROTOCOL"];
$host = "http://" . $_SERVER["HTTP_HOST"];
$req = $_SERVER["REQUEST_URI"];
$script = $_SERVER["SCRIPT_NAME"];
$sig = $_SERVER["SERVER_SIGNATURE"];
$newLoc = $host . preg_replace('|//+|', '/', $req);

if ($req == $script) die("Unable to redirect.\r\n");

header("$prot 301 Moved Permanently");
header("Location: $newLoc");

?><HTML>
<HEAD>
<TITLE>Invalid Request URI</TITLE>
</HEAD>
<BODY>
<H1>Invalid Request URI</H1>
<P>The URI you provided contains consecutive forward slashes,
which are not accepted by this website. Please try visiting
<A href="<?php echo $newLoc; ?>"><?php echo $newLoc; ?></A> instead.</P>
<HR>
<?php echo $sig; ?>
</BODY>
</HTML>
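The load-bearing line of the script is the preg_replace; in Python terms it amounts to this sketch (note that, like the PHP above, it would also collapse slashes inside a query string):

```python
import re

def slashfix(request_uri):
    """Python equivalent of preg_replace('|//+|', '/', $req):
    collapse every run of two or more slashes into one."""
    return re.sub(r"//+", "/", request_uri)

assert slashfix("//////foundation/") == "/foundation/"
assert slashfix("/path//to///file.html") == "/path/to/file.html"
# Caveat: a '//' inside a query string is rewritten as well.
assert slashfix("/page?next=http://a//b") == "/page?next=http:/a/b"
```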

Nick Kew

Oct 30, 2005, 7:26:41 AM
Philip Ronan wrote:

[please don't crosspost without warning. Or with inadequate context]

> "Nick Kew" wrote:
>
>
>>If you have links to things like "////" and dumb robots, put the
>>paths in your robots.txt. Don't forget that robots.txt is only
>>advisory and is commonly ignored by evil and/or broken robots.
>
>
> But retroactively adding to the robots.txt file every time someone posts a
> bad link to your site just isn't a practical solution.

Who said anything about that? What's impractical about "Disallow //" ?

--
Nick Kew

Philip Ronan

Oct 30, 2005, 9:44:29 AM
"Nick Kew" wrote:

> [please don't crosspost without warning. Or with inadequate context]

My original post was copied over to ciwah, so now there are two threads with
the same subject. I'm trying to tie them together, mkay?

> Philip Ronan wrote:
>>
>> But retroactively adding to the robots.txt file every time someone posts a
>> bad link to your site just isn't a practical solution.
>
> Who said anything about that?

You did, in your earlier post: [quote]If you have links to things like
"////" and dumb robots, put the paths in your robots.txt.[/quote]

> What's impractical about "Disallow //" ?

It's a partial solution. If you're trying to protect content at deeper
levels in the hierarchy, you will also need:

Disallow: /path//to/file
Disallow: /path/to//file
Disallow: /path//to//file
Disallow: /path///to/file
etc..

As I said, robots.txt is inadequate for this purpose because it doesn't
support pattern matching.
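The "etc." above hides quick combinatorial growth, which a line of arithmetic makes plain (illustration only):

```python
# Even capping each run at just two slashes, a path with k slash
# positions has 2**k possible spellings, and the original robots.txt
# protocol (no wildcards) would need a separate Disallow line for each.
path = "/path/to/file"
k = path.count("/")   # 3 slash positions in this example
assert k == 3
assert 2 ** k == 8    # 8 spellings; unbounded runs make it infinite
```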

Philip Ronan

Oct 30, 2005, 9:44:33 AM
In comp.infosystems.www.authoring.html, "Stan Brown" wrote:

> Wouldn't it be more effective to have any URL containing http://.*//
> return a 403 Forbidden or a 404 Not Found? This could be done in
> .htaccess or perhaps httpd.conf. I may be having a failure of
> imagination, but I can't think of any legitimate reason for such a
> link.

That would also be effective, but maybe it's better to do something useful
with the URL if you can.

Most servers will redirect to a URL with a trailing slash when the name of a
directory is requested. Why not treat multiple slashes in a similar way?

Besides, it might help in terms of page rank.

[[Crossposted to alt.internet.search-engines, with apologies to Nick]]

David

Oct 30, 2005, 10:38:05 AM
On Sun, 30 Oct 2005 11:15:03 +0000, Philip Ronan
<inv...@invalid.invalid> wrote:

>"Nick Kew" wrote:
>
>> If you have links to things like "////" and dumb robots, put the
>> paths in your robots.txt. Don't forget that robots.txt is only
>> advisory and is commonly ignored by evil and/or broken robots.
>
>But retroactively adding to the robots.txt file every time someone posts a
>bad link to your site just isn't a practical solution. I realize not all
>robots bother with the robots.txt protocol, but if even the legitimate
>spiders can be misdirected then the whole point of having a robots.txt file
>goes out the window.

A simple solution would be to add the robots meta tag to all pages you
don't want indexing as a backup for when someone links with //. Kind
of defeats the whole point of using a robots.txt file, but what else
can you do?

David
--
Free Search Engine Optimization Tutorial
http://www.seo-gold.com/tutorial/

Guy Macon

Oct 30, 2005, 10:44:56 AM


It's situations like this that led me to use a <meta name="robots"
content="noindex, nofollow" /> in addition to robots.txt on all
XHTML pages that I wish to exclude. The worst that can happen
is that I waste 50 bytes, and it might catch some 'bots that would
otherwise index the page because of the // bug or some other problem.

--
Guy Macon <http://www.guymacon.com/>


Guy Macon

Oct 30, 2005, 1:46:36 PM

David Ross wrote:


>
>Philip Ronan wrote:
>>
>> I recently discovered that robots.txt files aren't necessarily any use on
>> Apache servers.
>>
>> For some reason, the Apache developers decided to treat multiple consecutive
>> forward slashes in a request URI as a single forward slash. So for example,
>> <http://apache.org/foundation/> and <http://apache.org//////foundation/>
>> both resolve to the same page.
>>
>> Let's suppose the Apache website owners want to stop search engine robots
>> crawling through their "foundation" pages. They could put this rule in their
>> robots.txt file:
>>

>> Disallow: /foundation/
>>
>> But if I posted a link to //////foundation/ somewhere, the search engines
>> will be quite happy to index it because it isn't covered by this rule.
>>

>> As a result of all this, Google is currently indexing a page on my website
>> that I specifically asked it to stay away from :-(
>>
>> You might want to check the behaviour of your servers to see if you're
>> vulnerable to the same sort of problem.
>>
>> If anyone's interested, I've put together a .htaccess rule and a PHP script
>> that seem to sort things out.
>

>I thought that parsing and processing a robots.txt file is the
>responsibility of the bot and not the Web server. All the Web
>server has to do is deliver the robots.txt file to the bot.
>
>If that is true, the problem lies within Google and not Apache.

I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

Brian Wakem

Oct 30, 2005, 2:14:23 PM
Guy Macon <http://www.guymacon.com/> wrote:

> I was about to opine that "http://apache.org//////" is not the same
> as "http://apache.org/", but it appears that IIS has the same behavior:
> See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
> Is there something in the specs that says that treating "//////" and
> "/" the same is proper behavior?
>


Don't know, but it seems to be the case on unix/linux filesystems too.

If I 'cd //////usr////////////local////apache2' I end up
in /usr/local/apache2

The web servers are probably mimicking this behaviour.
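The same collapsing can be checked from Python's POSIX path handling (with the standard caveat that POSIX allows exactly two leading slashes to be treated specially, and normpath preserves them):

```python
import posixpath

# Pure string-level path normalization, mirroring the shell behaviour
# described above: runs of slashes collapse to one...
assert posixpath.normpath("//////usr////local////apache2") == "/usr/local/apache2"
# ...except that exactly two leading slashes are kept, per POSIX.
assert posixpath.normpath("//usr/local") == "//usr/local"
```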


--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png

alain

Oct 30, 2005, 2:51:33 PM
Brian Wakem wrote:
>> I was about to opine that "http://apache.org//////" is not the same
>> as "http://apache.org/", but it appears that IIS has the same behavior:
>> See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
>> Is there something in the specs that says that treating "//////" and
>> "/" the same is proper behavior?
>>
> Don't know, but it seems to be the case on unix/linux filesystems too,
>
> If I 'cd //////usr////////////local////apache2' I end up
> in /usr/local/apache2

Same goes for Windows/DOS;
'cd ///windows///system32' brings you to '/windows/system32'.


--
Touched By His Noodly Appendage

Borek

Oct 30, 2005, 3:04:37 PM
On Sun, 30 Oct 2005 20:51:33 +0100, alain <al...@spamcop.net> wrote:

>> Don't know, but it seems to be the case on unix/linux filesystems too,
>> If I 'cd //////usr////////////local////apache2' I end up
>> in /usr/local/apache2
>
> Same goes for Windows/DOS;
> 'cd ///windows///system32' brings you to '/windows/system32'.

Interesting, my Windows don't accept /, but they accept \ ;)

http://www.chembuddy.com/?left=BATE&right=basic_acid_titration_equilibria
http://www.chembuddy.com/?left=CASC&right=concentration_and_solution_calculator

Jim Moe

Oct 30, 2005, 3:18:21 PM
Guy Macon wrote:
>
> I was about to opine that "http://apache.org//////" is not the same
> as "http://apache.org/", but it appears that IIS has the same behavior:
> See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
> Is there something in the specs that says that treating "//////" and
> "/" the same is proper behavior?
>
You are referring to which specs?
This behavior for following paths is from unix and is how all C
compilers handle paths. It is simply applied to URLs as well. There may
even be a requirement in the C specification about paths.

--
jmm (hyphen) list (at) sohnen-moe (dot) com
(Remove .AXSPAMGN for email)

Dave0x1

Oct 30, 2005, 3:45:32 PM
Guy Macon wrote:


> I was about to opine that "http://apache.org//////" is not the same
> as "http://apache.org/", but it appears that IIS has the same behavior:
> See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
> Is there something in the specs that says that treating "//////" and
> "/" the same is proper behavior?

Hint: Read the documentation offered at either of the first two URLs.

I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

Dave

Philip Ronan

Oct 30, 2005, 6:38:50 PM
"Dave0x1" wrote:

> I don't understand why this is a big deal. The issue can be addressed
> by numerous methods, including patching of the Apache web server source
> code.

OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.

> It's not clear exactly what the problem *is*. I've never seen a URL
> with multiple adjacent forward slashes in my search results. Does
> someone have an example?

Which bit didn't I explain properly? I'm not going to post a link for you to
check, but here's the response I got from Google on the issue:

>> Thank you for your note. We apologize for our delayed response.
>> We understand you're concerned about the inclusion of
>> http://###.####.###//contact/ in our index.
>>
>> It's important to note that we visited the live page in question
>> and found that it currently exists on the web as listed above.
>> Because this page falls outside your robots.txt file, you may
>> want to use meta tags to remove this page from our index. For
>> more information about using meta tags, please visit
>> http://www.google.com/remove.html
>>
>> [remainder snipped]

I didn't publish the link to //contact/, someone else did. So that means the
robots.txt protocol is ineffective on (probably) most servers because it can
be circumvented without your knowledge by a third party.

Hope that's all clear now.

Borek

Oct 30, 2005, 6:43:59 PM
On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <a...@example.com> wrote:

> It's not clear exactly what the problem *is*. I've never seen a URL
> with multiple adjacent forward slashes in my search results. Does
> someone have an example?

/%3Fleft%3DpH-calculation%26right%3Dtoc&hl=pt-BR&lr=lang_pt&sa=G
/?left=BATE&amp%3Bright=phcalculation
/?left=BATE&amp;amp;right=dissociation_constants
/?left=BATE&right=basic_acid_titr
/?left=BATE&right=basic_acid_titration_equilbria
/?left=BATE&right=basic_acid_titration_equilibri
/?left=BATE&right=basic_acid_titration_equilibria">pH
/?left=BATE&right=basic_acid_titration_equilibria%22%3EpH
/?left=BATE&right=basic_acid_titration_equilibria/////////////////////////////////////////////////////
/?left=BATE&right=dissociation_constants]</td></tr><tr>
/?left=casc&amp/
/?left=casc&amp;right=download
/?left=faq/
/?left=dave-is-great
/?left=BATE&right=basic_acid_titration_equilibria/
/index.php[left]BATE[right]overview[SiteID]simtel.net
/pHlecimg/3-f.png
/pHlecimg/3-g.png
/?left=pH-calculation
/?left=casc&right=concentration_and_solution_calculator
/?left=casc&right=density_tables
/files/CASCInstall.ziphttp:/www.chembuddy.com/files/CASCInstall.exe
/?left=bate&right=dissociation_constants
/?left=bate&right=download
/?left=bate&right=screenshots
/this_is_a_test_of_404_response
/?left=CASC&amp;right=buy
/?left=CASC&right=concentration_and_solution_calculator://
/?left=CASC&amp;right=density_tables
/?left=BATE&right=right=basic_acid_titration_equilibria

All of these generated 404 in last few weeks on my site.

No additional slashes inside of the url, although several times
they were added at the end.

& vs &amp; and wrong capitalization (bate, casc instead of BATE, CASC)
are the most prominent sources of errors. But it seems every error is possible :)

Best,
Borek
--

Message has been deleted

Guy Macon

Oct 31, 2005, 1:07:36 AM

Dave0x1 wrote:

>It's not clear exactly what the problem *is*. I've never seen a URL
>with multiple adjacent forward slashes in my search results.

If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.

Philip Ronan

Oct 31, 2005, 3:51:58 AM
"D. Stussy" wrote:

> On Sun, 30 Oct 2005, Philip Ronan wrote:
>> if even the legitimate spiders can be misdirected then the whole point of
>> having a robots.txt file goes out the window.
>

> No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
> robots into our web sites. Those robots which behave and respect the
> robots.txt file will NEVER fall into these traps.

You seem to have misunderstood the problem. These robots CAN and DO access
pages prohibited by robots.txt files due to the way servers process
consecutive slashes in request URIs.

> To the original poster:

Yes, that's me.

> Just because you haven't planned for the malicious to happen shows us that you
> are closed minded. Your reliance ONLY on robots.txt shows this also. Open up
> your thinking.

You also seem to have misunderstood the whole point of this thread. I'm not
asking for help here. I'm just warning people about the unreliability of
robots.txt as a means of excluding your pages from search engines.

> You should also probably use the "robots" meta-tag on each HTML page.

That was the first thing I did when I noticed there was a problem.

> Have you considered using the rewrite engine to trap for "//" in the URI (the
> part of the URL after the protocol and domain name is removed)?

You haven't been paying attention, have you?
<http://groups.google.com/group/alt.internet.search-engines/msg/9a0f7baad24c74dc?hl=en&>

Guy Macon

Oct 31, 2005, 5:08:23 AM


Philip Ronan wrote:
>
>"D. Stussy" wrote:
>
>> On Sun, 30 Oct 2005, Philip Ronan wrote:
>>
>>> if even the legitimate spiders can be misdirected then the whole point of
>>> having a robots.txt file goes out the window.
>>
>> No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
>> robots into our web sites. Those robots which behave and respect the
>> robots.txt file will NEVER fall into these traps.
>
>You seem to have misunderstood the problem. These robots CAN and DO access
>pages prohibited by robots.txt files due to the way servers process
>consecutive slashes in request URIs.

Which means that the operators of the bad robots can put up a few
multiple-slash links so as to lure good robots into those honeypots
and traps, thus discouraging their use. The good news is that all
of the good robots that I know of obey metas as well as robots.txt,
so they can only do that to honeypot owners who are under the same
mistaken belief that Mr. Stussy expressed above - that "Those
robots which behave and respect the robots.txt file will never fall
into these traps."

I am still hoping that one of the .htaccess experts will come up
with a way to make all multiple-slash requests 301 redirect to
their single-slash versions.
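For what it's worth, an untested mod_rewrite sketch along these lines ought to do that (assumes mod_rewrite is available; with several separate runs of slashes it may take more than one 301 round trip before the URL is clean):

```apache
RewriteEngine On
# REQUEST_URI keeps the repeated slashes verbatim (the per-directory
# rewrite path may not), so test it in a RewriteCond and rebuild the
# URL from the condition's backreferences.
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . %1/%2 [R=301,L]
```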

Message has been deleted
Message has been deleted

Philip Ronan

Oct 31, 2005, 6:23:52 AM
"D. Stussy" wrote:

> On Mon, 31 Oct 2005, Philip Ronan wrote:
>>
>> You seem to have misunderstood the problem. These robots CAN and DO access
>> pages prohibited by robots.txt files due to the way servers process
>> consecutive slashes in request URIs.
>

> No - not in a PROPERLY set up system they won't.

If by "properly set up" you are referring to a system that has been fixed to
redirect or deny requests for URIs containing consecutive slashes, then
that's correct. In fact that's what I've been suggesting all along in this
thread.

> If one is trapping for a "//" (any number of slashes greater than one),
> then the robots (or anyone/anything else) will ever get there.

I don't understand what you mean. If you think the addition of a rule
"Disallow: //" will completely fix the problem then you're mistaken. I've
already explained why.

>> You haven't been paying attention, have you?
>> <http://groups.google.com/group/alt.internet.search-engines/msg/9a0f7baad24c
>> 74dc?hl=en&>
>

> I'm NOT reading this message from that group. If all the messages in the
> thread weren't also crossposted to the group I'm reading this from -
> comp.infosystems.www.authoring.html, TFB. Deal with it.

Maybe there's something wrong with your newsreader then.
<http://groups.google.com/group/comp.infosystems.www.authoring.html/msg/9a0f7baad24c74dc>

Tim

Oct 31, 2005, 6:33:06 AM
Philip Ronan:

> the robots.txt protocol is ineffective on (probably) most servers because
> it can be circumvented without your knowledge by a third party.

It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.

Likewise with Google's advice:

>> Because this page falls outside your robots.txt file, you may want to
>> use meta tags to remove this page from our index.

In either case, such restrictions only help reduce the load on your server
from well meaning robots. If you want to truly restrict access, you need
to use some form of authentication.

There were moves to suggest that robots exclusion ought to let you specify
what you allow and disallow. For some cases it'd be easier to exclude
everything by default, only allowing what you want through. Though I
don't think that ever took off.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Borek

Oct 31, 2005, 6:48:47 AM
On Mon, 31 Oct 2005 11:58:36 +0100, D. Stussy <att-...@bde-arc.ampr.org>
wrote:

> That's not a mistaken belief. Technically, a double (or more) slash,
> other than following a colon when separating a protocol from a domain
> name and not counting the query string, is NOT a valid URL construct.
> Robots which accept them are misbehaved.

Quoting RFC1738 (BNF description of url):

; HTTP

httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

* means 0 or infinite repetitions, thus hsegment can be empty.
If hsegment can be empty, hpath may contain multiple not separated slashes.
So it seems you are wrong - multiple slashes in URLs are valid.
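Python's standard URL parser agrees with that reading (illustration; the empty strings below are the empty hsegments the grammar allows):

```python
from urllib.parse import urlsplit

# Per the RFC 1738 grammar quoted above, hpath is hsegment *("/" hsegment)
# and hsegment may be empty, so consecutive slashes are syntactically
# valid: each extra slash just delimits an empty path segment.
url = "http://apache.org//////foundation/"
path = urlsplit(url).path
assert path == "//////foundation/"     # the parser keeps them verbatim
segments = path.split("/")[1:]         # drop the leading empty string
assert segments == ["", "", "", "", "", "foundation", ""]
```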

Philip Ronan

Oct 31, 2005, 8:42:36 AM
"Tim" wrote:

> Philip Ronan:
>
>> the robots.txt protocol is ineffective on (probably) most servers because
>> it can be circumvented without your knowledge by a third party.
>
> It always has been, anyway. For numerous reasons. Your multiple slash
> example is just one of them. Some robots will ignore them altogether,
> others will deliberately look at what you tell them to ignore.

What you're saying is that it's pointless putting absolute faith in
robots.txt files because they are ignored by some robots. I'm not disputing
that. What I'm saying is that even genuine well-behaved robots like
Googlebot can be made to crawl content prohibited by robots.txt files.

So for example, if you're using a honeypot to block badly behaved robots
from your website automatically, then I can *remove your site from Google*
and probably other search engines simply by publishing a link to your
honeypot directory with an extra slash inserted somewhere. That's why this
issue is important.

I hope you understand now.

Big Bill

Oct 31, 2005, 8:58:00 AM
On Mon, 31 Oct 2005 10:58:36 GMT, "D. Stussy"
<att-...@bde-arc.ampr.org> wrote:

>On Mon, 31 Oct 2005, Guy Macon wrote:
>> Philip Ronan wrote:
>> >
>> >"D. Stussy" wrote:
>> >
>> >> On Sun, 30 Oct 2005, Philip Ronan wrote:
>> >>
>> >>> if even the legitimate spiders can be misdirected then the whole point of
>> >>> having a robots.txt file goes out the window.
>> >>
>> >> No, it hasn't. Some of us have built honeypots and traps for MISBEHAVED
>> >> robots into our web sites. Those robots which behave and respect the
>> >> robots.txt file will NEVER fall into these traps.
>> >
>> >You seem to have misunderstood the problem. These robots CAN and DO access
>> >pages prohibited by robots.txt files due to the way servers process
>> >consecutive slashes in request URIs.
>>
>> Which means that the operators of the bad robots can put up a few
>> multiple-slash links so as to lure good robots into those honeypots
>> and traps, thus discouraging their use. The good news is that all
>> of the good robots that I know of obey metas as well as robots.txt,
>> so they can only do that to honeypot owners who are under the same
>> mistaken belief that Mr. Stussy expressed above - that "Those
>> robots which behave and respect the robots.txt file will never fall
>> into these traps."
>

>That's not a mistaken belief. Technically, a double (or more) slash, other
>than following a colon when separating a protocol from a domain name and not
>counting the query string, is NOT a valid URL construct. Robots which accept
>them are misbehaved.
>

>> I am still hoping that one of the .htaccess experts will come up
>> with a way to make all multiple-slash requests 301 redirect to
>> their single-slash versions.
>

>Trivial. Do it yourself.

Umm, no, I think we'll hand the floor over to you at this point.

BB
--
www.kruse.co.uk/ s...@kruse.demon.co.uk
Elvis does my SEO

Borek

Oct 31, 2005, 9:47:40 AM
On Mon, 31 Oct 2005 12:48:47 +0100, Borek
<bo...@parts.bpp.to.com.remove.pl> wrote:

>> That's not a mistaken belief. Technically, a double (or more) slash,
>> other than following a colon when separating a protocol from a domain
>> name and not counting the query string, is NOT a valid URL construct.
>> Robots which accept them are misbehaved.
>
> Quoting RFC1738 (BNF description of url):
>
> ; HTTP
>
> httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
> hpath = hsegment *[ "/" hsegment ]
> hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
> search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
>
> * means 0 or infinite repetitions, thus hsegment can be empty.

Small correction - 0 _to_ infinite repetitions, not OR infinite
repetitions. But it doesn't change final conclusion.

> If hsegment can be empty, hpath may contain multiple not separated
> slashes.
> So it seems you are wrong - multiple slashes in URLs are valid.

http://www.chembuddy.com/?left=BATE&right=basic_acid_titration_equilibria
http://www.chembuddy.com/?left=CASC&right=concentration_and_solution_calculator

Robi

Oct 31, 2005, 9:58:43 AM
Philip Ronan wrote in message news:BF8BAF28.3A0BA%inv...@invalid.invalid...

> "D. Stussy" wrote:
>> On Mon, 31 Oct 2005, Philip Ronan wrote:
[...]

>>> You haven't been paying attention, have you?
>>> <http://groups.google.com/group/alt.internet.search-engines/msg/9a0f7baad24c
>>> 74dc?hl=en&>
>>
>> I'm NOT reading this message from that group. If all the messages in the
>> thread weren't also crossposted to the group I'm reading this from -
>> comp.infosystems.www.authoring.html, TFB. Deal with it.
>
> Maybe there's something wrong with your newsreader then.
> <http://groups.google.com/group/comp.infosystems.www.authoring.html/msg/9a0f
> 7baad24c74dc>

I don't know what is worse than telling someone
"there's something wrong with newsreader"
and at the same time posting b0rken links.

Nick Kew

unread,
Oct 31, 2005, 9:00:46 AM10/31/05
to
Philip Ronan wrote:

> If by "properly set up" you are referring to a system that has been fixed to
> redirect or deny requests for URIs containing consecutive slashes, then
> that's correct. In fact that's what I've been suggesting all along in this
> thread.

Feel free to set up your server like that. Apache provides a range of
mechanisms for doing so, which you can read all about at apache.org.
It only applies default rules (map to the filesystem) if you haven't
asked it to do otherwise.

--
Nick Kew

Borek

unread,
Oct 31, 2005, 10:08:26 AM10/31/05
to
On Mon, 31 Oct 2005 15:58:43 +0100, Robi <m...@privacy.net> wrote:

> and at the same time posting b0rken links.

Is it a typo, or bad joke? ;)

Robi

unread,
Oct 31, 2005, 10:43:56 AM10/31/05
to
Borek wrote in message news:op.szim0cam584cds@borek...

> On Mon, 31 Oct 2005 15:58:43 +0100, Robi wrote:
>
> > and at the same time posting b0rken links.
>
> Is it a typo, or bad joke? ;)

http://www.bennetyee.org/http_webster.cgi?isindex=borken

nothing to do with your name, sorry ;-)

Philip Ronan

unread,
Oct 31, 2005, 10:54:47 AM10/31/05
to
"Robi" wrote:

> Philip Ronan wrote in message news:BF8BAF28.3A0BA%inv...@invalid.invalid...
>>

>> Maybe there's something wrong with your newsreader then.
>> <http://groups.google.com/group/comp.infosystems.www.authoring.html/msg/9a0f
>> 7baad24c74dc>
>
> I don't know what is worse than telling someone
> "there's something wrong with newsreader"
> and at the same time posting b0rken links.

... using a crap newsreader and blaming everyone else when it doesn't work?

If your newsreader can't handle this link:
<http://groups.google.com/group/comp.infosystems.www.authoring.html/msg/9a0f
7baad24c74dc>

then try this one instead: <http://tinyurl.com/89bmv>

If you're not too busy then try this one too:
<http://rfc.net/rfc2396.html#sE.>

Guy Macon

unread,
Oct 31, 2005, 11:58:48 AM10/31/05
to


Tim wrote:
>
>Philip Ronan:
>
>> the robots.txt protocol is ineffective on (probably) most servers because
>> it can be circumvented without your knowledge by a third party.
>
>It always has been, anyway. For numerous reasons. Your multiple slash
>example is just one of them. Some robots will ignore them altogether,
>others will deliberately look at what you tell them to ignore.

The robots.txt protocol has always been ineffective on bad
robots, but this is, as far as I know, the first example of
it being ineffective on good robots.

--
Guy Macon <http://www.guymacon.com>


Guy Macon

unread,
Oct 31, 2005, 12:14:21 PM10/31/05
to


D. Stussy wrote:


>
>Guy Macon wrote:
>
>> I am still hoping that one of the .htaccess experts will come up
>> with a way to make all multiple-slash requests 301 redirect to
>> their single-slash versions.
>

>Trivial. Do it yourself.

What I described appears to not only be non-trivial, but also
appears to be impossible. Feel free to prove me wrong by posting
a counterexample that redirects all multiple-slash requests to
their single-slash versions. I don't think that you can do it,
but I am not an expert on .htaccess wizardry, so I may be wrong.

One would think that if such a trivial fix existed that someone
in the last 40+ posts would have posted it, thus solving the
problem...

--
Guy Macon <http://www.guymacon.com/>

Philip Ronan

unread,
Oct 31, 2005, 12:46:05 PM10/31/05
to
"Guy Macon" wrote:

> One would think that if such a trivial fix existed that someone
> in the last 40+ posts would have posted it, thus solving the
> problem...

Guy, if you've seen my solution at <http://tinyurl.com/89bmv> and you
haven't got access to PHP, you could try a recursive solution using
.htaccess by itself:

RewriteEngine On
RewriteCond %{REQUEST_URI} //+
RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]

I haven't tested this, but -- in theory -- if the server detects a cluster
of forward slashes in a request URI, it will redirect the client to a URI
containing a single slash in its place. If a request contains more than one
cluster of forward slashes, then the client will be redirected more than
once, but it should eventually get to the right place.
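That multi-hop behaviour can be sketched outside Apache. The loop below simply re-applies the rule's regex the way a browser would follow each 301 in turn, using Python's `re` in place of mod_rewrite - a simulation of the rule as posted, not the server itself:

```python
import re

def follow_redirects(uri):
    """Re-apply RewriteRule ^(.*)//+(.*)$ -> $1/$2 until the URI stops
    changing, mimicking a client chasing each 301 in turn."""
    hops = 0
    while True:
        new = re.sub(r"^(.*)//+(.*)$", r"\1/\2", uri, count=1)
        if new == uri:
            return uri, hops
        uri, hops = new, hops + 1

print(follow_redirects("/one////two//three"))  # ('/one/two/three', 4)
```

Note that the greedy first group only removes one slash per pass from a long run, which is why a single request can take several hops before settling.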

Alan J. Flavell

unread,
Oct 31, 2005, 1:39:29 PM10/31/05
to
On Mon, 31 Oct 2005, Philip Ronan wrote:

> RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]

Actually, a RewriteMatch would suffice, it doesn't need the
full panoply of mod_rewrite...

Your regex doesn't do quite what you hope, due to the greedy nature of
the first "(.*)"

Incidentally, I recommend "pcretest" for this kind of fun.

$ pcretest
PCRE version 3.9 02-Jan-2002

re> "^(.*)//+(.*)$"
data> /one////two/three
0: /one////two/three
1: /one//
2: two/three

As you see, $1 captures a pair of slashes which you really wanted
to be captured by your "//+" portion. As I say, I made the same
mistake at first.

I'd then got closer, with ^(.*?)/{2,}(.*)$ $1/$2

re> "^(.*?)/{2,}(.*)$"
data> /one////two/three
0: /one////two/three
1: /one
2: two/three

with the end result being /one/two/three , as desired.

I think your "//+" is pretty much synonymous with my "/{2,}";
the key difference is to make the first regex non-greedy.
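The same greedy-vs-lazy behaviour shows up in any PCRE-style engine; here it is reproduced with Python's `re` module rather than pcretest, and the captures agree with the pcretest output above:

```python
import re

# Greedy first group: (.*) backtracks only far enough for //+ to match,
# so it keeps two of the four slashes for itself.
greedy = re.match(r"^(.*)//+(.*)$", "/one////two/three")
print(greedy.groups())   # ('/one//', 'two/three')

# Lazy first group: (.*?) gives up characters as early as possible,
# so the whole slash run is consumed by /{2,}.
lazy = re.match(r"^(.*?)/{2,}(.*)$", "/one////two/three")
print(lazy.groups())     # ('/one', 'two/three')
```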


> If a request contains more than one cluster of forward slashes, then
> the client will be redirected more than once, but it should
> eventually get to the right place.

Indeed.

But aren't there also analogous abuse possibilities with things like
/././ and /.././ and so on?

Alan J. Flavell

unread,
Oct 31, 2005, 1:42:33 PM10/31/05
to
On Mon, 31 Oct 2005, Alan J. Flavell wrote:

> On Mon, 31 Oct 2005, Philip Ronan wrote:
>
> > RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
>
> Actually, a RewriteMatch would suffice,

*RATS*: I meant of course "RedirectMatch". Sorry.

But I think the rest of what I posted is OK.

Philip Ronan

unread,
Nov 1, 2005, 5:36:32 AM11/1/05
to
"Alan J. Flavell" wrote:

> On Mon, 31 Oct 2005, Philip Ronan wrote:
>
>> RewriteRule ^(.*)//+(.*)$ $1/$2 [R=301,L]
>

> Actually, a [RedirectMatch] would suffice, it doesn't need the


> full panoply of mod_rewrite...
>
> Your regex doesn't do quite what you hope, due to the greedy nature of
> the first "(.*)"

Ah, well spotted. :-)

In which case, this ought to do the trick:

# Eliminate forward slash clusters
RedirectMatch 301 ^(.*?)//+(.*)$ $1/$2

> But aren't there also analogous abuse possibilities with things like
> /././ and /.././ and so on?

Another good point. I thought my server was already redirecting those, but
apparently not -- it was the browser correcting my URLs for me.

Perhaps someone can debug these for me?

# Replace /./ with /
RedirectMatch 301 ^(.*?)/\./(.*)$ $1/$2

# Replace /../foo/bar with /foo/bar (at beginning of URI)
RedirectMatch 301 ^/\.\./(.*)$ /$1

# Replace /foo/../bar with /bar
RedirectMatch 301 ^(.*?)/[^/]+/\.\./(.*)$ $1/$2

Phil
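One way to sanity-check those three patterns (plus the earlier slash-collapsing rule) is to apply them repeatedly to a path, the way a chain of 301s would. This is a Python simulation of the intended substitutions, not Apache itself, and it assumes the final rule's replacement is meant to be $1/$2:

```python
import re

# (pattern, replacement) pairs mirroring the RedirectMatch rules above
RULES = [
    (r"^(.*?)//+(.*)$", r"\1/\2"),            # collapse slash clusters
    (r"^(.*?)/\./(.*)$", r"\1/\2"),            # strip /./ segments
    (r"^/\.\./(.*)$", r"/\1"),                 # drop a leading /../
    (r"^(.*?)/[^/]+/\.\./(.*)$", r"\1/\2"),    # resolve /foo/../
]

def normalize(uri):
    """Apply the rules until the URI is stable; each pass corresponds
    to one redirect a real client would follow."""
    changed = True
    while changed:
        changed = False
        for pattern, repl in RULES:
            new = re.sub(pattern, repl, uri, count=1)
            if new != uri:
                uri, changed = new, True
    return uri

print(normalize("/a//b/./c/../d"))  # /a/b/d
```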

Tim

unread,
Nov 1, 2005, 7:35:11 AM11/1/05
to
Tim:

>> It always has been, anyway. For numerous reasons. Your multiple slash
>> example is just one of them. Some robots will ignore them altogether,
>> others will deliberately look at what you tell them to ignore.

Guy Macon:

> The robots.txt protocol has always been ineffective on bad robots, but
> this is, as far as I know, the first example of it being ineffective on
> good robots.

I'm not so sure that it's a fault with robots.txt. After all,
strangeness notwithstanding, ///example isn't the same as /example.
Personally, I think this is an issue you'd need to deal with within the
server (e.g. filter requests to disallow access to URIs with multiple
concurrent slashes in them, rather than work around such conditions).

Dave0x01

unread,
Nov 2, 2005, 5:34:52 PM11/2/05
to
Borek wrote:

> On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <a...@example.com> wrote:
>
>
>>It's not clear exactly what the problem *is*. I've never seen a URL
>>with multiple adjacent forward slashes in my search results. Does
>>someone have an example?

<snip>

> All of these generated 404 in last few weeks on my site.
>
> No additional slashes inside of the url, although several times
> they were added at the end.
>
> & vs &amp; and wrong capitalization (bate, casc instead of BATE, CASC)
> are most prominent sources of errors. But it seems every error is possible
> :)

Sorry, I should've been more clear. I wanted to know whether anyone
could point to an actual URL (e.g., a search query) demonstrating that
URLs with multiple adjacent forward slashes are actually being indexed
by any of the major search engines. I haven't seen one.

However, I don't think that the original poster was concerned with
whether these multiple slashed URLs appear in the index as such, so it's
probably not terribly important.


Dave


Dave0x01

unread,
Nov 2, 2005, 5:45:05 PM11/2/05
to
Guy Macon wrote:

A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.

Dave

Dave0x01

unread,
Nov 2, 2005, 5:51:11 PM11/2/05
to
Philip Ronan wrote:

> "Dave0x1" wrote:
>
>
>>I don't understand why this is a big deal. The issue can be addressed
>>by numerous methods, including patching of the Apache web server source
>>code.
>
>
> OK, so as long as the robots.txt documentation includes a note saying that
> you have to patch your server software to get reliable results, then we'll
> all be fine.

I wouldn't consider patching of the Apache source code either necessary
or desirable in this situation.

>>It's not clear exactly what the problem *is*. I've never seen a URL

>>with multiple adjacent forward slashes in my search results. Does
>>someone have an example?
>
>

> Which bit didn't I explain properly? I'm not going to post a link for you to
> check, but here's the response I got from Google on the issue:
>
>
>>>Thank you for your note. We apologize for our delayed response.
>>>We understand you're concerned about the inclusion of
>>>http://###.####.###//contact/ in our index.

Does the URL in question appear in the index as
<http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
My assumption is the latter.

Dave

Big Bill

unread,
Nov 2, 2005, 7:47:44 PM11/2/05
to

Which is a good way of putting it.

Guy Macon

unread,
Nov 3, 2005, 1:37:49 AM11/3/05
to

A robots.txt file most certainly does decide which parts of a site
are indexed - by good robots. It offers suggestions that every good
robot obeys. The effect we are discussing lets someone else on the
Internet override your good-robot spidering decisions as defined in
robots.txt.


Philip Ronan

unread,
Nov 3, 2005, 5:49:20 AM11/3/05
to
"Dave0x01" wrote:

> Philip Ronan wrote:
>
>> "Dave0x1" wrote:
>>
>>> I don't understand why this is a big deal. The issue can be addressed
>>> by numerous methods, including patching of the Apache web server source
>>> code.
>>
>> OK, so as long as the robots.txt documentation includes a note saying that
>> you have to patch your server software to get reliable results, then we'll
>> all be fine.
>
> I wouldn't consider patching of the Apache source code either necessary
> or desirable in this situation.

I was being sarcastic. (You're American, right?)

> Does the URL in question appear in the index as
> <http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
> My assumption is the latter.

Then what the hell do you think this thread is all about??

For all you doubting Thomases out there:

Exhibit A: http://freespace.virgin.net/phil.ronan/junk/bad-google.png

Exhibit B: http://www.japanesetranslator.co.uk/robots.txt
(Last-Modified: Tue, 01 Mar 2005 08:45:29 GMT)

Bruce Lewis

unread,
Nov 3, 2005, 10:31:08 AM11/3/05
to
Philip Ronan <inv...@invalid.invalid> writes:

> "Nick Kew" wrote:
>
> > [please don't crosspost without warning. Or with inadequate context]
>
> My original post was copied over to ciwah, so now there are two threads with
> the same subject. I'm trying to tie them together, mkay?
>
> > Philip Ronan wrote:
> >>
> >> But retroactively adding to the robots.txt file every time someone posts a
> >> bad link to your site just isn't a practical solution.
> >
> > Who said anything about that?
>
> You did, in your earlier post: [quote]If you have links to things like
> "////" and dumb robots, put the paths in your robots.txt.[/quote]
>
> > What's impractical about "Disallow //" ?
>
> It's a partial solution. If you're trying to protect content at deeper
> levels in the hierarchy, you will also need:
>
> Disallow: /path//to/file
> Disallow: /path/to//file
> Disallow: /path//to//file
> Disallow: /path///to/file
> etc..

You do not want that kind of specificity in your robots.txt file.

Hostile robots will use robots.txt as a menu of "protected" pages to
crawl.

Here's what you want:

Disallow: /unlisted
Disallow: //

Then keep your unlisted contact info under /unlisted/contact/
or better, under /unlisted/86ghb3qx/

This is what I do for ourdoings.com family photo sites that aren't
intended for public view. Additionally I use the meta tags google
recommends. There's still the possibility of someone creating an
external link to the site, but having "unlisted" in the URL advises
people that although they can share it they shouldn't. If someone
creates such a link anyway, good search engines won't follow it.

Philip Ronan

unread,
Nov 3, 2005, 4:12:25 PM11/3/05
to
"Bruce Lewis" wrote:

> You do not want that kind of specificity in your robots.txt file.
>
> Hostile robots will use robots.txt as a menu of "protected" pages to
> crawl.
>
> Here's what you want:
>
> Disallow: /unlisted
> Disallow: //

Yeah, that'll work. I wasn't actually *recommending* putting every
conceivable combination of slashes into the robots.txt file, I was just
trying to point out that "Disallow: //" on its own is inadequate.

As long as you're aware of the problem and doing something about it, then
that's fine.
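Since robots.txt Disallow lines are simple path-prefix matches, the inadequacy of "Disallow: //" on its own is easy to demonstrate. The sketch below models the prefix rule directly rather than relying on any particular crawler's implementation:

```python
def disallowed(path, rules):
    """robots.txt-style matching: a path is blocked if any
    Disallow value is a prefix of it."""
    return any(path.startswith(rule) for rule in rules)

rules = ["//"]
print(disallowed("//foundation/", rules))    # True  - blocked
print(disallowed("/path//to/file", rules))   # False - slips through
print(disallowed("/path/to//file", rules))   # False - slips through
```

A doubled slash deeper in the path never begins with "//", so only a prefix like /unlisted (covering the whole subtree) closes that hole.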


Big Bill

unread,
Nov 7, 2005, 4:33:49 AM11/7/05
to
On Mon, 07 Nov 2005 08:44:45 GMT, "D. Stussy"
<att-...@bde-arc.ampr.org> wrote:

>On Mon, 31 Oct 2005, Big Bill wrote:
>> >> I am still hoping that one of the .htaccess experts will come up
>> >> with a way to make all multiple-slash requests 301 redirect to
>> >> their single-slash versions.
>> >
>> >Trivial. Do it yourself.
>>

>> Umm, no, I think we'll hand the floor over to you at this point.
>
>Use the rewrite engine to compare the URI to the pattern "(.*)//(.*)", using
>apache's syntax. How to do this in other webservers is your problem.
>
>Since the URI is the resource pathname, it won't contain the double slash after
>the protocol and colon. I'm not here to spoon feed you.

Me you have to spoon-feed with stuff like that.

BB
--
www.kruse.co.uk/ s...@kruse.demon.co.uk
The buffalo have gone

Borek

unread,
Nov 7, 2005, 4:50:36 AM11/7/05
to
On Mon, 07 Nov 2005 09:38:30 +0100, D. Stussy <att-...@bde-arc.ampr.org>
wrote:

>> Quoting RFC1738 (BNF description of url):
(...)
>> So it seems you are wrong - multiple slashes in URLs are valid.

> However, this is usually further restricted by the filesystem naming
> conventions and that's where it's not proper.

Give reference. cd //user//local////www/data works under Linux
and FreeBSD, cd winnt\\cache works under Windows. They are not
restricted by filesystem so they are proper.

Best,
Borek
--
http://www.chembuddy.com - chemical calculators for labs and education
BATE - program for pH calculations
CASC - Concentration and Solution Calculator
pH lectures - guide to hand pH calculation with examples
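Borek's shell experiment is consistent with POSIX pathname resolution, where interior runs of slashes collapse to one, with the single exception that a path beginning with exactly two slashes may be treated specially. Python's `posixpath.normpath` encodes exactly that rule, so it makes a convenient cross-check:

```python
import posixpath

# Interior slash runs collapse to one, matching the cd experiment above
print(posixpath.normpath("/user//local////www/data"))  # /user/local/www/data

# POSIX leaves a path starting with exactly two slashes
# implementation-defined, so normpath preserves that prefix
print(posixpath.normpath("//user/local"))   # //user/local
print(posixpath.normpath("///user/local"))  # /user/local
```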

Borek

unread,
Nov 7, 2005, 4:53:55 AM11/7/05
to
On Mon, 07 Nov 2005 09:44:45 +0100, D. Stussy <att-...@bde-arc.ampr.org>
wrote:

> Since the URI is the resource pathname, it won't contain the double

> slash after the protocol and colon.

As I told - give reference. So far it is only PbBA (i.e. Proof by Bold
Assertion)

Guy Macon

unread,
Nov 9, 2005, 9:43:06 AM11/9/05
to


Borek wrote:

>D. Stussy wrote:
>
>>> Quoting RFC1738 (BNF description of url):
>(...)
>>> So it seems you are wrong - multiple slashes in URLs are valid.
>
>> However, this is usually further restricted by the filesystem naming
>> conventions and that's where it's not proper.
>
>Give reference. cd //user//local////www/data works under Linux
>and FreeBSD, cd winnt\\cache works under Windows. They are not
>restricted by filesystem so they are proper.

>> Since the URI is the resource pathname, it won't contain the double

>> slash after the protocol and colon.
>
>As I told - give reference. So far it is only PbBA (i.e. Proof by Bold
>Assertion)

Gosh, it sure got quiet all of a sudden... :)

Dave0x01

unread,
Nov 15, 2005, 10:00:22 PM11/15/05
to
Philip Ronan wrote:

> "Dave0x01" wrote:
>
>
>>Philip Ronan wrote:

>>>OK, so as long as the robots.txt documentation includes a note saying that
>>>you have to patch your server software to get reliable results, then we'll
>>>all be fine.
>>
>>I wouldn't consider patching of the Apache source code either necessary
>>or desirable in this situation.
>
>
> I was being sarcastic. (You're American, right?)

Yeah, I could tell. And I *wasn't* being sarcastic. What about my
comment do you think implies otherwise?

>>Does the URL in question appear in the index as
>><http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
>>My assumption is the latter.
>
>
> Then what the hell do you think this thread is all about??

[snip]

One could obviously be concerned about any number of things resulting
from the behavior described.
