Bug: WEBrick handler does not set the unescaped PATH_INFO

Simon Chiang

Mar 8, 2009, 3:15:36 AM
to Rack Development
For instance with this:

require 'rack'
app = lambda {|env| [200, {}, [env['PATH_INFO']]] }
Rack::Handler::WEBrick.run(app, :Port => 8080)

A request to 'http://localhost:8080/percent%3Aencoding' returns:

percent:encoding

Rather than:

percent%3Aencoding

The same is not true for Thin and Mongrel so I think it's a bug. I
fixed the issue in a fork of the rack repository:
http://github.com/bahuvrihi/rack/commit/f88976c314dbab84a001610996e5f69f4dad25eb

Sam Roberts

Mar 8, 2009, 1:52:22 PM
to rack-...@googlegroups.com
On Sun, Mar 8, 2009 at 12:15 AM, Simon Chiang <simon.a...@gmail.com> wrote:
> A request to 'http://localhost:8080/percent%3Aencoding' returns:
>
>  percent:encoding
>
> Rather than:
>
>  percent%3Aencoding
>
> The same is not true for Thin and Mongrel so I think it's a bug.  I
> fixed the issue in a fork of the rack repository:
> http://github.com/bahuvrihi/rack/commit/f88976c314dbab84a001610996e5f69f4dad25eb

And the Rack spec should explicitly specify this behaviour, as the CGI
spec does:

http://hoohoo.ncsa.uiuc.edu/cgi/env.html

Cheers,
Sam

diff --git a/lib/rack/lint.rb b/lib/rack/lint.rb
index 7eb0543..66d252b 100644
--- a/lib/rack/lint.rb
+++ b/lib/rack/lint.rb
@@ -88,7 +88,9 @@ module Rack
  ## within the application. This may be an
  ## empty string, if the request URL targets
  ## the application root and does not have a
- ## trailing slash.
+ ## trailing slash. This information should be
+ ## decoded by the server if it comes from a
+ ## URL.

Christian Neukirchen

Mar 8, 2009, 5:37:44 PM
to rack-...@googlegroups.com
Sam Roberts <vieu...@gmail.com> writes:

>> The same is not true for Thin and Mongrel so I think it's a bug.  I
>> fixed the issue in a fork of the rack repository:
>> http://github.com/bahuvrihi/rack/commit/f88976c314dbab84a001610996e5f69f4dad25eb
>
> And the Rack spec should explicitly specify this behaviour, as the CGI
> spec does:

Both applied, thanks.

--
Christian Neukirchen <chneuk...@gmail.com> http://chneukirchen.org

candlerb

Mar 10, 2009, 10:23:14 AM
to Rack Development
In the context of Rack, would it be clearer to say "should be decoded
by the application" rather than "should be decoded by the server"?

Ryan Tomayko

Mar 10, 2009, 7:01:19 PM
to rack-...@googlegroups.com
On Sun, Mar 8, 2009 at 2:37 PM, Christian Neukirchen
<chneuk...@gmail.com> wrote:
>
> Sam Roberts <vieu...@gmail.com> writes:
>
>>> The same is not true for Thin and Mongrel so I think it's a bug.  I
>>> fixed the issue in a fork of the rack repository:
>>> http://github.com/bahuvrihi/rack/commit/f88976c314dbab84a001610996e5f69f4dad25eb
>>
>> And the Rack spec should explicitly specify this behaviour, as the CGI
>> spec does:
>
> Both applied, thanks.

I think the implementation and spec are at odds now, no? The spec says
PATH_INFO should be decoded but the handlers all leave PATH_INFO
encoded. Am I reading this wrong?

Thanks,
Ryan

candlerb

Mar 11, 2009, 7:52:22 AM
to Rack Development
> I think the implementation and spec are at odds now, no? The spec says
> PATH_INFO should be decoded but the handlers all leave PATH_INFO
> encoded. Am I reading this wrong?

The current implementation is what makes sense to me. Without it, an
application wouldn't be able to tell the difference between /foo%2Fbar/
and /foo/bar/ (which are semantically different).
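
To make the point above concrete, here is a minimal sketch using only core Ruby. The percent_decode helper is a stand-in written for this illustration (it is not a Rack API); it decodes every %XX sequence, which is roughly what a fully decoding handler would do.

```ruby
# Stand-in decoder for illustration: replaces every %XX with its character.
def percent_decode(str)
  str.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
end

raw_a = "/foo%2Fbar/"   # one path segment containing an encoded slash
raw_b = "/foo/bar/"     # two path segments, "foo" and "bar"

decoded_a = percent_decode(raw_a)
decoded_b = percent_decode(raw_b)

puts decoded_a                 # /foo/bar/
puts decoded_b                 # /foo/bar/
puts decoded_a == decoded_b    # true
```

Once both requests reach the application as the same string, nothing in PATH_INFO records which one was sent.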

However this may differ from CGI practice. Let me just test this with
an old-skool CGI under apache 2.2.8 (Ubuntu Hardy):

#!/usr/bin/ruby
puts "Content-Type: text/plain"
puts
puts "PATH_INFO = #{ENV['PATH_INFO'].inspect}"

Hmm, strange.
http://localhost/cgi-bin/test-cgi
http://localhost/cgi-bin/test-cgi/foo
http://localhost/cgi-bin/test-cgi/foo/bar
all work as expected. But
http://localhost/cgi-bin/test-cgi/foo%2Fbar
gives a 404 error!

http://localhost/cgi-bin/test.cgi/foo%2Abar
does work, and gives a result of
PATH_INFO = "/foo*bar"

Interestingly, with this test, Firefox updated its URL bar to
.../foo*bar as well. However, the Apache logs show that the request was
received using %2A, and a 200 response was sent, not a redirect.

So it seems a bit of a mess.

Rack can specify whatever behaviour it likes, but the problem if we
say that handlers should *not* decode PATH_INFO is that in some cases
it may have already been done (e.g. when Rack is running as a CGI).

B.

Christian Neukirchen

Mar 11, 2009, 8:49:49 AM
to rack-...@googlegroups.com
candlerb <b.ca...@pobox.com> writes:

> Rack can specify whatever behaviour it likes, but the problem if we
> say that handlers should *not* decode PATH_INFO is that in some cases
> it may have already been done (e.g. when Rack is running as a CGI).

When would it be useful to have it not decoded?

candlerb

Mar 11, 2009, 3:55:03 PM
to Rack Development
On Mar 11, 12:49 pm, Christian Neukirchen <chneukirc...@gmail.com>
wrote:
> candlerb <b.cand...@pobox.com> writes:
> > Rack can specify whatever behaviour it likes, but the problem if we
> > say that handlers should *not* decode PATH_INFO is that in some cases
> > it may have already been done (e.g. when Rack is running as a CGI).
>
> When would it be useful to have it not decoded?

/invoices/2009%2F1234/print

From RFC 3986:

"The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
Percent-encoding a reserved character, or decoding a percent-encoded
octet that corresponds to a reserved character, will change how the
URI is interpreted by most applications."

Or consider this:

helpers do
  def build_path(*path_components)
    path_components.map { |c| escape(c) }.join("/")
  end

  # If the path has already been decoded, we cannot
  # implement the inverse function accurately:
  def split_path(path)
    path.split("/").map { |c| unescape(c) }
  end
end
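
The helpers above can be tried outside a framework with a rough, self-contained sketch. The escape and unescape methods here are simple stand-ins written for this example (assumptions, not the framework's own helpers); the invoice path is the one mentioned earlier in this message.

```ruby
# Stand-in escape: percent-encode everything except RFC 3986 unreserved chars.
def escape(component)
  component.gsub(/[^A-Za-z0-9\-._~]/) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

# Stand-in unescape: decode every %XX sequence.
def unescape(component)
  component.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
end

def build_path(*path_components)
  path_components.map { |c| escape(c) }.join("/")
end

def split_path(path)
  path.split("/").map { |c| unescape(c) }
end

components = ["invoices", "2009/1234", "print"]
raw = build_path(*components)
puts raw                      # invoices/2009%2F1234/print

# The round trip works on the raw (still escaped) path:
p split_path(raw)             # ["invoices", "2009/1234", "print"]

# If the path was decoded before the application saw it, the
# boundary between components is lost:
p split_path(unescape(raw))   # ["invoices", "2009", "1234", "print"]
```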

However there are sufficiently many broken HTTP implementations around
that can't parse this properly, that it would be unsurprising if Rack
were similarly broken. So I won't push too hard for it.

Scytrin dai Kinthra

Mar 12, 2009, 2:19:44 PM
to rack-...@googlegroups.com
I'd promote on this piece to leave it encoded so that it isn't broken
as standard.
Following "standard convention" is nice, but I'd rather follow the standards.

--
stadik.net

Sam Roberts

Mar 13, 2009, 12:09:36 AM
to rack-...@googlegroups.com
On Thu, Mar 12, 2009 at 11:19 AM, Scytrin dai Kinthra <scy...@gmail.com> wrote:
> I'd promote on this piece to leave it encoded so that it isn't broken
> as standard.
> Following "standard convention" is nice, but I'd rather follow the standards.

The Rack spec adopts the definitions of CGI in terms of what it passes
in the request.

PATH_INFO comes from the CGI spec, and it says it should be decoded.

It also says a server MAY reject a request as invalid that has URL
encoded '/' characters, because (as you point out), it causes loss of
information.

"The server MAY impose restrictions and limitations on what values it
permits for PATH_INFO, and MAY reject the request with an error if it
encounters any values considered objectionable. That MAY include any
requests that would result in an encoded "/" being decoded into
PATH_INFO, as this might represent a loss of information to the
script."

- http://www.ietf.org/rfc/rfc3875.txt, section 4.1.5
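
As a rough illustration of the option the CGI spec allows, the sketch below decodes a path but refuses any request that contains an encoded "/". The method name and the error class are made up for this example; this is not how any particular server or Rack handler implements it.

```ruby
# Hypothetical error class for this sketch.
class EncodedSlashError < StandardError; end

# Decode a raw path, rejecting it if decoding would turn %2F into "/".
def decode_path_info(raw_path)
  if raw_path.match?(/%2F/i)
    raise EncodedSlashError, "encoded '/' in #{raw_path.inspect}"
  end
  raw_path.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
end

puts decode_path_info("/percent%3Aencoding")   # /percent:encoding

begin
  decode_path_info("/invoices/2009%2F1234/print")
rescue EncodedSlashError => e
  puts "rejected: #{e.message}"
end
```

This avoids the silent loss of information, at the cost of making such URLs unusable altogether.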

Maybe the PATH_INFO should obey the CGI spec, but there should be a
rack-specific env variable ("rack.path_info") that doesn't
url-decode the path?
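
A minimal sketch of what that could look like, written as a plain Rack-style middleware with no gem dependency. The "rack.path_info" key is only the name floated here, not part of the Rack SPEC, and the DecodePathInfo class is hypothetical; it assumes the handler delivers PATH_INFO still escaped.

```ruby
# Hypothetical middleware: keeps the raw path under "rack.path_info"
# and puts a CGI-style decoded copy in PATH_INFO.
class DecodePathInfo
  def initialize(app)
    @app = app
  end

  def call(env)
    raw = env["PATH_INFO"].to_s   # assumed to arrive still escaped
    decoded = raw.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
    @app.call(env.merge("rack.path_info" => raw, "PATH_INFO" => decoded))
  end
end

app = lambda do |env|
  body = "decoded=#{env['PATH_INFO']} raw=#{env['rack.path_info']}"
  [200, { "Content-Type" => "text/plain" }, [body]]
end

status, _headers, body =
  DecodePathInfo.new(app).call("PATH_INFO" => "/invoices/2009%2F1234/print")
puts status       # 200
puts body.first   # decoded=/invoices/2009/1234/print raw=/invoices/2009%2F1234/print
```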


Might be worth looking at wsapi to see what they do, probably a wealth
of experience there.

Cheers,
Sam

candlerb

Mar 13, 2009, 5:30:32 AM
to Rack Development
> Maybe the PATH_INFO should obey the CGI spec, but there should be a
> rack-specific env variable ("rack.path_info") that either doesn't
> url-decode the path?

In that scenario, any middleware which alters PATH_INFO will also have
to be careful to make corresponding changes to rack.path_info. Another
option would be to have rack.path_info how we want it, and have a CGI
compat middleware which you can stick on the top of the stack just
below the application.

But let's reconsider the CGI definition, given that we are talking
only about the PATH_INFO portion. RFC 3986 allows non-reserved
characters to be unescaped. Also, the definition of path in section
3.3 allows sub-delims and : and @ to appear unencoded.

So the only characters which cannot appear unencoded are / ? # [ ]

The server will already have dealt with ? and # by trimming off the
query string and anchor.

As for [ and ]

"A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax."

So actually, it's safe to unencode everything *except* %2F. This could
be achieved by:

path.split(/%2F/i, -1).map { |p| unencode(p) }.join("%2F")

When I say "safe" here I mean "unambiguous". If you were to use the
PATH_INFO to construct another URI, e.g. when proxying to another
server, you would have to remember that certain characters seen in
plain form in PATH_INFO must actually have been encoded in the
original request and therefore need re-encoding.
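
A runnable sketch of the partial decoding described above. The unencode method is a stand-in written for this example, and partial_decode is a hypothetical name. The -1 limit passed to split is an addition here: without it Ruby drops trailing empty pieces, so a path ending in %2F would lose that last encoded slash.

```ruby
# Stand-in decoder: replaces every %XX with its character.
def unencode(str)
  str.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
end

# Decode everything except encoded slashes, which are normalised to "%2F".
def partial_decode(path)
  path.split(/%2F/i, -1).map { |part| unencode(part) }.join("%2F")
end

puts partial_decode("/invoices/2009%2F1234/print")   # /invoices/2009%2F1234/print
puts partial_decode("/percent%3Aencoding")           # /percent:encoding
puts partial_decode("/foo%2f")                       # /foo%2F
```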

Ryan Tomayko

Mar 13, 2009, 7:38:10 AM
to rack-...@googlegroups.com

Except there's this: "/foo%252Fbar".

Thanks,
Ryan
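
The objection can be checked with a short sketch. It assumes the "decode everything except %2F" rule proposed earlier in the thread; partial_decode is a hypothetical name for that rule.

```ruby
# Decode every %XX except encoded slashes, which stay as "%2F".
def partial_decode(path)
  path.split(/%2F/i, -1)
      .map { |part| part.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr } }
      .join("%2F")
end

literal_percent = "/foo%252Fbar"   # segment is the literal text "foo%2Fbar"
encoded_slash   = "/foo%2Fbar"     # segment is the text "foo/bar"

puts partial_decode(literal_percent)   # /foo%2Fbar
puts partial_decode(encoded_slash)     # /foo%2Fbar
```

Two different requests end up as the same partially decoded string, so the scheme stays ambiguous unless %25 is left encoded as well.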

Christian Neukirchen

Mar 13, 2009, 9:58:16 AM
to rack-...@googlegroups.com
Ryan Tomayko <r...@tomayko.com> writes:

> Except there's this: "/foo%252Fbar".

Escaping is hell on earth.

Sam Roberts

Mar 13, 2009, 2:13:43 PM
to rack-...@googlegroups.com
On Fri, Mar 13, 2009 at 2:30 AM, candlerb <b.ca...@pobox.com> wrote:
>> Maybe the PATH_INFO should obey the CGI spec, but there should be a
>> rack-specific env variable ("rack.path_info") that either doesn't
>> url-decode the path?
>
> In that scenario, any middleware which alters PATH_INFO will also have
> to be careful to make corresponding changes to rack.path_info. Another

Probably, though it might depend on why the middleware is modifying
the env. I'd think doing so would generally be a bad idea. There is a
lot of redundancy in the env as passed by Apache, anyway. Middleware
doesn't have a good chance of meaningfully rewriting it all.

> So the only characters which cannot appear unencoded are / ? # [ ]

And %.

> So actually, it's safe to unencode everything *except* %2F. This could
> be achieved by:

And %25, which used to arrive in the PATH_INFO decoded, so this seems
to be an attempt to make handling / in path components unambiguously
possible, at the expense of making % harder.

Also, how would you reconstruct the original URL from such a
"partially encoded" PATH_INFO? This would break:

http://www.python.org/dev/peps/pep-0333/#url-reconstruction

If rack just follows the CGI spec for CGI vars, and tries to present
the original undecoded data elsewhere we have standard conformance and
non-loss of data.
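
For reference, a rough Ruby sketch of URL reconstruction in the spirit of the PEP 333 recipe. It assumes SCRIPT_NAME and PATH_INFO arrive still percent-encoded, so plain concatenation is enough; reconstruct_url is a hypothetical helper, and example.com is a placeholder host.

```ruby
# Rebuild the request URL from a Rack-style env hash.
def reconstruct_url(env)
  scheme = env["rack.url_scheme"]
  url = "#{scheme}://"
  if env["HTTP_HOST"]
    url << env["HTTP_HOST"]
  else
    url << env["SERVER_NAME"]
    default_port = scheme == "https" ? "443" : "80"
    url << ":#{env['SERVER_PORT']}" unless env["SERVER_PORT"].to_s == default_port
  end
  url << env["SCRIPT_NAME"].to_s << env["PATH_INFO"].to_s
  query = env["QUERY_STRING"].to_s
  url << "?#{query}" unless query.empty?
  url
end

raw_env = {
  "rack.url_scheme" => "http",
  "SERVER_NAME"     => "example.com",
  "SERVER_PORT"     => "80",
  "SCRIPT_NAME"     => "",
  "PATH_INFO"       => "/invoices/2009%2F1234/print",
  "QUERY_STRING"    => ""
}
decoded_env = raw_env.merge("PATH_INFO" => "/invoices/2009/1234/print")

puts reconstruct_url(raw_env)       # http://example.com/invoices/2009%2F1234/print
puts reconstruct_url(decoded_env)   # http://example.com/invoices/2009/1234/print
```

With a decoded PATH_INFO the rebuilt URL names a different resource, and re-escaping cannot tell which slashes were originally encoded.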

I totally sympathize with your goal of making the rack spec allow
stuff you can theoretically do with HTTP, but I don't think partially
encoded PATH_INFO will really help.

The app I'm working on relies on URL reconstruction. It also would
benefit very much from being able to use a full URL as a
path-component... but even though HTTP's escaping rules would allow
that, it's pretty clear that its chance of working with actually
deployed code is low.

I wanted to do:

http://example.com/ics/http:%2f%2fsome.site.com%2fcalendars%2fevents.ics/atom

But since I will never (famous last words) have more than a single URL
in my path, anyway, I just dump it after the ? as the query info,
which works fine:

http://example.com/ics/atom?http://some.site.com/calendars/events.ics

And ends up easier to construct, anyway.

This, btw, is how I found that the query info was being injected into
the ARGV... I was getting server 500 errors and rackup complaining
that "http://some.site.com/calendars/events.ics" was not a valid
configuration, because it was ARGV[0], and rackup was trying to open
it as a config file.

Sam

Magnus Holm

Mar 13, 2009, 2:17:17 PM
to rack-...@googlegroups.com
Indeed. Does anyone know how the frameworks handle this? Do they just unescape the whole PATH_INFO (Camping does at least) or do they do anything fancier?

//Magnus Holm

candlerb

Mar 15, 2009, 4:44:28 PM
to Rack Development
On Mar 13, 6:13 pm, Sam Roberts <vieuxt...@gmail.com> wrote:
> > In that scenario, any middleware which alters PATH_INFO will also have
> > to be careful to make corresponding changes to rack.path_info. Another
>
> Probably, though it might depend on why the middleware is modifying
> the env. I'd think doing so would generally be a bad idea.

Rack::URLMap is the canonical example.
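
To illustrate why, here is a much simplified sketch of prefix mapping. It is NOT the actual Rack::URLMap code; PrefixMap is a hypothetical class, and "rack.path_info" is the hypothetical extra key discussed above. The middleware moves the matched prefix from PATH_INFO to SCRIPT_NAME and knows nothing about the second key.

```ruby
# Simplified prefix-mapping middleware (illustration only).
class PrefixMap
  def initialize(prefix, app)
    @prefix = prefix
    @app = app
  end

  def call(env)
    path = env["PATH_INFO"].to_s
    unless path == @prefix || path.start_with?("#{@prefix}/")
      return [404, { "Content-Type" => "text/plain" }, ["Not Found"]]
    end
    # Move the prefix from PATH_INFO to SCRIPT_NAME.
    @app.call(env.merge(
      "SCRIPT_NAME" => env["SCRIPT_NAME"].to_s + @prefix,
      "PATH_INFO"   => path[@prefix.length..-1]
    ))
  end
end

inner = lambda do |env|
  line = "#{env['SCRIPT_NAME']} | #{env['PATH_INFO']} | #{env['rack.path_info']}"
  [200, { "Content-Type" => "text/plain" }, [line]]
end

env = {
  "SCRIPT_NAME"    => "",
  "PATH_INFO"      => "/blog/2009%2F03",
  "rack.path_info" => "/blog/2009%2F03"
}
_status, _headers, body = PrefixMap.new("/blog", inner).call(env)
puts body.first   # /blog | /2009%2F03 | /blog/2009%2F03
```

The third field still carries the /blog prefix, which is the kind of drift a second path variable would invite.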

> > So the only characters which cannot appear unencoded are / ? # [ ]
>
> And %.

You (and Ryan and Christian) are right of course. It really has to be
one thing or the other.

Aside: since this is Ruby we're talking about, we're not limited to
just strings. For example, PATH_INFO could be defined to be an array
of path components. Probably doesn't make life easier for anyone
though, compared with just having the original path available.

Regards,

Brian.

candlerb

Mar 20, 2009, 12:37:30 PM
to Rack Development
On Mar 11, 12:49 pm, Christian Neukirchen <chneukirc...@gmail.com>
wrote:
> candlerb <b.cand...@pobox.com> writes:
> > Rack can specify whatever behaviour it likes, but the problem if we
> > say that handlers should *not* decode PATH_INFO is that in some cases
> > it may have already been done (e.g. when Rack is running as a CGI).
>
> When would it be useful to have it not decoded?

I just came across a practical example of this.

Apache CouchDB <http://couchdb.apache.org/> provides an HTTP API. The
first component of the path is the database name. You are allowed to
specify a database name which includes slashes, but they must be
encoded as %2F. e.g.

http://127.0.0.1:5984/dev%2Fcustomers/...etc

If you do this, then it places the database file on disk under a
subdirectory hierarchy matching the database name, e.g.

/usr/local/var/lib/couchdb/dev/customers.couch
                           ^^^^^^^^^^^^^

Christian Neukirchen

Mar 25, 2009, 9:26:49 AM
to rack-...@googlegroups.com
candlerb <b.ca...@pobox.com> writes:

> I just came across a practical example of this.

Since most webservers leave it with escapes and we have a patch to fix
WEBrick to leave it escaped as well, I reverted 7a3d21f4b469d5ce; web
frameworks now have to unescape for themselves.

I clarified the SPEC accordingly.
