diff --git a/lib/rack/lint.rb b/lib/rack/lint.rb
index 7eb0543..66d252b 100644
--- a/lib/rack/lint.rb
+++ b/lib/rack/lint.rb
@@ -88,7 +88,9 @@ module Rack
  ## within the application. This may be an
  ## empty string, if the request URL targets
  ## the application root and does not have a
- ## trailing slash.
+ ## trailing slash. This information should be
+ ## decoded by the server if it comes from a
+ ## URL.
>> And the Rack spec should explicitly specify this behaviour, as the CGI
>> spec does:
> Both applied, thanks.
I think the implementation and spec are at odds now, no? The spec says PATH_INFO should be decoded but the handlers all leave PATH_INFO encoded. Am I reading this wrong?
> I think the implementation and spec are at odds now, no? The spec says
> PATH_INFO should be decoded but the handlers all leave PATH_INFO
> encoded. Am I reading this wrong?
The current implementation is what makes sense to me. Without it, an
application wouldn't be able to tell the difference between /foo%2Fbar/
and /foo/bar/ (which are semantically different).
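A minimal sketch of that loss of information, using Ruby's standard `URI.decode_www_form_component` as a stand-in for whatever decoding a server or handler might apply (the variable names are only for illustration):

```ruby
require "uri"

encoded_slash = "/foo%2Fbar/"   # one path segment containing a literal slash
plain_slash   = "/foo/bar/"     # two path segments

# Once decoded, the two requests can no longer be told apart.
decoded = URI.decode_www_form_component(encoded_slash)
same_after_decoding = (decoded == plain_slash)

# Splitting before decoding keeps the segment boundary intact.
segments_raw     = encoded_slash.split("/").reject(&:empty?)
segments_decoded = decoded.split("/").reject(&:empty?)

puts same_after_decoding
puts segments_raw.inspect
puts segments_decoded.inspect
```

The raw form still has a single segment, while the decoded form has two.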
However this may differ from CGI practice. Let me just test this with
an old-skool CGI under apache 2.2.8 (Ubuntu Hardy):
Interestingly, with this test, Firefox updated its URL bar to
.../foo*bar as well. However Apache logs show that the request was
received using %2A, and a 200 response was sent, not a redirect.
So it seems a bit of a mess.
Rack can specify whatever behaviour it likes, but the problem if we
say that handlers should *not* decode PATH_INFO is that in some cases
it may have already been done (e.g. when Rack is running as a CGI).
On Mar 11, 12:49 pm, Christian Neukirchen <chneukirc...@gmail.com>
wrote:
> candlerb <b.cand...@pobox.com> writes:
> > Rack can specify whatever behaviour it likes, but the problem if we
> > say that handlers should *not* decode PATH_INFO is that in some cases
> > it may have already been done (e.g. when Rack is running as a CGI).
> When would it be useful to have it not decoded?
/invoices/2009%2F1234/print
From RFC 3986:
"The purpose of reserved characters is to provide a set of delimiting
characters that are distinguishable from other data within a URI.
URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
Percent-encoding a reserved character, or decoding a percent-encoded
octet that corresponds to a reserved character, will change how the
URI is interpreted by most applications."
Or consider this:
helpers do
  def build_path(*path_components)
    path_components.map { |c| escape(c) }.join("/")
  end
  # If the path has already been decoded, we cannot
  # implement the inverse function accurately:
  def split_path(path)
    path.split("/").map { |c| unescape(c) }
  end
end
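A standalone sketch of the same two helpers, runnable outside a framework. It assumes `URI.encode_www_form_component` and `URI.decode_www_form_component` from the standard library in place of the `escape`/`unescape` helpers above, and the invoice path is only an illustration:

```ruby
require "uri"

def build_path(*components)
  components.map { |c| URI.encode_www_form_component(c) }.join("/")
end

def split_path(path)
  path.split("/").map { |c| URI.decode_www_form_component(c) }
end

path = build_path("invoices", "2009/1234", "print")

# Splitting the still-encoded path recovers the original components.
round_trip = split_path(path)

# If something decodes the path first, the component boundary is lost.
decoded_first = URI.decode_www_form_component(path)
lossy = split_path(decoded_first)

puts path
puts round_trip.inspect
puts lossy.inspect
```

With the encoded path, the invoice number survives as one component; after an early decode it is split in two.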
However there are sufficiently many broken HTTP implementations around
that can't parse this properly, that it would be unsurprising if Rack
were similarly broken. So I won't push too hard for it.
I'd argue for leaving it encoded here, so that it isn't broken as standard. Following "standard convention" is nice, but I'd rather follow the standards.
On Wed, Mar 11, 2009 at 12:55, candlerb <b.cand...@pobox.com> wrote:
> On Mar 11, 12:49 pm, Christian Neukirchen <chneukirc...@gmail.com>
> wrote:
>> candlerb <b.cand...@pobox.com> writes:
>> > Rack can specify whatever behaviour it likes, but the problem if we
>> > say that handlers should *not* decode PATH_INFO is that in some cases
>> > it may have already been done (e.g. when Rack is running as a CGI).
>> When would it be useful to have it not decoded?
> /invoices/2009%2F1234/print
> From RFC 3986:
> "The purpose of reserved characters is to provide a set of delimiting
> characters that are distinguishable from other data within a URI.
> URIs that differ in the replacement of a reserved character with its
> corresponding percent-encoded octet are not equivalent.
> Percent-encoding a reserved character, or decoding a percent-encoded
> octet that corresponds to a reserved character, will change how the
> URI is interpreted by most applications."
> Or consider this:
> helpers do
>   def build_path(*path_components)
>     path_components.map { |c| escape(c) }.join("/")
>   end
>   # If the path has already been decoded, we cannot
>   # implement the inverse function accurately:
>   def split_path(path)
>     path.split("/").map { |c| unescape(c) }
>   end
> end
> However there are sufficiently many broken HTTP implementations around
> that can't parse this properly, that it would be unsurprising if Rack
> were similarly broken. So I won't push too hard for it.
On Thu, Mar 12, 2009 at 11:19 AM, Scytrin dai Kinthra <scyt...@gmail.com> wrote:
> I'd argue for leaving it encoded here, so that it isn't broken
> as standard.
> Following "standard convention" is nice, but I'd rather follow the standards.
The Rack spec adopts the definitions of CGI in terms of what it passes in the request.
PATH_INFO comes from the CGI spec, and it says it should be decoded.
It also says a server MAY reject a request as invalid that has URL encoded '/' characters, because (as you point out), it causes loss of information.
The server MAY impose restrictions and limitations on what values it permits for PATH_INFO, and MAY reject the request with an error if it encounters any values considered objectionable. That MAY include any requests that would result in an encoded "/" being decoded into PATH_INFO, as this might represent a loss of information to the script.
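As a rough sketch of what such a restriction could look like, the check below flags any raw request path containing an encoded slash. The method names and the choice of a 400 status are assumptions for illustration; the CGI spec only says the server MAY reject the request with an error.

```ruby
# Hypothetical helper: true if decoding this raw path would turn an
# encoded "/" into a real path separator.
def objectionable_path?(raw_path)
  raw_path.match?(/%2f/i)
end

# Hypothetical policy: answer 400 for objectionable paths, 200 otherwise.
def status_for(raw_path)
  objectionable_path?(raw_path) ? 400 : 200
end

rejected = status_for("/invoices/2009%2F1234/print")
accepted = status_for("/invoices/2009/1234/print")

puts rejected
puts accepted
```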
Maybe the PATH_INFO should obey the CGI spec, but there should be a rack-specific env variable ("rack.path_info") that doesn't url-decode the path?
Might be worth looking at wsapi to see what they do, probably a wealth of experience there.
> Maybe the PATH_INFO should obey the CGI spec, but there should be a
> rack-specific env variable ("rack.path_info") that doesn't
> url-decode the path?
In that scenario, any middleware which alters PATH_INFO will also have
to be careful to make corresponding changes to rack.path_info. Another
option would be to have rack.path_info how we want it, and have a CGI
compat middleware which you can stick on the top of the stack just
below the application.
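A sketch of what that compat middleware might look like. It assumes the raw path arrives in the "rack.path_info" key proposed above (not part of any Rack release), and the class name `CGICompatPathInfo` is made up for this example:

```ruby
# Hypothetical middleware: derives a CGI-style, fully decoded PATH_INFO
# from the raw path stored under the proposed "rack.path_info" key.
class CGICompatPathInfo
  def initialize(app)
    @app = app
  end

  def call(env)
    raw = env["rack.path_info"]
    env["PATH_INFO"] = percent_decode(raw) if raw
    @app.call(env)
  end

  private

  # Decodes every %XX escape, including %2F, as the CGI spec describes.
  def percent_decode(str)
    str.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
  end
end

# A trivial application that echoes the PATH_INFO it was given.
app = lambda { |env| [200, { "content-type" => "text/plain" }, [env["PATH_INFO"]]] }
stack = CGICompatPathInfo.new(app)

status, _headers, body = stack.call("rack.path_info" => "/invoices/2009%2F1234/print")
puts status
puts body.first
```

An application that wants the CGI view sits behind this middleware; everything above it in the stack keeps working with the raw path.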
But let's reconsider the CGI definition, given that we are talking
only about the PATH_INFO portion. RFC 3986 allows non-reserved
characters to be unescaped. Also, the definition of path in section
3.3 allows sub-delims and : and @ to appear unencoded.
So the only characters which cannot appear unencoded are / ? # [ ]
The server will already have dealt with ? and # by trimming off the
query string and anchor.
As for [ and ]
"A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax."
So actually, it's safe to unencode everything *except* %2F. This could
be achieved by:
When I say "safe" here I mean "unambiguous". If you were to use the
PATH_INFO to construct another URI, e.g. when proxying to another
server, you would have to remember that certain characters seen in
plain form in PATH_INFO must actually have been encoded in the
original request and therefore need re-encoding.
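One way the partial decoding described above (everything except %2F) might be written. This is a sketch, not the code from the original message, and the method name `partial_decode` is an assumption:

```ruby
# Decodes every %XX escape except an encoded slash, which is kept as-is
# so that it stays distinguishable from a real path separator.
def partial_decode(raw_path)
  raw_path.gsub(/%([0-9A-Fa-f]{2})/) do |escape|
    escape.upcase == "%2F" ? escape : $1.hex.chr
  end
end

star  = partial_decode("/foo%2Abar")
slash = partial_decode("/foo%2Fbar")
mixed = partial_decode("/a%20b%2fc")

puts star
puts slash
puts mixed
```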
On Fri, Mar 13, 2009 at 2:30 AM, candlerb <b.cand...@pobox.com> wrote:
>> Maybe the PATH_INFO should obey the CGI spec, but there should be a
>> rack-specific env variable ("rack.path_info") that doesn't
>> url-decode the path?
> In that scenario, any middleware which alters PATH_INFO will also have
> to be careful to make corresponding changes to rack.path_info. Another
Probably, though it might depend on why the middleware is modifying the env. I'd think doing so would generally be a bad idea. There is a lot of redundancy in the env as passed by Apache, anyway. Middleware doesn't have a good chance of meaningfully rewriting it all.
> So the only characters which cannot appear unencoded are / ? # [ ]
And %.
> So actually, it's safe to unencode everything *except* %2F. This could
> be achieved by:
And %25, which used to arrive in the PATH_INFO decoded, so this seems to be an attempt to make handling / in path components unambiguously possible, at the expense of making % harder.
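To make the % problem concrete, here is a sketch of a decoder that leaves only %2F encoded (the name `partial_decode` is hypothetical). Two different raw paths end up as the same string, so the result is still ambiguous:

```ruby
# Hypothetical partial decoder: every %XX escape is decoded except %2F.
def partial_decode(raw_path)
  raw_path.gsub(/%([0-9A-Fa-f]{2})/) do |escape|
    escape.upcase == "%2F" ? escape : $1.hex.chr
  end
end

# A component containing a literal slash ...
with_slash   = partial_decode("/a%2Fb")
# ... and a component containing the literal text "%2F".
with_percent = partial_decode("/a%252Fb")

collision = (with_slash == with_percent)

puts with_slash
puts with_percent
puts collision
```

Because %25 is decoded to a bare %, the text "%2F" in the output could have come from either request.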
Also, how would you reconstruct the original URL from such a "partially encoded" PATH_INFO? This would break:
If rack just follows the CGI spec for CGI vars, and tries to present the original undecoded data elsewhere we have standard conformance and non-loss of data.
I totally sympathize with your goal of making the rack spec allow stuff you can theoretically do with HTTP, but I don't think partially encoded PATH_INFO will really help.
The app I'm working on relies on URL reconstruction. It also would benefit very much from being able to use a full URL as a path-component... but even though HTTP's escaping rules would allow that, it's pretty clear that its chance of working with actually deployed code is low.
But since I will never (famous last words) have more than a single URL in my path, anyway, I just dump it after the ? as the query info, which works fine:
This, btw, is how I found that the query info was being injected into ARGV... I was getting server 500 errors and rackup complaining that "http://some.site.com/calendars/events.ics" was not a valid configuration, because it was ARGV[0], and rackup was trying to open it as a config file.
On Mar 13, 6:13 pm, Sam Roberts <vieuxt...@gmail.com> wrote:
> > In that scenario, any middleware which alters PATH_INFO will also have
> > to be careful to make corresponding changes to rack.path_info. Another
> Probably, though it might depend on why the middleware is modifying
> the env. I'd think doing so would generally be a bad idea.
Rack::URLMap is the canonical example.
> > So the only characters which cannot appear unencoded are / ? # [ ]
> And %.
You (and Ryan and Christian) are right of course. It really has to be
one thing or the other.
Aside: since this is Ruby we're talking about, we're not limited to
just strings. For example, PATH_INFO could be defined to be an array
of path components. Probably doesn't make life easier for anyone
though, compared with just having the original path available.
On Mar 11, 12:49 pm, Christian Neukirchen <chneukirc...@gmail.com>
wrote:
> candlerb <b.cand...@pobox.com> writes:
> > Rack can specify whatever behaviour it likes, but the problem if we
> > say that handlers should *not* decode PATH_INFO is that in some cases
> > it may have already been done (e.g. when Rack is running as a CGI).
> When would it be useful to have it not decoded?
I just came across a practical example of this.
Apache Couchdb <http://couchdb.apache.org/> provides a HTTP API. The
first component of the path is the database name. You are allowed to
specify a database name which includes slashes, but they must be
encoded as %2F. e.g.
candlerb <b.cand...@pobox.com> writes:
> I just came across a practical example of this.
Since most webservers leave it with escapes and we have a patch to fix WEBrick to leave it escaped as well, I reverted 7a3d21f4b469d5ce; web frameworks now have to unescape it for themselves.