[Web-SIG] URL quoting in WSGI (or the lack thereof)

Ben Bangert

Jan 18, 2008, 9:02:38 PM
to web...@python.org
I unfortunately couldn't find anything in the WSGI spec to indicate
whether I could expect the environ variables relating to the URL to
be URL-decoded when I get them, or whether they reflect the raw URL
that was sent by the browser.

This recently became an issue, when a user noticed that the %2B URL
encoding for a + sign, had turned into a space when it hit their app.
Sure enough, Paste was doing URL un-quoting, then Routes, and the
double URL un-quote resulted in the + being a space.
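
A minimal sketch of how two decoding layers can turn %2B into a space.
It assumes the second layer uses form-style decoding (unquote_plus); the
thread does not say which functions Paste and Routes actually called.
The names come from Python 3's urllib.parse, and the sample path is
hypothetical.

```python
from urllib.parse import unquote, unquote_plus

raw = "/search/C%2B%2B"        # path as sent on the wire (hypothetical)

once = unquote(raw)            # first layer: %2B becomes a literal "+"
twice = unquote_plus(once)     # second layer: "+" is now read as a space

print(repr(once))
print(repr(twice))
```

A single decoding pass is lossless here; the damage only appears when
already-decoded text is decoded again.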

Is there some definitive word on whether a WSGI application should
expect to have it URL un-quoted or not?

Cheers,
Ben

Robert Brewer

Jan 18, 2008, 10:07:36 PM
to Ben Bangert, web...@python.org

The last time I asked that question here [1], Phillip kindly pointed out
to me that that's defined by the CGI spec. I could go through the agony
of distributed English-obfuscated BNF analysis again, but I'll just note
that I changed CP's wsgiserver to do decoding that very day. So I think
the answer is "yes".


Robert Brewer
fuma...@aminus.org

[1] http://mail.python.org/pipermail/web-sig/2006-August/002230.html

_______________________________________________
Web-SIG mailing list
Web...@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/python-web-sig-garchive-9074%40googlegroups.com

Luis Bruno

Jan 19, 2008, 9:38:09 AM
to web...@python.org
Hello y'all, delurking,

I'm using a /-delimited path, %-encoding each literal '/' appearing in
the path segments. I was not amused to see egg:Paste#http urldecoding
the whole PATH_INFO.

Ben Bangert wrote:
> This recently became an issue, when a user noticed that the %2B URL
> encoding for a + sign, had turned into a space when it hit their app.

A swift monkey-patch to
paste.httpserver.py:WSGIHandlerMixin.wsgi_setup() later, and
ORIGINAL_PATH_INFO is part of the WSGI spec in my world. The following
URL now Does The Right Thing:

http://127.0.0.1:5000/catalog/NEC/Computers/Laptops/LN500%2F9DW/


Robert Brewer wrote:
> I changed CP's wsgiserver to do decoding that very day. So I think the
> answer is "yes".

IMHO "yes" is the wrong answer; I am also very unsure about what is the
right answer. I have to walk [urldecode(segment) for segment in
ORIGINAL_PATH_INFO.split('/')]; this doesn't look like the Right Answer
to me anyway.
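
The difference between the two decoding orders can be sketched as
follows. The helper names are hypothetical, and the raw path is assumed
to be available undecoded (for example under the non-standard
ORIGINAL_PATH_INFO key described above).

```python
from urllib.parse import unquote

def decode_segments(raw_path):
    """Split an undecoded path on '/' first, then decode each segment."""
    return [unquote(segment) for segment in raw_path.strip("/").split("/")]

def decode_then_split(raw_path):
    """Decode the whole path first, then split: encoded slashes are lost."""
    return unquote(raw_path).strip("/").split("/")

raw = "/catalog/NEC/Computers/Laptops/LN500%2F9DW/"
print(decode_segments(raw))
print(decode_then_split(raw))
```

Only the first function keeps "LN500/9DW" as a single segment.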

--
Luís Bruno

Robert Brewer

Jan 19, 2008, 2:13:36 PM
to Luis Bruno, web...@python.org
Luis Bruno wrote:
> I'm using a /-delimited path, %-encoding each literal '/' appearing in
> the path segments. I was not amused to see egg:Paste#http urldecoding
> the whole PATH_INFO.

All HTTP URIs are /-delimited, and any '/' appearing in a single segment
that is not intended to participate in the hierarchy semantics must be
%-encoded before transmitting it over HTTP. I think that's what you're
saying above, but I don't understand why decoding on the server or
gateway is a problem. Perhaps you could expand on that: when you say
"I'm using", where is that? Inside a WSGI application?

> Ben Bangert wrote:
> > This recently became an issue, when a user noticed that the %2B URL
> > encoding for a + sign, had turned into a space when it hit their
> > app.
>
> A swift monkey-patch to
> paste.httpserver.py:WSGIHandlerMixin.wsgi_setup()
> later, and ORIGINAL_PATH_INFO is part of the WSGI spec in my world.
> The following URL now Does The Right Thing:
>
> http://127.0.0.1:5000/catalog/NEC/Computers/Laptops/LN500%2F9DW/

Platonic Capital Letters won't get you very far with this crowd. You
have to explain why you think the application should receive %XX encoded
URI's instead of decoded ones. What's the benefit? I only see a con:
every piece of middleware that cares has to repeat the decoding of
PATH_INFO and SCRIPT_NAME, wasting CPU and memory.

> Robert Brewer wrote:
> > I changed CP's wsgiserver to do decoding that very day.
> > So I think the answer is "yes".
>
> IMHO "yes" is the wrong answer

Why?

> I am also very unsure about what is the right answer.

According to [1], the right answer is "yes":

The PATH_INFO metavariable specifies a path to be interpreted
by the CGI script. It identifies the resource or sub-resource
to be returned by the CGI script, and it is derived from the
portion of the URI path following the script name but preceding
any query data. The syntax and semantics are similar to a
decoded HTTP URL 'path' token (defined in RFC 2396 [4]), with
the exception that a PATH_INFO of "/" represents a single void
path segment.


Robert Brewer
fuma...@aminus.org

[1] http://cgi-spec.golux.com/draft-coar-cgi-v11-03-clean.html#6.1.6

Luis Bruno

Jan 21, 2008, 6:06:27 AM
to web...@python.org
I'll top post my "solution"; scare quoted because I'm still not sure
this is the smartest idea:
environ['wsgiorg.path-segments'] = ['catalog', 'NEC', 'Computers',
'Laptop', 'LN500/9DW']
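
One way such a key could be filled in is a small piece of middleware.
This is only a sketch under stated assumptions: the class name is
hypothetical, 'wsgiorg.path-segments' is the key proposed in this
message and not part of any spec, and the server is assumed to supply
the undecoded request target under the non-standard REQUEST_URI key.

```python
from urllib.parse import unquote, urlsplit

class PathSegmentsMiddleware:
    """Store per-segment decoded path pieces under a custom environ key."""

    def __init__(self, app, key="wsgiorg.path-segments"):
        self.app = app
        self.key = key

    def __call__(self, environ, start_response):
        # REQUEST_URI is not required by WSGI; fall back to an empty list.
        raw = environ.get("REQUEST_URI")
        if raw is None:
            environ[self.key] = []
        else:
            raw_path = urlsplit(raw).path   # drops the query, does not decode
            environ[self.key] = [
                unquote(segment) for segment in raw_path.split("/") if segment
            ]
        return self.app(environ, start_response)
```

Note that this duplicates information already carried by SCRIPT_NAME and
PATH_INFO, so the two can disagree once a dispatcher starts moving
segments around.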

Robert Brewer wrote:
> All HTTP URI are /-delimited, and any '/' appearing in a single segment
> that is not intended to participate in the hierarchy semantics must be
> %-encoded before transmitting it over HTTP.

I wholeheartedly agree. And your explanation is clearer than mine.
>> IMHO [changing CP's wsgiserver to do decoding] is the wrong answer
> Why?
>
Because then I'm stuck monkey patching every WSGI server (and/or stuck
using my own URL dispatcher) so that I don't lose the information that
one of the forward slashes is NOT a path delimiter. You said that
%-encoding is meant for slashes not participating in hierarchy
semantics, if I read you correctly; so I think you'll agree with me on this.


> You have to explain why you think the application should receive %XX encoded
> URI's instead of decoded ones. What's the benefit? I only see a con:
> every piece of middleware that cares has to repeat the decoding of
> PATH_INFO and SCRIPT_NAME, wasting CPU and memory.
>

I was aware of this trade off, which is why I'm still not sure the
application should receive the %-encoded URIs. My app was forced to
split the URL on the '/' delimiters. If I can get the framework to do
that job while dispatching, so much the better. Hence the solution I top
posted. My problem arises when I output a link created from suitably
%-encoding these path segments:

'/'.join(['NEC', 'Computers', 'Laptop', 'LN500/9DW'])

And after the user clicks that link, the framework gives me (and Routes
has no way to avoid this when Paste is the one who's doing the whole
path decoding):

['NEC', 'Computers', 'Laptop', 'LN500', '9DW']

Think dispatching to a ``callable(*segments, **urlvariables)``. I think
we'll agree this is not what the app writer intended. And I'm out of
luck if the WSGI server/dispatcher is the one doing the urldecoding.
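
The round trip described above can be sketched like this, using Python
3's urllib.parse. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

segments = ["NEC", "Computers", "Laptop", "LN500/9DW"]

# safe="" makes quote() escape "/" as well; the default leaves it alone.
link = "/".join(quote(segment, safe="") for segment in segments)

# What a server that decodes the whole path hands to the dispatcher.
decoded_whole = unquote(link).split("/")

# Splitting first keeps the original four segments.
decoded_per_segment = [unquote(part) for part in link.split("/")]

print(link)
print(decoded_whole)
print(decoded_per_segment)
```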


> According to [1], the right answer is "yes":
>

I'll see your CGI draft and raise you the URI spec[2]. When you've read
the last sentence, you'll see how unoriginal the top posted solution was:
> 2.4.2. When to Escape and Unescape
>
> A URI is always in an "escaped" form, since escaping or unescaping a
> completed URI might change its semantics. Normally, the only time
> escape encodings can safely be made is when the URI is being created
> from its component parts; each component may have its own set of
> characters that are reserved, so only the mechanism responsible for
> generating or interpreting that component can determine whether or
> not escaping a character will change its semantics. Likewise, a URI
> must be separated into its components before the escaped characters
> within those components can be safely decoded.
[1] http://cgi-spec.golux.com/draft-coar-cgi-v11-03-clean.html#6.1.6
[2] <URL:http://www.ietf.org/rfc/rfc2396.txt>. There is a CGI
Informational RFC somewhere, which I've read diagonally coming here to
grumble.

--
Luís Bruno

Ian Bicking

Jan 20, 2008, 8:30:20 PM
to Luis Bruno, web...@python.org
Luis Bruno wrote:
> Hello y'all, delurking,
>
> I'm using a /-delimited path, %-encoding each literal '/' appearing in
> the path segments. I was not amused to see egg:Paste#http urldecoding
> the whole PATH_INFO.

Unfortunately this is in the WSGI spec, so it's not Paste#http so much
as WSGI that demands this.

I think in the CGI implementations this is kind of handled by
REQUEST_URI containing the quoted value. But relating REQUEST_URI with
SCRIPT_NAME/PATH_INFO is awkward and having the information in duplicate
places can lead to errors and unclear situations if they don't match up
properly.

> Ben Bangert wrote:
>> This recently became an issue, when a user noticed that the %2B URL
>> encoding for a + sign, had turned into a space when it hit their app.
> A swift monkey-patch to
> paste.httpserver.py:WSGIHandlerMixin.wsgi_setup() later, and
> ORIGINAL_PATH_INFO is part of the WSGI spec in my world. The following
> URL now Does The Right Thing:
>
> http://127.0.0.1:5000/catalog/NEC/Computers/Laptops/LN500%2F9DW/

It would be the Right Thing, except for not being WSGI. I made note of
this issue on the WSGI 2.0 ideas page, but I don't think anyone
(including myself) has proposed any good resolution. Diverging from CGI
and leaving PATH_INFO/SCRIPT_NAME quoted would work. But it's liable to
lead to bugs as it's a fairly subtle thing and for most applications the
semantics won't change and people won't realize their code is broken for
some corner case. I suppose we could remove SCRIPT_NAME and PATH_INFO
entirely and replace them with new keys.

Ian

Robert Brewer

Jan 21, 2008, 3:01:27 PM
to Luis Bruno, web...@python.org
Luis Bruno wrote:

> Robert Brewer wrote:
> > > IMHO [changing CP's wsgiserver to do decoding] is the wrong answer
> > Why?
> >
> Because then I'm stuck monkey patching every WSGI server (and/or stuck
> using my own URL dispatcher) so that I don't lose the information that
> one of the forward slashes is NOT a path delimiter. You said that
> %-encoding is meant for slashes not participating in hierarchy
> semantics, if I read you correctly; so I think you'll agree with me on
> this.

Ah. Now I see. We've had a test case for this since Nov 2005 [1]. FWIW,
CherryPy took the option of special-casing forward slashes; those are
the only characters which are *not* decoded--they are left as %2F
characters in SCRIPT_NAME and PATH_INFO [2]:

    # Unquote the path+params (e.g. "/this%20path" -> "this path").
    # http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2
    #
    # But note that "...a URI must be separated into its components
    # before the escaped characters within those components can be
    # safely decoded." http://www.ietf.org/rfc/rfc2396.txt, sec 2.4.2
    atoms = [unquote(x) for x in quoted_slash.split(path)]
    path = "%2F".join(atoms)
    environ["PATH_INFO"] = path

...and CherryPy then decodes these on the WSGI-app-side, only after the
dispatching step (to produce "virtual path" atoms) [3]:

    if func:
        # Decode any leftover %2F in the virtual_path atoms.
        vpath = [x.replace("%2F", "/") for x in vpath]
        request.handler = LateParamPageHandler(func, *vpath)
    else:
        request.handler = cherrypy.NotFound()

You're absolutely right; it would be nice to standardize a solution to
this. Of course, I'm going to propose we standardize *our* solution. ;)
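
The two excerpts above can be condensed into a standalone sketch. This
is not CherryPy's code: the function names are hypothetical, and the
definition of quoted_slash is an assumption, since the excerpt uses it
without showing it.

```python
import re
from urllib.parse import unquote

# Assumed definition; the quoted snippet uses quoted_slash without showing it.
quoted_slash = re.compile("%2F", re.IGNORECASE)

def server_side(raw_path):
    """Decode everything except encoded slashes, which stay as '%2F'."""
    atoms = [unquote(atom) for atom in quoted_slash.split(raw_path)]
    return "%2F".join(atoms)

def app_side(path_info):
    """After dispatch, split on real slashes and restore the encoded ones."""
    return [atom.replace("%2F", "/") for atom in path_info.strip("/").split("/")]

path_info = server_side("/catalog/this%20path/LN500%2f9DW/")
vpath = app_side(path_info)
print(path_info)
print(vpath)
```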

> I'll see your CGI draft and raise you the URI spec.

Heh. Quoted in the code comments above.


Robert Brewer
fuma...@aminus.org

[1] cf http://www.cherrypy.org/ticket/393
[2] http://www.cherrypy.org/browser/trunk/cherrypy/wsgiserver/__init__.py#L314
[3] http://www.cherrypy.org/browser/trunk/cherrypy/_cpdispatch.py#L71

Luis Bruno

Jan 22, 2008, 6:25:50 AM
to web...@python.org

Ian Bicking wrote:
> But relating REQUEST_URI with SCRIPT_NAME/PATH_INFO is awkward and
> having the information in duplicate places can lead to errors and
> unclear situations if they don't match up properly.

True, and you can apply the same reasoning to my suggestion too.

Apart from the duplication of information, there's how or where to do
the actual decoding. Not everyone is dispatching to a CherryPy-style
tree of objects, so putting a %-decoded list of path segments in an
environ key doesn't work -- I knew it was a bad idea! I'm going with
CherryPy on this: don't decode "%2F". Should other characters be kept
encoded?

Also, this crystallizes my thoughts on the matter: %-decoding is the
applications' job. Or frameworks'. *Not* the servers'.


> Luis Bruno wrote:
>> I was not amused to see egg:Paste#http urldecoding the whole PATH_INFO.
> Unfortunately this is in the WSGI spec, so it's not Paste#http so much
> as WSGI that demands this.

Cite?

I skimmed PEP 333 before grumbling and I've just re-read it; didn't find
it, unless you're referring to the code in "URL Reconstruction" section.
If you're referring[*] to the CGI 1.1 draft linked in "environ
Variables", I think it supports my position that unquoting(PATH_INFO)
was not the correct thing to do.

[*] I'm not sure how to spell that.


> I made note of this issue on the WSGI 2.0 ideas page

Didn't find it here: <URL:http://wsgi.org/wsgi/WSGI_2.0>. Should I look
elsewhere?


> [/Laptops/LN500%2F9DW/ ] would be the Right Thing, except for not
> being WSGI.
Looks to me like a good candidate for an amendment.


What's the next step?
--
Luís Bruno

Sven Berkvens-Matthijsse

Jan 22, 2008, 6:47:47 AM
to Luis Bruno, web...@python.org
Luís Bruno wrote:
> Ian Bicking wrote:
> > But relating REQUEST_URI with SCRIPT_NAME/PATH_INFO is awkward and
> > having the information in duplicate places can lead to errors and
> > unclear situations if they don't match up properly.
>
> True, and you can apply the same reasoning to my suggestion too.
>
> Apart from the duplication of information, there's how or where to
> do the actual decoding. Not everyone is dispatching to a
> CherryPy-style tree of objects, so putting a %-decoded list of path
> segments in a environ key doesn't work -- I knew it was a bad idea!
> I'm going with CherryPy's on this: don't decode "%2F". Should other
> characters be kept encoded?

Yes, in my opinion all encoded characters should remain encoded.
Otherwise, a path like /whatever/some%252Fthing/blah/ would become
(after decoding): /whatever/some%2Fthing/blah/ which is certainly not
what you'd want and/or expect.
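
A short illustration of that example, using Python 3's urllib.parse. One
decoding pass yields text that still looks encoded, so any later layer
that decodes again turns the literal "%2F" into a slash.

```python
from urllib.parse import unquote

raw = "/whatever/some%252Fthing/blah/"

decoded_once = unquote(raw)             # "%25" becomes "%"
decoded_twice = unquote(decoded_once)   # the leftover "%2F" becomes "/"

print(decoded_once)
print(decoded_twice)
```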

> Also, this crystallizes my thoughts on the matter: %-decoding is the
> applications' job. Or frameworks'. *Not* the servers'.

I absolutely agree on this. The application is the only entity that
knows how to interpret the (remainder of the) URI properly.

> --
> Luís Bruno

--
het internet begint bij ilse tel: 040 219 32 00
Sven Berkvens-Matthijsse fax: 040 219 32 99
sv...@ilse.net url: http://ilse.nl/

Luis Bruno

Jan 22, 2008, 11:29:09 AM
to web...@python.org
James Y Knight wrote:
> FWIW, I think the right thing for a server to do is to reject any URLs
> going to a wsgi (or cgi) script with a %2F in it. I believe this is
> what apache's CGI host does.
You'd reject the following URL?
http://localhost:5000/catalog/NEC/Laptops/LN500%2F9DW/

BTW, I make a beautiful breadcrumb trail out of that:
Home > Catalog > NEC > Laptops > *LN500/9DW*

> BTW, for extra fun, you should be considering ";" too.
True. The urlparse/urlsplit docs mention ';' but I don't understand
where/how it's used.
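
For reference, a sketch of how the two standard-library functions treat
';' in Python 3's urllib.parse. The URL and the "view=list" parameter
are made up for illustration.

```python
from urllib.parse import urlparse, urlsplit

url = "http://localhost:5000/catalog/NEC;view=list?page=2"

parsed = urlparse(url)   # splits ";params" off the last path segment
split = urlsplit(url)    # leaves ";params" inside the path

print(parsed.path, parsed.params, parsed.query)
print(split.path, split.query)
```

So ';' is another delimiter with its own meaning inside the path, which
is why decoding an encoded %3B too early can raise the same kind of
ambiguity as %2F.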

James Y Knight

Jan 22, 2008, 11:02:22 AM
to Sven Berkvens-Matthijsse, web...@python.org

On Jan 22, 2008, at 6:47 AM, Sven Berkvens-Matthijsse wrote:

> Luís Bruno wrote:
>> Ian Bicking wrote:
>>> But relating REQUEST_URI with SCRIPT_NAME/PATH_INFO is awkward and
>>> having the information in duplicate places can lead to errors and
>>> unclear situations if they don't match up properly.
>>
>> True, and you can apply the same reasoning to my suggestion too.
>>
>> Apart from the duplication of information, there's how or where to
>> do the actual decoding. Not everyone is dispatching to a
>> CherryPy-style tree of objects, so putting a %-decoded list of path
>> segments in a environ key doesn't work -- I knew it was a bad idea!
>> I'm going with CherryPy's on this: don't decode "%2F". Should other
>> characters be kept encoded?
>
> Yes, in my opinion all encoded character should remain encoded.
> Otherwise, a path like /whatever/some%252Fthing/blah/ would become
> (after decoding): /whatever/some%2Fthing/blah/ which is certainly not
> what you'd want and/or expect.

Your opinion is irrelevant; this is specified by the CGI spec. Yes,
agreed, it's not the best spec ever, but there's nothing you can do
about that. FWIW, I think the right thing for a server to do is to
reject any URLs going to a wsgi (or cgi) script with a %2F in it. I
believe this is what Apache's CGI host does.

BTW, for extra fun, you should be considering ";" too.

James

Brian Smith

Jan 22, 2008, 11:44:43 AM
to Web SIG
Luis Bruno wrote:
> Ian Bicking wrote:
> > But relating REQUEST_URI with SCRIPT_NAME/PATH_INFO is awkward and
> > having the information in duplicate places can lead to errors and
> > unclear situations if they don't match up properly.

I don't understand this argument. WSGI gateways just need to parse the
request URL correctly, and then everything *will* match up correctly,
AFAICT. Providing an undecoded REQUEST_URI that an application can parse
on its own is much better than what CherryPy is doing, and it is useful
for other reasons as well.

> I'm going with CherryPy's on this: don't decode "%2F".

CherryPy is not implementing the WSGI 1.0 specification correctly. And,
CherryPy's behavior here is harmful, because applications have no way of
knowing whether "%2F" is an un-decoded slash, or a literal "%2F".
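
The ambiguity can be shown with a sketch of the keep-%2F scheme. The
helper and the regular expression are assumptions written for this
illustration, not CherryPy's actual code.

```python
import re
from urllib.parse import unquote

quoted_slash = re.compile("%2F", re.IGNORECASE)  # assumed pattern

def keep_encoded_slash(raw_path):
    """Decode a raw path but leave encoded slashes as '%2F'."""
    return "%2F".join(unquote(atom) for atom in quoted_slash.split(raw_path))

encoded_slash = keep_encoded_slash("/files/a%2Fb")    # segment meant as "a/b"
literal_text = keep_encoded_slash("/files/a%252Fb")   # segment meant as "a%2Fb"

print(encoded_slash)
print(literal_text)
```

Two different request paths produce the same PATH_INFO, so the
application cannot tell which one was sent.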

> > Luis Bruno wrote:
> >> I was not amused to see egg:Paste#http urldecoding the
> >> whole PATH_INFO.
> > Unfortunately this is in the WSGI spec, so it's not
> > Paste#http so much as WSGI that demands this.
>

> I skimmed PEP 333 before grumbling and I've just re-read it;
> didn't find it, unless you're referring to the code in "URL
> Reconstruction" section.
> If you're referring[*] to the CGI 1.1 draft linked in "environ
> Variables", I think it supports my position that unquoting(PATH_INFO)
> was not the correct thing to do.

PEP 333 defers the definition of PATH_INFO to the CGI specification:
"The environ dictionary is required to contain these CGI environment
variables, as defined by the Common Gateway Interface specification
[2]". That version of the CGI specification clearly expects PATH_INFO to
be decoded. Section 3.2 says "'enc-path-info' is a URL-encoded version
of PATH_INFO". The implication is that PATH_INFO is *not* URL-encoded.
Section 6.1.6 is more explicit, saying: "The syntax and semantics are
similar to a decoded HTTP URL 'path' token (defined in RFC 2396 [4]),
with the exception that a PATH_INFO of "/" represents a single void path
segment."

Furthermore, the URL reconstruction section and the CGI WSGI gateway
both also imply that PATH_INFO has already been decoded.
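
A simplified sketch of why the URL reconstruction recipe implies decoded
values: it re-quotes SCRIPT_NAME and PATH_INFO, which only makes sense
if they arrive unquoted. The function name and the sample environ are
hypothetical, and the scheme, host and query handling of the full recipe
are left out.

```python
from urllib.parse import quote

def reconstruct_path(environ):
    """Re-quote SCRIPT_NAME and PATH_INFO to rebuild the path part of a URL."""
    return quote(environ.get("SCRIPT_NAME", "")) + quote(environ.get("PATH_INFO", ""))

environ = {"SCRIPT_NAME": "/catalog", "PATH_INFO": "/NEC/my laptop/LN500/9DW/"}
print(reconstruct_path(environ))
```

The space is re-encoded, but a slash that was originally sent as %2F
stays a plain slash: quote() cannot know which slashes were escaped.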

> > [/Laptops/LN500%2F9DW/ ] would be the Right Thing, except for not
> > being WSGI.
> Looks to me like a good candidate for an amendment.
>
> What's the next step?

Something so fundamental as this cannot be changed with a simple
amendment to the existing specification. Such a change would break
currently-conforming gateways and applications. An amendment that
recommends, but does not require, REQUEST_URI is a much better option.

- Brian

Luis Bruno

Jan 22, 2008, 1:02:19 PM
to web...@python.org
Brian Smith wrote:
> An amendment that recommends, but does not require, REQUEST_URI is a
> much better option.

Thereby forcing me to shop around for a WSGI server that actually puts
the recommendation into practice? Because I want to keep my %-encoded
characters? Which I encoded for, you know, escaping them from the usual
processing? Smells of mistake.

This sub-thread starts with me putting an ORIGINAL_PATH_INFO into the
environ, which the dispatch code doesn't touch. This forces me to strip
the app mount points, reinventing Paste#urlmap. Should REQUEST_URI be
touched by dispatch code? If so, PATH_INFO has no use. If not, the
duplication Ian Bicking mentioned comes into play.

> That version of the CGI specification clearly expects PATH_INFO to be decoded.

I agree; I think you should refer to the top of page 14 in RFC 3875,
instead of to the 1999 draft. The draft didn't outright forbid multiple
path-segments like the RFC does, but was ambiguous enough (your quote):

> Section 6.1.6 is more explicit, saying: "The syntax and semantics are
> similar to a decoded HTTP URL 'path' token (defined in RFC 2396 [4])
>

Don't forget to read the %-decoding rules in RFC 2396's section 2.4.2 if
you're going to quote "decoded HTTP URL 'path' token".

Fortunately, the URI spec doesn't repeat the mistake of forbidding
%-encoding characters. It does mention that each path-segment should be
separately %-decoded, going against the CGI spec which actually forbids
multiple segments *in PATH_INFO*. That smells of mistake. Faced with the
choice between those specs, I'd prefer not to lose information for
mindless compliance with CGI.


--
Luís Bruno

Brian Smith

Jan 22, 2008, 1:34:24 PM
to Web SIG
Luis Bruno wrote:
> Brian Smith wrote:
> > An amendment that recommends, but does not require,
> > REQUEST_URI is a much better option.
>
> Thereby forcing me to shop around for a WSGI server that
> actually puts the recommendation into practice? Because I
> want to keep my %-encoded characters? Which I encoded for,
> you know, escaping them from the usual processing? Smells of
> mistake.

You already have to shop around for a WSGI server that can distinguish
between encoded and unencoded slashes in PATH_INFO, because the WSGI
specification doesn't require the WSGI gateway to distinguish between
them.

I agree that the WSGI 1.0 specification is not good in this regard.
However, because an application cannot detect whether PATH_INFO has been
decoded or not, the only reasonable thing that it can do is to assume
that the gateway and middleware are following the WSGI specification.
The corollary is that an application shouldn't rely on being able to
distinguish between "%2F" and "/" based on PATH_INFO if it wants to be
portable.

If you really want PATH_INFO to have "%2F" instead of "/", then I
suggest encoding the slashes as "%252F" or "$2F" or something else. Then
your application will be portable.
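
The "%252F" suggestion amounts to escaping each segment twice when
building links, so that one decoding pass by the gateway still leaves an
escaped slash for the application. A sketch, with hypothetical helper
names:

```python
from urllib.parse import quote, unquote

def build_link(segments):
    """Escape each segment twice so one server-side decode leaves '%2F' intact."""
    return "/" + "/".join(quote(quote(s, safe=""), safe="") for s in segments)

def read_path_info(path_info):
    """Split the once-decoded PATH_INFO, then undo the remaining escape layer."""
    return [unquote(part) for part in path_info.strip("/").split("/")]

link = build_link(["Laptop", "LN500/9DW"])
path_info = unquote(link)            # what a spec-following gateway delivers
segments = read_path_info(path_info)

print(link)
print(path_info)
print(segments)
```

The cost is that every link producer and every consumer must agree on
the extra layer, and the URLs themselves become harder to read.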

> This sub-thread starts with me putting an ORIGINAL_PATH_INFO
> into the environ, which the dispatch code doesn't touch. This
> forces me to strip the app mount points, reinventing
> Paste#urlmap. Should REQUEST_URI be touched by dispatch code?
> If so, PATH_INFO has no use. If not, the duplication Ian
> Bicking mentioned comes into play.

By definition, the Request URI doesn't change during a request. So,
REQUEST_URI shouldn't be fiddled with by dispatching code, unlike
SCRIPT_NAME and PATH_INFO. Usually, the dispatching code is just
shifting segments of PATH_INFO into SCRIPT_NAME, but SCRIPT_NAME joined
with PATH_INFO and the QUERY_STRING is always constant. So, the problems
with ORIGINAL_PATH_INFO don't apply to REQUEST_URI.
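
The shifting described here can be sketched with wsgiref.util from the
standard library. The environ contents are made up, and REQUEST_URI is a
non-standard key that the helper simply never touches.

```python
from wsgiref.util import shift_path_info

environ = {
    "REQUEST_URI": "/catalog/NEC/Laptops/",   # non-standard key, left untouched
    "SCRIPT_NAME": "",
    "PATH_INFO": "/catalog/NEC/Laptops/",
}

before = environ["SCRIPT_NAME"] + environ["PATH_INFO"]
name = shift_path_info(environ)               # moves "catalog" across
after = environ["SCRIPT_NAME"] + environ["PATH_INFO"]

print(name, environ["SCRIPT_NAME"], environ["PATH_INFO"])
```

The concatenation of SCRIPT_NAME and PATH_INFO is the same before and
after the shift in this simple case; only the boundary between them
moves.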

> > That version of the CGI specification clearly expects
> > PATH_INFO to be decoded.
>
> I agree; I think you should refer to the top of page 14 in
> RFC 3875, instead of to the 1999 draft. The draft didn't
> outright forbid multiple path-segments like the RFC does, but
> was ambiguous enough (your quote):

PEP 333 defers the definition of PATH_INFO to the 1999 draft, not to RFC
3875. So, it doesn't matter what RFC 3875 says.

> Fortunately, the URI spec doesn't repeat the mistake of
> forbidding %-encoding characters. It does mention that each
> path-segment should be separately %-decoded, going against
> the CGI spec which actually forbids multiple segments *in
> PATH_INFO*. That smells of mistake. Faced with the choice
> between those specs, I'd prefer not to lose information for
> mindless compliance with CGI.

I don't care about CGI compatibility. I do depend on WSGI gateways being
compliant with the WSGI specification.

- Brian

Luis Bruno

Jan 22, 2008, 2:04:38 PM
to web...@python.org
Brian Smith wrote:
> If you really want PATH_INFO to have "%2F" instead of "/", then I
> suggest encoding the slashes as "%252F" or "$2F" or something else.
> Then your application will be portable.

I need those '/'. They are the canonical hierarchical delimiters. They
are also present in some model names. So yeah, "$2F" might work. I was
originally using "!" which isn't used in any model name on my catalog.
Please don't read acquiescence into the previous phrase; thinking of
escaping escape-chars reeks of stupidity: I can't show this off to my
programmer boss, and expect him to quietly accept my judgment without
a serious amount of explanation.


> PEP 333 defers the definition of PATH_INFO to the 1999 draft
>

True. Please keep in mind that the CGI draft also references the URI
syntax spec, which I'll read as supporting my position.


> I do depend on WSGI gateways being compliant with the WSGI specification.
>

We all do, which is why I'm here wasting electrons and everyone's time.


Thank you,
--
Luís Bruno

James Y Knight

Jan 22, 2008, 2:22:07 PM
to Web SIG

On Jan 22, 2008, at 1:02 PM, Luis Bruno wrote:

>
> Fortunately, the URI spec doesn't repeat the mistake of forbidding
> %-encoding characters. It does mention that each path-segment should
> be separately %-decoded, going against the CGI spec which actually
> forbids multiple segments *in PATH_INFO*. That smells of mistake.
> Faced with the choice between those specs, I'd prefer not to lose
> information for mindless compliance with CGI.
>

Where does the CGI spec forbid multiple segments in PATH_INFO? It
doesn't. It actually says that PATH_INFO is made by joining each
decoded path-segment with a /. And as far as I know /every/ extant
implementation does this. And the high quality ones forbid a / from
appearing in the decoded segment (aka, from a %2F in the original
url), in order to avoid security issues.

So I'm not sure what this thread is about. You can argue that the CGI
spec has a bug in it, but it's not like this is a new issue or
something, and it's shared by every system based on CGI. (PHP for
example has the same issue).

Besides, the workaround is quite simple: don't use %2F characters in
your urls.

James

Luis Bruno

Jan 22, 2008, 5:33:57 PM
to web...@python.org
Ian Bicking pointed at CGI 1.1 saying: "See? The WSGI spec tells me to
do this!" And he's right. This sub-thread is about *me* thinking the
*WSGI spec* should be *fixed*.


James Y Knight wrote:
> Where does the CGI spec forbid multiple segments in PATH_INFO?
> It doesn't. It actually says that PATH_INFO is made by joining each
> decoded path-segment with a /.

My fault. I misread this:

The server MAY reject the request with an error if it encounters
any values considered objectionable. That MAY include any requests
that would result in an encoded "/" being decoded into PATH_INFO, as
this might represent a loss of information to the script.

Still, my problem is that "loss of information"; I no longer know
which '/' were %-encoded.


> And as far as I know /every/ extant implementation does this.

As does Paste#http. My fault for not reading correctly.


> Besides, the workaround is quite simple: don't use %2F characters in your urls.

Should I use $2F? I already *have* an escaping mechanism... which I'm
using for spaces, BTW. Why can't I use it for slashes? I came to
web-sig@ to fix the spec, not to find a workaround. I already *have* a
workaround: it starts with me monkeying around Paste#http and rolling
my own dispatcher. Not too bright though, as I could have slapped a
$2F in there for a quick workaround (thank you Brian).

A quick sanity check here: I think
http://host/catalog/some%2Fthing/shallow/ is *meant* to have two
nested levels: "some/thing" and "shallow". Is it obvious to you to
interpret the URL as having three nested levels "some", "thing" and
"shallow"? I ask because the first choice is very obvious to me; I'm
treating the second one (current behaviour) as a bug to be fixed.


Does anyone else think it's a bug in WSGI too?
--
Luis Bruno

James Y Knight

Jan 22, 2008, 6:21:59 PM
to Web SIG
On Jan 22, 2008, at 5:33 PM, Luis Bruno wrote:
> A quick sanity check here: I think
> http://host/catalog/some%2Fthing/shallow/ is *meant* to have two
> nested levels: "some/thing" and "shallow". Is it obvious to you to
> interpret the URL as having three nested levels "some", "thing" and
> "shallow"? I ask because the first choice is very obvious to me; I'm
> treating the second one (current behaviour) as a bug to be fixed.

You're right, it certainly shouldn't be interpreted as the same URL as
some/thing/shallow. That is most likely an avenue for a security
exploit if your server does so, and the server should likely be fixed.
However, as there is simply no way to represent "some%2Fthing/shallow/"
with PATH_INFO, as specified in the CGI spec, the only alternative is
to reject the request. This is what the major servers do today.
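
A gateway that takes the "reject" route could check the raw path before
decoding it. This is a sketch only: the function name is hypothetical,
the choice of a 404 status is an assumption, and it is not modelled on
any particular server's implementation.

```python
import re

ENCODED_SLASH = re.compile("%2F", re.IGNORECASE)

def check_raw_path(raw_path):
    """Return an HTTP status line: refuse paths carrying an encoded slash."""
    if ENCODED_SLASH.search(raw_path):
        return "404 Not Found"
    return "200 OK"

print(check_raw_path("/catalog/some%2Fthing/shallow/"))
print(check_raw_path("/catalog/some/thing/shallow/"))
```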

> Anyone else thinks it's a bug in WSGI too?


WSGI is based upon CGI and inherits this behavior. I suppose a
WSGI-specific fix could be done. However, there are good reasons for
inheriting behavior from CGI, most importantly, ease of integration.
Servers already implement this behavior for CGI, SCGI, FastCGI, PHP, and
now, WSGI. None of the previous have seen it as important enough an
issue to change this behavior, and neither do I think it important
enough for WSGI.

So, no, I don't consider it a bug in WSGI. You could call it a bug in
CGI if you like. Good luck getting it changed.

James

Robert Brewer

Jan 23, 2008, 12:15:58 PM
to James Y Knight, Web SIG
James Y Knight wrote:
> ...as there is simply no way to represent "some%2Fthing/shallow/"
> with PATH_INFO, as specified in the CGI spec, the only alternative is
> to reject the request. This is what the major servers do today.
>
> > Anyone else thinks it's a bug in WSGI too?
>
> WSGI is based upon CGI and inherits this behavior. I suppose a
> WSGI-specific fix could be done. However, there are good reasons for
> inheriting behavior from CGI, most importantly, ease of integration.
> Servers already implement this behavior for CGI SCGI FastCGI PHP, and
> now, WSGI. None of the previous have seen it as important enough an
> issue to change this behavior, and neither do I think it important
> enough for WSGI.
>
> So, no, I don't consider it a bug in WSGI. You could call it a bug in
> CGI if you like. Good luck getting it changed.

I consider it a bug in both, and the difficulty level of changing the
CGI behavior really has no bearing on our decision to do better with
WSGI. I think it's important that we allow the full range of URI's to be
accepted. If you go and stick Apache in front of your WSGI app, it will
still 404, sure; but that's your choice to use Apache or not. There's no
sense making WSGI a least common denominator, inheriting all the
limitations of all the existing web servers.


Robert Brewer
fuma...@aminus.org

Phillip J. Eby

Jan 23, 2008, 1:18:38 PM
to Robert Brewer, James Y Knight, Web SIG
At 09:15 AM 1/23/2008 -0800, Robert Brewer wrote:
>I consider it a bug in both, and the difficulty level of changing the
>CGI behavior really has no bearing on our decision to do better with
>WSGI. I think it's important that we allow the full range of URI's to be
>accepted. If you go and stick Apache in front of your WSGI app, it will
>still 404, sure; but that's your choice to use Apache or not. There's no
>sense making WSGI a least common denominator, inheriting all the
>limitations of all the existing web servers.

Uh, actually, that's sort of the whole point of WSGI - to allow
portable applications. If the spec allows you to do something in
theory that's almost never allowed in practice, that's not very helpful.

I don't consider WSGI's CGI compatibility on this point to be an
error, in other words. An application that expects to receive
encoded URLs is going to be *very* limited in its deployment choices,
and needs to find its own way of dealing with this.

MoinMoin, for example, has its own encoding scheme for handling
pseudo-slashes in paths, and IMO it's a better way to handle it than
trying to rely on finding a server that supports *not* decoding URLs.

Ian Bicking

Jan 24, 2008, 1:12:27 AM
to Luis Bruno, web...@python.org
Luis Bruno wrote:
>> I made note of this issue on the WSGI 2.0 ideas page
> Didn't find it here: <URL:http://wsgi.org/wsgi/WSGI_2.0>. Should I look
> elsewhere?

I thought I had added it there, but wrote that when I was offline and
couldn't check. I added a section about it (a very brief section,
though; probably a link to this thread would be helpful).

Ian

Ian Bicking

Jan 24, 2008, 1:22:06 AM
to Phillip J. Eby, Web SIG
Phillip J. Eby wrote:
> At 09:15 AM 1/23/2008 -0800, Robert Brewer wrote:
>> I consider it a bug in both, and the difficulty level of changing the
>> CGI behavior really has no bearing on our decision to do better with
>> WSGI. I think it's important that we allow the full range of URI's to be
>> accepted. If you go and stick Apache in front of your WSGI app, it will
>> still 404, sure; but that's your choice to use Apache or not. There's no
>> sense making WSGI a least common denominator, inheriting all the
>> limitations of all the existing web servers.
>
> Uh, actually, that's sort of the whole point of WSGI - to allow
> portable applications. If the spec allows you to do something in
> theory that's almost never allowed in practice, that's not very helpful.

It could probably work in a good number of implementations, but because
some gateways could lose or reject the encoding, the deployment becomes
kind of fragile.

Of course you could argue the same thing about SCRIPT_NAME -- it's
constantly getting lost and makes deployments seem fragile at times.
But in contrast to this issue, it's actually quite useful;
distinguishing %2f and / is more of a corner case.

> MoinMoin, for example, has its own encoding scheme for handling
> pseudo-slashes in paths, and IMO it's a better way to handle it than
> trying to rely on finding a server that supports *not* decoding URLs.

We encountered it with GData too, as it uses URLs like
/{http:%2f%2fexample.com}term/. But if you balance the {}'s you can
parse it out.
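
A minimal sketch of that brace-balancing idea, in Python 3. The function name is hypothetical and nothing here comes from GData or Paste; it only assumes the server has already decoded %2f to '/' and that the braces arrive balanced.

```python
def split_path_with_braces(path_info):
    """Split an already-decoded path on '/', except inside balanced {...}."""
    segments, current, depth = [], [], 0
    for ch in path_info:
        if ch == "{":
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
        if ch == "/" and depth == 0:
            # A slash outside braces is a real hierarchy delimiter.
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    segments.append("".join(current))
    # Drop the empty strings produced by leading and trailing slashes.
    return [s for s in segments if s]


# PATH_INFO as the application would see it after the server's unquoting.
decoded = "/{http://example.com}term/"
print(split_path_with_braces(decoded))
```

This only works because the braces mark where the quoted part starts and ends; a path without such markers gives the parser nothing to go on.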

Ian

Brian Smith

Jan 24, 2008, 1:04:16 PM
to Web SIG
Ian Bicking wrote:

> We encountered it with GData too, as it uses URLs like
> /{http:%2f%2fexample.com}term/. But if you balance the {}'s
> you can parse it out.

Unquoted curly braces are illegal in any kind of URI or IRI. Does GData
really require them to be unquoted?

- Brian

Ian Bicking

Jan 24, 2008, 2:54:34 PM
to Brian Smith, Web SIG
Brian Smith wrote:
> Ian Bicking wrote:
>
>> We encountered it with GData too, as it uses URLs like
>> /{http:%2f%2fexample.com}term/. But if you balance the {}'s
>> you can parse it out.
>
> Unquoted curly braces are illegal in any kind of URI or IRI. Does GData
> really require them to be unquoted?

No, quoted is fine. Of course parsing PATH_INFO I couldn't tell anyway ;)

Ian

Luis Bruno

Jan 27, 2008, 7:56:29 AM
to web...@python.org
Hello, it's me again,


Phillip J. Eby wrote:
> MoinMoin, for example, has its own encoding scheme for handling
> pseudo-slashes in paths, and IMO it's a better way to handle it than
> trying to rely on finding a server that supports *not* decoding URLs.

I had the abstract knowledge that CGI is still used for deployment, but
growing up with application servers must have spoiled me. Still, I think
nothing stops mod_wsgi passing an encoded URL down to my apps but for
adherence to the CGI spec. I've never checked it, nor the ajp + flup
combination. Something more for the todo pile.

In the short run I'll $2F my slashes. I can't actually use %252F,
because everyone seems to think they'll either get an encoded URL to
unquote() or that unquote(unquote()) is a no-op: Routes was not alone in
this.
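
To make the double-decoding problem concrete, here is a small sketch using Python 3's urllib.parse (the 2008-era equivalent would be urllib.unquote). The example paths are made up, and the use of unquote_plus for the second pass is an assumption about how a '+' could end up as a space, not a statement about what Paste or Routes actually called.

```python
from urllib.parse import unquote, unquote_plus

# A doubly-encoded slash: "%25" is an encoded "%", so "%252F" decodes to "%2F".
raw = "/wiki/some%252Fthing"
once = unquote(raw)      # what one layer of decoding yields
twice = unquote(once)    # what a second, unexpected layer yields
print(once)
print(twice)

# Decoding is not idempotent, so a second pass changes the data.
assert once != twice

# The "+" case from the start of the thread, assuming the second layer
# applies form-style decoding where "+" means space.
plus_raw = "/q/a%2Bb"
print(unquote_plus(unquote(plus_raw)))
```

After the second pass the escaped slash is indistinguishable from a path delimiter, which is why %252F is no safer than %2F when more than one layer unquotes.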

Blake Winton wrote:
> I respectfully disagree. I've been using %-escapes in urls for years,
> intending that they get unescaped before being passed to
> applications... %7E instead of ~ mainly.
>
> in XML you can't tell the difference between <![CDATA[<]]> and &lt;
> and &#60

You've given an example of separate ways to escape the same '<'
character, and I agree that you shouldn't have to distinguish between
them. But XML does treat '<' differently from '&lt;': if you just want
to write a '<' instead of starting a tag, you need to escape it.

I don't want my SAX code[*] to deal with all the different ways to write
a literal '<'. But I expect a "<tag" to generate a start_tag event, and
"&lt;43" to be decoded into '<' in some element's text property, *not*
to generate a start_43 event.
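
A rough sketch of that expectation using the standard library's xml.sax; the Recorder class and the sample document are hypothetical. The stdlib handler methods are startElement and characters rather than the start_tag event named above.

```python
import xml.sax


class Recorder(xml.sax.ContentHandler):
    """Collects element names and character data as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def startElement(self, name, attrs):
        self.tags.append(name)

    def characters(self, content):
        # The parser may deliver text in several chunks; join them later.
        self.text.append(content)


def parse(document):
    handler = Recorder()
    xml.sax.parseString(document, handler)
    return handler.tags, "".join(handler.text)


# Three different spellings of a literal '<' inside one real element.
tags, text = parse(b"<p>&lt;43 <![CDATA[<]]> &#60;</p>")
print(tags)
print(text)
```

All three escapes come out as the same character in the text, while only the unescaped "<p" produces an element event.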

I think the same reasoning applies to '/'. Would it apply to '~' and ';'
too?


[*] I've never actually written SAX-structured code; please pardon any
mistaeks.


> in urls I would expect the url parser to unescape things, and pass you
> the unescaped data.

Yeah, me too. I just don't want to lose information: "this was a literal
slash, not a hierarchy delimiter". But if the framework splits on the
real slashes and *then* unquotes each segment, I'd be happy to get that
list of unquoted segments. This way, my URLs use the obvious way to
escape slashes and by the time it gets to my code I have unescaped data.
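
As a sketch of the split-then-unquote order, in Python 3 (the function names are hypothetical and the path is the example quoted earlier in the thread):

```python
from urllib.parse import unquote


def path_segments(raw_path):
    """Split the still-encoded path on real slashes, then unquote each segment."""
    # Empty segments from leading/trailing slashes are dropped in this sketch.
    return [unquote(segment) for segment in raw_path.split("/") if segment]


def naive_segments(raw_path):
    """Unquote first, then split: the escaped slash becomes a delimiter."""
    return [segment for segment in unquote(raw_path).split("/") if segment]


raw = "/some%2Fthing/shallow/"
print(path_segments(raw))
print(naive_segments(raw))
```

The first form keeps "some/thing" as one segment; the second yields three segments, which is what an application sees when it only receives a decoded PATH_INFO.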

This could be "dealt with" by using a REQUEST_URI instead. But then I
have to manually trim the components that URL dispatching moved into
SCRIPT_NAME. And I don't actually *have* a REQUEST_URI in the environ.
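
For servers that do provide it, the trimming might look like the sketch below. REQUEST_URI is not a key the WSGI spec guarantees, raw_path_info is a hypothetical helper, and the segment counting assumes no SCRIPT_NAME segment contained an encoded slash.

```python
from urllib.parse import urlsplit


def raw_path_info(environ):
    """Return the still-encoded part of REQUEST_URI that follows SCRIPT_NAME."""
    # Drop the query string; keep the path exactly as the client sent it.
    raw_path = urlsplit(environ["REQUEST_URI"]).path
    # SCRIPT_NAME is "" or "/a/b" with no trailing slash, so counting
    # slashes gives the number of segments dispatch has already consumed.
    consumed = environ.get("SCRIPT_NAME", "").count("/")
    parts = raw_path.split("/")          # parts[0] is "" for an absolute path
    remaining = parts[1 + consumed:]
    return "/" + "/".join(remaining) if remaining else ""


environ = {
    "REQUEST_URI": "/app/some%2Fthing/shallow/?x=1",
    "SCRIPT_NAME": "/app",
}
print(raw_path_info(environ))
```

Every layer that moves segments into SCRIPT_NAME has to keep that bookkeeping correct for this to work, which is part of why it feels manual.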

Ian Bicking wrote:
> distinguishing %2f and / is more of a corner case

I'll call it a canary in the URL mine. Should you have to balance '{'
and '}' to find the quoted namespaces for GData terms? I haven't touched
GData, but .split('/') and *then* unquoting looks like exactly what's
needed in that case.


Thank you,
