Python 2.6 and migration warnings flag for Python 3.0.


Graham Dumpleton

Sep 17, 2008, 7:48:10 PM
to mod...@googlegroups.com
FYI.

In subversion trunk of mod_wsgi for version 3.0, if you are using
Python 2.6 you can now set:

WSGIPy3kWarningFlag On

The effect should hopefully be the same as the -3 option to the 'python' executable.

Note that I haven't got Python 2.6 installed, so I haven't actually tested it.

Graham

Brian Smith

Sep 18, 2008, 11:28:49 AM
to mod...@googlegroups.com

What specification are you using for encoding/decoding strings in HTTP
headers? I understand that there is no PEP for it right now; I just want to
know what the expected behavior is, so I can review the code to ensure that
it matches the expected behavior.

- Brian

Graham Dumpleton

Sep 18, 2008, 7:05:42 PM
to mod...@googlegroups.com
2008/9/19 Brian Smith <br...@briansmith.org>:

Are you asking about how it is done in Python 3.0 or older Python versions?

In older Python versions they just get passed straight through as
traditional Python strings.

In Python 3.0 I do conversions to Unicode, ie., default string type in
that version, as per discussion summary for Python 3.0 in:

http://www.wsgi.org/wsgi/Amendments_1.0

You asked about that before though, so bit confused as to what you are
wanting to know.

As to Python 2.6 and the -3 flag. I haven't looked at exactly what it
does, but I thought it just complains about syntactical elements
rather than fiddle with current distinction between traditional
strings and Unicode strings. As such, nothing has been done in C code
level for Python 2.6 and -3 option except to set the
Py_Py3kWarningFlag.

Graham

Brian Smith

Sep 19, 2008, 9:25:24 AM
to mod...@googlegroups.com
Graham Dumpleton wrote:
> 2008/9/19 Brian Smith <br...@briansmith.org>:

> In Python 3.0 I do conversions to Unicode, ie., default
> string type in that version, as per discussion summary for
> Python 3.0 in:
>
> http://www.wsgi.org/wsgi/Amendments_1.0
>
> You asked about that before though, so bit confused as to
> what you are wanting to know.

Thanks. I had not realized that had been added to the Amendments page. The
page says that strings are to be converted to Latin 1 + RFC 2047. But,
although the HTTP spec. references RFC 2047, it doesn't actually explain how
to use RFC 2047 in HTTP. Consequently, the HTTP working group of the IETF is
probably going to remove all references to RFC 2047 from the HTTP 1.1
specification in its next revision. So, it actually makes more sense to
reject non-latin-1 characters outright than it does to escape them with RFC
2047 encoding.

I am also curious what happens when the HTTP request contains header fields
that cannot be decoded from latin 1. You cannot just silently strip out the
bad header fields. And, rejecting the request outright is problematic too if
the application has all of its logging, etc. done using WSGI middleware (it
won't even see the bad requests in its log).

I think Python 3.0 really needs a slightly different WSGI interface to
handle these issues--an interface where the application can access the
request headers as bytestrings for any request (including invalid ones) and
where the application can have them converted to unicode transparently when
they are valid.

- Brian

Graham Dumpleton

Sep 23, 2008, 4:01:36 AM
to mod...@googlegroups.com
2008/9/19 Brian Smith <br...@briansmith.org>:

>
> Graham Dumpleton wrote:
>> 2008/9/19 Brian Smith <br...@briansmith.org>:
>> In Python 3.0 I do conversions to Unicode, ie., default
>> string type in that version, as per discussion summary for
>> Python 3.0 in:
>>
>> http://www.wsgi.org/wsgi/Amendments_1.0
>>
>> You asked about that before though, so bit confused as to
>> what you are wanting to know.
>
> Thanks. I had not realized that had been added to the Amendments page. The
> page says that strings are to be converted to Latin 1 + RFC 2047. But,
> although the HTTP spec. references RFC 2047, it doesn't actually explain how
> to use RFC 2047 in HTTP. Consequently, the HTTP working group of the IETF is
> probably going to remove all references to RFC 2047 from the HTTP 1.1
> specification in its next revision. So, it actually makes more sense to
> reject non-latin-1 characters outright than it does to escape them with RFC
> 2047 encoding.

I use PyUnicode_AsLatin1String() followed by PyBytes_AsString() when
converting Python 3.0 unicode string objects to byte strings. I didn't
understand the RFC 2047 stuff either and, possibly based on comments
made at the time in discussions, ignored that part of it.

> I am also curious what happens when the HTTP request contains header fields
> that cannot be decoded from latin 1. You cannot just silently strip out the
> bad header fields. And, rejecting the request outright is problematic too if
> the application has all of its logging, etc. done using WSGI middleware (it
> won't even see the bad requests in its log).
>
> I think Python 3.0 really needs a slightly different WSGI interface to
> handle these issues--an interface where the application can access the
> request headers as bytestrings for any request (including invalid ones) and
> where the application can have them converted to unicode transparently when
> they are valid.

I am quite ignorant on the intricacies of unicode, but I thought the
whole thing with Latin 1 was that all 255 characters would convert and
so it couldn't fail in converting to Unicode. Presuming I haven't got
this wrong as I usually do with unicode stuff, but the following in
Python 3.0 doesn't yield an error. Don't get issues for similar thing
on Python 2.3 either.

b1=b'\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

s1=str(b1, 'latin-1')

b2=bytes(s1,'latin-1')

b2

b'\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

So, I get back to what I started with.

Graham

Brian Smith

Sep 23, 2008, 3:46:46 PM
to mod...@googlegroups.com
Graham Dumpleton wrote:

> I am quite ignorant on the intricacies of unicode, but I
> thought the whole thing with Latin 1 was that all 255
> characters would convert and so it couldn't fail in
> converting to Unicode. Presuming I haven't got this wrong as
> I usually do with unicode stuff, but the following in Python
> 3.0 doesn't yield an error. Don't get issues for similar
> thing on Python 2.3 either.

Based on your test, it seems "Latin1" means ISO-8859-1 and not ISO/IEC
8859-1, so there is no problem.

- Brian

Toshio Kuratomi

Sep 29, 2008, 2:03:06 AM
to modwsgi
I just had problems with mod_wsgi on py2.x which prompted me to look
at the code in subversion which prompted me to perform the search that
found this thread. Forgive me for coming into the discussion late.

On Sep 23, 1:01 am, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:
> 2008/9/19 Brian Smith <br...@briansmith.org>:
>
>
>
>
>
> > Graham Dumpleton wrote:
> >> 2008/9/19 Brian Smith <br...@briansmith.org>:
> >> In Python 3.0 I do conversions to Unicode, ie., default
> >> string type in that version, as per discussion summary for
> >> Python 3.0 in:
>
> >>  http://www.wsgi.org/wsgi/Amendments_1.0
>
> >> You asked about that before though, so bit confused as to
> >> what you are wanting to know.
>
> > Thanks. I had not realized that had been added to the Amendments page. The
> > page says that strings are to be converted to Latin 1 + RFC 2047. But,
> > although the HTTP spec. references RFC 2047, it doesn't actually explain how
> > to use RFC 2047 in HTTP. Consequently, the HTTP working group of the IETF is
> > probably going to remove all references to RFC 2047 from the HTTP 1.1
> > specification in its next revision. So, it actually makes more sense to
> > reject non-latin-1 characters outright than it does to escape them with RFC
> > 2047 encoding.
>
Ugh. That amendment page is pretty hard to follow. I'll have to read
the full discussion to see what some of those things actually mean.
Please note, though, that the open question listed in the Amendment
actually has an answer: there are variables that should not be
python-3 unicode str type. For instance, filenames on Unix
filesystems do not have an encoding. So a file could be named with
the character for "greek lowercase pi" in any number of encodings. If
the user submitted a request for your wsgi application to retrieve
that file from the filesystem (or another URL) and process it, the
wsgi application will have to receive the raw bytes that represent the
filename instead of a unicode string otherwise it won't know how to
encode the bytes to ask the filesystem (or remote web server) for the
proper file.


> I use PyUnicode_AsLatin1String() followed by PyBytes_AsString() when
> converting Python 3.0 unicode string objects to byte strings. I didn't
> understand the RFC 2047 stuff either and, possibly based on comments
> made at the time in discussions, ignored that part of it.
>
I believe that the first part of this algorithm will fail. However,
I'm not versed in python's C API. What I think
PyUnicode_AsLatin1String() does is the equivalent of this python3
snippet:

url = '€.html'  # "Euro sign".html
url.encode('latin-1')

if so, then running that on python3.0 will show you that you'll get a
Unicode Error anytime you encounter unicode that cannot be encoded as
latin-1 (which is a problem since, as said before, filenames may need
to be referenced and they are not restricted to latin-1).
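A minimal sketch of that failure, under the assumption stated above that PyUnicode_AsLatin1String() behaves like str.encode('latin-1') (that equivalence is a guess about the C API, not something confirmed in this thread). The variable names are only for illustration:

```python
# Assumption: the C call is equivalent to str.encode('latin-1').
url = '\u20ac.html'  # EURO SIGN followed by ".html"

try:
    url.encode('latin-1')
    encoded_ok = True
    failure = None
except UnicodeEncodeError as exc:
    # The euro sign has no latin-1 byte, so the encode step raises.
    encoded_ok = False
    failure = exc

# UTF-8 can represent the same name without error.
utf8_bytes = url.encode('utf-8')
```

The same string encodes without complaint as UTF-8, which is why only the application can know which encoding is the right one to use.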

As you stated in your very first message in that discussion,
transforming from unicode to bytes needs to be done at the application
level as the application has knowledge about the encoding that is not
available elsewhere.

> > I am also curious what happens when the HTTP request contains header fields
> > that cannot be decoded from latin 1. You cannot just silently strip out the
> > bad header fields. And, rejecting the request outright is problematic too if
> > the application has all of its logging, etc. done using WSGI middleware (it
> > won't even see the bad requests in its log).
>
> > I think Python 3.0 really needs a slightly different WSGI interface to
> > handle these issues--an interface where the application can access the
> > request headers as bytestrings for any request (including invalid ones) and
> > where the application can have them converted to unicode transparently when
> > they are valid.
>
> I am quite ignorant on the intricacies of unicode, but I thought the
> whole thing with Latin 1 was that all 255 characters would convert and
> so it couldn't fail in converting to Unicode. Presuming I haven't got
> this wrong as I usually do with unicode stuff, but the following in
> Python 3.0 doesn't yield an error. Don't get issues for similar thing
> on Python 2.3 either.
>
That's not correct. But it's based in several facts :-)

Unicode attempts to define code points for all of the unique
characters (graphemes) in the world. On top of that, you have
encodings which turn those code points into bytes that can actually be
stored on disks and transferred across networks. One of the most
common encodings is UTF-8. UTF-8 has the property of encoding the
ASCII subset of unicode characters to the same bytes as ASCII. This
is 128 characters. Everything outside of that range needs multiple
bytes to encode and is thus not compatible with older character sets
like latin-1.

In terms of the code points assigned to the glyphs in unicode and the
bytes assigned to the characters in latin-1, there may or may not be
compatibility but that doesn't help anyone as there's no encoding of
unicode (actual bytes used to represent the characters) that has
compatibility. ie: 0xf0 is the byte that represents ð ('LATIN SMALL
LETTER ETH') in the latin-1 encoding. code point '\u00f0' is the ð
('LATIN SMALL LETTER ETH') in unicode as well. But no encoding of
unicode places the single byte 0xf0 on disk to represent that
character. UTF-8 places the bytes 0xc3 0xb0 on disk to represent it.
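The difference between a code point and its encoded bytes can be checked directly. This is just an illustration of the ETH example above using the standard library:

```python
import unicodedata

ch = '\u00f0'                        # code point U+00F0
name = unicodedata.name(ch)          # official Unicode character name
latin1_bytes = ch.encode('latin-1')  # one byte, same value as the code point
utf8_bytes = ch.encode('utf-8')      # two bytes, neither of which is 0xf0
```

The code point number and the latin-1 byte value coincide, but the bytes that UTF-8 writes for the same character do not.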

Most encodings of Unicode are not even compatible with ASCII. For
instance, utf-16 (which is close to python's internal representation
of unicode strings) is a two byte encoding for the ASCII and latin-1
ranges. This means that there are many Null bytes in a utf-16
encoding of ASCII or latin-1 text that are not present in a utf-8,
ASCII, or latin-1 encoding of the same data.
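A short sketch of the Null byte point, using a three-letter ASCII string. The 'utf-16-le' codec is chosen here only to fix the byte order and leave out the byte order mark; plain 'utf-16' would add one:

```python
text = 'abc'

ascii_bytes = text.encode('ascii')
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')  # explicit little-endian, no BOM

# Every ASCII character gains a zero byte in the two-byte encoding.
null_count = utf16_bytes.count(0)
```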

====

Your test is a rather roundabout method of showing that the glyphs
represented in latin-1 are a subset of the glyphs in unicode :-) Let
me explain line-by-line what's happening. That might give you a
better understanding of what unicode is::

> b1=b'\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>

1) You define a sequence of bytes that includes the complete set of
possible values. Pretty straight forward. The one interesting thing
here is that you are also showing that the printable ASCII characters
are a subset of latin-1 due to putting the ASCII printable characters
into this byte sequence.

> s1=str(b1, 'latin-1')
>
2) Now you say pretend these bytes are latin-1 characters. Give me a
unicode string with the equivalent glyphs to those characters.

At this point you have an abstract string of characters stored in s1.
(Internally they are two byte sequences representing each character
but externally, they are just characters on your screen.)

> b2=bytes(s1,'latin-1')
>
3) This says, take the characters in my abstract, unicode string and
create specific bytes for them using the latin-1 encoding.

> b2
>
> b'\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>
> So, I get back to what I started with.
>
4) You print those bytes and get back the same sequence as you started
with. This is because latin-1 is a subset of unicode, so you were
able to transform your bytes into unicode and back without loss of
data.
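The four steps can be condensed into a sketch that runs as-is. Here bytes(range(1, 256)) stands in for the long literal (it produces the same 255 byte values), and code_points is an extra name added only to show what the intermediate string holds:

```python
# Step 1: every byte value from 0x01 to 0xff.
b1 = bytes(range(1, 256))

# Step 2: treat the bytes as latin-1 and get an abstract string.
s1 = str(b1, 'latin-1')

# Step 3: turn the string back into bytes with the same encoding.
b2 = bytes(s1, 'latin-1')

# Each character's code point equals the byte it came from.
code_points = [ord(c) for c in s1]
```

Step 4 is then just the observation that b2 compares equal to b1.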

Some things to try that could help you understand more of what's going
on:

* You can use a different encoding for step #2 and #3 (for instance,
cp737 which has Greek letters). If the encoding also defines a character
for every byte value you'll get back what you started with in step #4.

* You can ask python to pretend that the byte sequence is an encoding
of unicode ('utf-8', 'utf-16', or 'utf-32', for instance) in step #2.
Those will all raise a Unicode Error as the bytes are not compatible
with any of those encodings.
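The second experiment can be sketched as follows. The failed dictionary is a name introduced here for illustration; it records, per encoding, whether decoding the same byte sequence raised an error:

```python
b1 = bytes(range(1, 256))

failed = {}
for encoding in ('utf-8', 'utf-16', 'utf-32'):
    try:
        b1.decode(encoding)
        failed[encoding] = False
    except UnicodeDecodeError:
        # The byte sequence is not valid data in this encoding.
        failed[encoding] = True
```

All three raise for this particular input; other byte sequences may happen to decode, which is what makes guessing an encoding dangerous.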

In step #3 you can transform your string from unicode to an encoding
of unicode (for instance, 'utf-8'). This will yield different bytes
in the end but they will represent the same characters as your
original latin-1 characters did, just in a different encoding. To
turn them back, you would then do:

s2 = str(b2, 'utf-8')
b3 = s2.encode('latin-1')
b3

In summary, I think mod_wsgi should not be transforming bytes into
unicode or unicode into bytes. That is something that the application
has to handle. But I haven't read through the discussion yet so there
might be a specific counter to my filesystem argument. I'll have to
keep reading.
-Toshio

Graham Dumpleton

Sep 29, 2008, 3:47:04 AM
to mod...@googlegroups.com
It will take me a while to absorb what you are saying but a few comments.

The most important thing to realise is that for wsgi.input it will be
a stream of bytes. Thus it is up to the application when processing
the request content to say what encoding is to be used when converting
to Unicode. This would be presumably determined by application from
ENCTYPE as understood from form post, or through some other
application knowledge about request content.
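As a rough sketch of what that looks like on the application side: the helper below reads wsgi.input as bytes and decodes it with a charset taken from the Content-Type header. The function name read_body_text, the default_charset fallback, and the hand-built environ are all hypothetical; a real application might pick the encoding from form metadata or its own knowledge instead.

```python
import io


def read_body_text(environ, default_charset='utf-8'):
    """Read the request body as bytes and decode it in the application."""
    length = int(environ.get('CONTENT_LENGTH') or 0)
    raw = environ['wsgi.input'].read(length)  # always bytes

    # Look for a charset parameter; fall back to an application default.
    charset = default_charset
    content_type = environ.get('CONTENT_TYPE', '')
    for param in content_type.split(';')[1:]:
        key, _, value = param.strip().partition('=')
        if key.lower() == 'charset' and value:
            charset = value.strip('"')
    return raw.decode(charset)


# Simulated request, standing in for what a WSGI server would supply.
body = 'caf\u00e9'.encode('latin-1')
environ = {
    'CONTENT_TYPE': 'text/plain; charset=iso-8859-1',
    'CONTENT_LENGTH': str(len(body)),
    'wsgi.input': io.BytesIO(body),
}
text = read_body_text(environ)
```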

As to the HTTP request headers, the RFCs say they are effectively
latin-1. Thus, all HTTP_? variables in WSGI environ can only be
processed as latin-1 when converting to Unicode.

As to the remainder of the WSGI environment, these come from
underlying Apache web server which generally treats them as latin-1 as
well because it uses normal C character strings in processing and
doesn't deal with encodings as far as I know. That is, not sure that
Apache supports UTF or Unicode configuration files and since data is
often sourced from there, would transfer across as latin-1 as a
result. Others derive from information in HTTP headers and so also
would be latin-1.

What you say about file path may or may not be relevant, am not sure.
The WSGI environment variable this would affect would be
SCRIPT_FILENAME which isn't even a WSGI requirement but just happens
to be something that Apache provides. The value of that will be a
composite of latin-1 strings from Apache configuration with snippets
from path in URI. The path in URI is also present in REQUEST_URI and
anything beyond application mount point in PATH_INFO. What I don't
know is what the RFC specifications say about the path in the URI and
what characters are allowed in the original path in the URI and so how
that affects what could end up in these variables. For SCRIPT_NAME,
which is the application URL mount point, since that is defined by
Apache configuration, it would appear to be restricted to latin-1,
although not sure what would happen for case that wildcard or regular
expression is used in doing the match. Also, for resource based match,
actually using path partly from file as it appears in file system.

For response headers and content, the application can either generate
bytes and thus control the encoding, or it will fallback to trying to
convert it as latin-1 if Unicode supplied, so like wsgi.input, no
problem there.
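In Python terms the fallback described above might look roughly like this. It is only a sketch of the behaviour as described in this message, not mod_wsgi's actual C code, and header_value_to_bytes is a made-up name:

```python
def header_value_to_bytes(value):
    """Return the bytes to send for one response header value."""
    if isinstance(value, bytes):
        # The application chose the encoding itself; pass it through.
        return value
    # Unicode supplied: fall back to latin-1, which can fail.
    return value.encode('latin-1')


latin1_ok = header_value_to_bytes('caf\u00e9')
passthrough = header_value_to_bytes('\u20ac.html'.encode('utf-8'))

try:
    header_value_to_bytes('\u20ac.html')
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

Bytes always get through, while a unicode value only works if every character has a latin-1 equivalent.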

So, as far as I can see, it all comes down to SCRIPT_FILENAME,
REQUEST_URI, PATH_INFO and SCRIPT_NAME and here it is in part dictated
by what characters can occur in the URI in the original request line
and if encodings are allowed, how they are annotated as being of a
particular encoding. Thus it seems one probably needs to look at
rfc2396 or rfc3986 to work that out, but it also is going to be
dictated by how Apache internally treats everything as normal C
character strings and not wide character strings. For the last of those
variables, then file system path encodings, present or not, also are
an issue.

Anyone else know any better so as to comment?

Graham


2008/9/29 Toshio Kuratomi <a.ba...@gmail.com>:

Brian Smith

Sep 29, 2008, 11:11:52 AM
to mod...@googlegroups.com
Graham Dumpleton wrote:
> As to the HTTP request headers, the RFCs say they are
> effectively latin-1. Thus, all HTTP_? variables in WSGI
> environ can only be processed as latin-1 when converting to Unicode.

Anything that is part of a URI (e.g. SCRIPT_NAME, REQUEST_URI) must be ASCII
by definition.

Anything that comes from an HTTP header must be Latin-1 according to the
specification. However, there are applications (especially in Asia) where
raw (unencoded) UTF-8/Shift-JIS/etc. octet sequences get put in HTTP
headers. For example, Internet Explorer expects this kind of encoding
sometimes for Content-Disposition. I encourage everybody to avoid non-ASCII
data in headers whenever possible.

Anything that comes from a file path will be a raw string of bytes (on
Linux), that you should interpret according to the file system's encoding
(usually UTF-8 on modern Linuxes, I believe). In particular, do not assume
that you can just include a file name or path in a URI without encoding it,
and don't assume that file paths can be encoded into ASCII or Latin-1.
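A minimal sketch of encoding a file name before putting it in a URI, using urllib.parse from the Python 3 standard library. The file name here is a hypothetical example whose bytes happen to be UTF-8:

```python
from urllib.parse import quote, unquote_to_bytes

raw_name = b'caf\xc3\xa9.txt'  # file name as raw bytes from the file system

# Percent-encode the bytes; the result is pure ASCII and safe in a URI.
uri_segment = quote(raw_name)
ascii_bytes = uri_segment.encode('ascii')

# Decoding the URI segment gives back exactly the original bytes.
round_tripped = unquote_to_bytes(uri_segment)
```

Nothing in this step needs to know which character encoding the file name bytes are in.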

Another thing to consider is the encoding of Apache's configuration files. I
don't know what encoding it uses.

- Brian

Clodoaldo Pinto Neto

Sep 29, 2008, 11:24:20 AM
to mod...@googlegroups.com
2008/9/29 Brian Smith <br...@briansmith.org>:

This morning I tried this on my Fedora 8 desktop, which is utf-8, including
all the files:

DocumentRoot /var/www/html/çãé

and it worked and sent index.html as expected.

Regards, Clodoaldo


Toshio Kuratomi

Sep 29, 2008, 2:08:24 PM
to modwsgi


On Sep 29, 12:47 am, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:
> It will take me a while to absorb what you are saying but a few comments.
>
> The most important thing to realise is that for wsgi.input it will be
> a stream of bytes. Thus it is up to the application when processing
> the request content to say what encoding is to be used when converting
> to Unicode. This would be presumably determined by application from
> ENCTYPE as understood from form post, or through some other
> application knowledge about request content.
>
<nod>. This is as it should be.

> As to the HTTP request headers, the RFCs say they are effectively
> latin-1. Thus, all HTTP_? variables in WSGI environ can only be
> processed as latin-1 when converting to Unicode.
>
Converting these headers to unicode will lead to mangled data at
times. Let's say that some web app needs to keep track of the referer
information for some reason. If the app is referred to from
http://localhost/€.html ("Euro symbol".html ) and it is encoded as
utf-8 on the server then the server will send a header with this
sequence of bytes::

Referer http://localhost/%e2%82%ac.html

If mod_wsgi assumes latin-1 and converts that into unicode before it
hits the app, the app will see this::

Referer http://localhost/â%82¬.html

Does mod_wsgi have knowledge of the encoding that the server used for
the file? I'm somewhat doubtful of that as apache itself doesn't have
knowledge of the encoding that the filename on the server is in, it's
just passing the bytes that the filename has on the filesystem.

> As to the remainder of the WSGI environment, these come from
> underlying Apache web server which generally treats them as latin-1 as
> well because it uses normal C character strings in processing and
> doesn't deal with encodings as far as I know. That is, not sure that
> Apache supports UTF or Unicode configuration files and since data is
> often sourced from there, would transfer across as latin-1 as a
> result. Others derive from information in HTTP headers and so also
> would be latin-1.
>
Note: latin-1 is an encoding. So you should be using "bytes" or
"sequence of bytes" in the above. Because latin-1 is an encoding
where all bytes from 0x00 to 0xff are valid, the effect in terms of
the byte stream is the same. However, it is not the same when we
think about converting "a byte stream to unicode" vs converting
"latin-1 to unicode". The latter is possible. The former is only
possible if there actually is an encoding that you know or you're
willing to guess. So apache transfers unicode data just fine, the
problem is if something interprets the data as belonging to a specific
encoding (like latin-1) that it is not.
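A small sketch of that distinction, using the euro sign encoded as utf-8 as the byte stream. Decoding as latin-1 never raises, but it silently produces the wrong characters when the guess is wrong:

```python
raw = '\u20ac'.encode('utf-8')    # three bytes: e2 82 ac

guessed = raw.decode('latin-1')   # never fails, but yields three characters
correct = raw.decode('utf-8')     # one character, the euro sign

# The wrong guess is reversible only if you know it was made.
recovered = guessed.encode('latin-1').decode('utf-8')
```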

> What you say about file path may or may not be relevant, am not sure.
> The WSGI environment variable this would affect would be
> SCRIPT_FILENAME which isn't even a WSGI requirement but just happens
> to be something that Apache provides. The value of that will be a
> composite of latin-1 strings from Apache configuration with snippets
> from path in URI. The path in URI is also present in REQUEST_URI and
> anything beyond application mount point in PATH_INFO. What I don't
> know is what the RFC specifications say about the path in the URI and
> what characters are allowed in the original path in the URI and so how
> that affects what could end up in these variables. For SCRIPT_NAME,
> which is the application URL mount point, since that is defined by
> Apache configuration, it would appear to be restricted to latin-1,
> although not sure what would happen for case that wildcard or regular
> expression is used in doing the match. Also, for resource based match,
> actually using path partly from file as it appears in file system.
>
Note: Since apache is just dealing with byte sequences, latin-1 is
actually not a "restriction". Any single byte from 0x00 to 0xff is
valid latin-1 as your test showed. Since apache is just treating the
URIs as a sequence of bytes, they can be made up of any sequence of
bytes. Any multibyte character will be served fine as a sequence of
multiple bytes. The problem is if the wsgi server treats one of these
multibyte characters as latin-1. The bytes might not represent
latin-1 data; they could represent multi-byte characters instead.

> For response headers and content, the application can either generate
> bytes and thus control the encoding, or it will fallback to trying to
> convert it as latin-1 if Unicode supplied, so like wsgi.input, no
> problem there.
>
Unlike wsgi.input where the application *must* decide how to decode
the data, you are trying to do automatic encoding of data in the wsgi
server here. This will cause tracebacks on some unicode string input
but not others (which is one of the reasons that people hate unicode
handling in python-2). The tracebacks occur because latin-1
characters are a subset of Unicode characters (note that we're not
dealing with code-point to byte mapping here, we're dealing with
character mapping). So you can always convert latin-1 to unicode.
But you can't always convert Unicode to latin-1 (which is what this
automatic conversion would attempt). It's much better for the
application layer to always hand mod_wsgi byte types, never unicode.

Note, that I see in the wsgi spec that any unicode handed to the
server must be the subset whose code points are latin-1 equivalent.
This takes care of the problem but is somewhat silly. We're basically
using latin-1 as a marshalling format for passing bytes over the
wire. So we have to convert the unicode to bytes as the first step
in changing unicode characters outside the latin-1 range into bytes
that can go over the wire. At that point converting the bytes back to
unicode pretending they're latin-1 instead of utf-8 is just an extra
step for no reason.

> So, as far as I can see, it all comes down to SCRIPT_FILENAME,
> REQUEST_URI, PATH_INFO and SCRIPT_NAME and here it is in part dictated
> by what characters can occur in the URI in the original request line
> and if encodings are allowed, how they are annotated as being of a
> particular encoding. Thus it seems one probably needs to look at
> rfc2396 or rfc3986  to work that out, but it also is going to be
> dictated by how Apache internally treats everything as normal C
> character strings and not wide character strings. For the last of those
> variables, then file system path encodings, present or not, also are
> an issue.
>
> Anyone else know any better so as to comment?
>
There are many times when the webserver doesn't know the encoding of
something so cannot annotate or encode the data. Since file system
paths are an easy example let's continue to run with that. Take a look
at this page::
http://toshio.fedorapeople.org/u/

I have two files there. Both are named ½ñ.html. (one-half tilde-
lowercase-n .html). However one of the filenames is encoded with
latin-1 and the other with utf-8. If you switch between character
encodings for the web page (firefox3: View::Character Encoding::UTF-8
vs View::Character Encoding::Western (iso 8859-1) ) you'll see that
you can make one or the other show its name correctly. Why isn't
apache able to display both correctly at the same time? It's because
apache doesn't know what the encoding of the filenames are. The
filesystem is just handing it as a sequence of bytes.

The URI to request these properly is also a sequence of bytes. If you
view source on the page you'll see that each is referenced with a byte
sequence::
%bd%f1.html
%c2%bd%c3%b1.html
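Those two byte sequences can be reproduced from the same file name. This is a sketch using urllib.parse.quote; the .lower() call is only there to match the lowercase hex shown in the page source, since quote() emits uppercase hex digits:

```python
from urllib.parse import quote

name = '\u00bd\u00f1.html'  # one-half, n-with-tilde, ".html"

# Same characters, two different byte sequences on disk and in the URI.
latin1_uri = quote(name.encode('latin-1')).lower()
utf8_uri = quote(name.encode('utf-8')).lower()
```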

(Note: firefox3 has some bugs here. If you look at the URL's firefox
is going to generate for the links in the statusbar, you'll see that
they're both ½ñ.html. The first time I click on either of the links,
firefox takes me to the utf-8 named page. If I manually load each of
the URLs via the byte sequence, firefox then stores the fact that
there's two separate pages and can retrieve either of them via the
link.)

Also note that nowhere in the page is there an annotation of the
encoding for either of these sequences. That's because Apache does
not know the encoding.

As far as I can tell, the WSGI specification almost got unicode
handling right::
http://www.python.org/dev/peps/pep-0333/#unicode-issues

Paragraph 1 is perfect. Paragraph 2 and 4 are interpretable in good
or bad ways but if they existed with just paragraph 1 the tendency
would probably be to interpret it in a good way. Paragraph 3 is
entirely wrong for python3. The other paragraphs amount to: Use byte
strings everywhere. Do not use unicode strings. Paragraph 3 says, in
python 3, use unicode strings and make sure the characters are
restricted to code points that map to latin-1 characters. If this
were followed it would make life extremely difficult for application
writers. Let's take €.html (Euro symbol.html) as an example since
the Euro symbol doesn't occur in latin-1:

Let's say that I want to see if I was called as €.html (Euro
symbol.html). This is what happens (Note: python3 environment):

mod_wsgi receives a sequence of bytes from apache.
It transforms those into unicode by pretending that those bytes are
latin-1 and sticks them into SCRIPT_NAME.

The application then has to::
marshalled_form = environ['SCRIPT_NAME']
byte_sequence = marshalled_form.encode('latin-1')
real_string = str(byte_sequence, 'utf-8')

If we were forwarding this on to the application as a sequence of
bytes instead, we could skip mod_wsgi changing those bytes to a wrong
unicode string and the application converting the unicode string back
into bytes.
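A minimal sketch of that round trip, assuming the request was for a UTF-8 encoded /€.html and that the server decodes the raw bytes as latin-1. The variable names are illustrative only and are not mod_wsgi internals:

```python
# Bytes as they would arrive from Apache for a UTF-8 encoded "/€.html".
raw_bytes = '/€.html'.encode('utf-8')

# What a latin-1 decoding server would place in environ['SCRIPT_NAME'].
marshalled_form = raw_bytes.decode('latin-1')

# What the application then has to do to recover the real name.
byte_sequence = marshalled_form.encode('latin-1')
real_string = byte_sequence.decode('utf-8')

print(repr(marshalled_form))  # the mis-decoded intermediate string
print(real_string)            # the name the user actually requested
```

The intermediate string is never useful to the application; it only exists to carry the original bytes through a str-typed interface.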

Now let's look at the reverse case: Let's say that the application
wants to redirect the user to €.html (Euro symbol.html). For that,
they have to enter this into the location header::
real_url = '€.html'
byte_sequence = real_url.encode('utf-8')
marshalled_form = str(byte_sequence, 'latin-1')
headers = [('location', marshalled_form)]

mod_wsgi then has to loop through the headers taking all those unicode
strings and convert them back into byte_sequences. It's much better
if we just pass things as byte type into and out of mod_wsgi and let
the application decide if it needs to operate on them as unicode
strings.

-Toshio

Brian Smith

Sep 29, 2008, 6:24:39 PM9/29/08
to mod...@googlegroups.com
Toshio Kuratomi wrote:

> "Graham Dumpleton" wrote:
> > As to the HTTP request headers, the RFCs say they are effectively
> > latin-1. Thus, all HTTP_? variables in WSGI environ can only be
> > processed as latin-1 when converting to Unicode.
> >
> Converting these headers to unicode will lead to mangled data
> at times. Let's say that some web app needs to keep track of
> the referer information for some reason. If the app is
> referred to from http://localhost/€.html ("Euro symbol".html
> ) and it is encoded as
> utf-8 on the server then the server will send a header with
> this sequence of bytes::
>
> Referer http://localhost/%e2%82%ac.html
>
> If mod_wsgi assumes latin-1 and converts that into unicode
> before it hits the app, the app will see this::
>
> Referer http://localhost/â%82¬.html

No, it will leave it as http://localhost/%e2%82%ac.html. It does (or should do) the Latin-1-to-Unicode conversion before it decodes URL encoding.

> Unlike wsgi.input where the application *must* decide how to
> decode the data, you are trying to do automatic encoding of
> data in the wsgi server here. This will cause tracebacks on
> some unicode string input but not others (which is one of the
> reasons that people hate unicode handling in python-2). The
> tracebacks occur because latin-1 characters are a subset of
> Unicode characters (note that we're not dealing with
> code-point to byte mapping here, we're dealing with character
> mapping). So you can always convert latin-1 to unicode.
> But you can't always convert Unicode to latin-1 (which is
> what this automatic conversion would attempt). It's much
> better for the application layer to always hand mod_wsgi byte
> types, never unicode.

The HTTP standards mandate Latin-1. Python 3.0 says all strings are Unicode. The encoding/decoding is needed to bridge the gap. Treating the HTTP headers as raw sequences of bytes and requiring Python applications to do their own manual decoding/encoding would not be Pythonic and the Python community wouldn't accept it.

> This takes care of the problem but is somewhat silly. We're
> basically using latin-1 as a marshalling format for passing
> bytes over the wire. So we have to convert the unicode to
> bytes as the first step in changing unicode characters
> outside the latin-1 range into bytes that can go over the
> wire. At that point converting the bytes back to unicode
> pretending they're latin-1 instead of utf-8 is just an extra
> step for no reason.

Again, I think you are misunderstanding the interaction between URL encoding and character encoding conversion. Mod_wsgi will (should) never do or undo URL-encoding itself for non-ASCII (%80-%FF) sequences.

> I have two files there. Both are named ½ñ.html. (one-half
> tilde- lowercase-n .html). However one of the filenames is
> encoded with
> latin-1 and the other with utf-8. If you switch between
> character encodings for the web page (firefox3:
> View::Character Encoding::UTF-8 vs View::Character
> Encoding::Western (iso 8859-1) ) you'll see that you can make
> one or the other show its name correctly. Why isn't apache
> able to display both correctly at the same time? It's
> because apache doesn't know what the encoding of the
> filenames are. The filesystem is just handing it as a
> sequence of bytes.

That issue doesn't really affect mod_wsgi though. All mod_wsgi can do is try to decode filenames given the information the OS gives it. If the users are using multiple encodings for file names then they deserve whatever bad behavior they get. Actually, I would go further and say that if you are using any encoding for filenames other than NFC UTF-8 on Linux then you are asking for trouble.

> mod_wsgi receives a sequence of bytes from apache.
> It transforms those into unicode by pretending that those bytes are
> latin-1 and sticks them into SCRIPT_NAME.

IMO, mod_wsgi should just drop SCRIPT_NAME and all other non-WSGI environ keys except REQUEST_URI (REQUEST_URI is needed to get the raw, un-decoded URI).

> Now let's look at the reverse case: Let's say that the
> application wants to redirect the user to €.html (Euro
> symbol.html). For that, they have to enter this into the
> location header::
> real_url = '€.html'
> byte_sequence = real_url.encode('utf-8')
> marshalled_form = str(byte_sequence, 'latin-1')
> headers = [('location', marshalled_form)]

No, they have to URL-encode marshalled_form into ASCII first, because the Location header holds a URI, and URIs are always ASCII-only.

Regards,
Brian

Graham Dumpleton

Sep 29, 2008, 7:33:14 PM9/29/08
to mod...@googlegroups.com
2008/9/30 Toshio Kuratomi <a.ba...@gmail.com>:

>> For response headers and content, the application can either generate
>> bytes and thus control the encoding, or it will fallback to trying to
>> convert it as latin-1 if Unicode supplied, so like wsgi.input, no
>> problem there.
>>
> Unlike wsgi.input where the application *must* decide how to decode
> the data, you are trying to do automatic encoding of data in the wsgi
> server here. This will cause tracebacks on some unicode string input
> but not others (which is one of the reasons that people hate unicode
> handling in python-2). The tracebacks occur because latin-1
> characters are a subset of Unicode characters (note that we're not
> dealing with code-point to byte mapping here, we're dealing with
> character mapping). So you can always convert latin-1 to unicode.
> But you can't always convert Unicode to latin-1 (which is what this
> automatic conversion would attempt). It's much better for the
> application layer to always hand mod_wsgi byte types, never unicode.

The amendment page says:

When running under Python 3, applications SHOULD produce bytes
output and headers

So, the ideal situation is that the application would always produce
bytes and so it is the application which is supposed to deal with it.

That mod_wsgi falls back to converting any Unicode strings to bytes is
a fail safe as dictated by:

When running under Python 3, servers and gateways MUST accept
strings as application output or headers, under the existing rules (i.e.,
s.encode('latin-1') must convert the string to bytes without an
exception)

and is more to protect lazy programmers, plus make it easier to port
WSGI applications for Python 2.X.

In other words, your application is the one who should be dealing with
it in the first place if you want to be sure about what is being
produced. It only becomes an issue where the WSGI application hasn't
done what it really should have done.

Graham

Graham Dumpleton

Sep 29, 2008, 7:38:43 PM9/29/08
to mod...@googlegroups.com
2008/9/30 Brian Smith <br...@briansmith.org>:

>> mod_wsgi receives a sequence of bytes from apache.
>> It transforms those into unicode by pretending that those bytes are
>> latin-1 and sticks them into SCRIPT_NAME.
>
> IMO, mod_wsgi should just drop SCRIPT_NAME and all other non-WSGI environ keys except REQUEST_URI (REQUEST_URI is needed to get the raw, un-decoded URI).

Did you perhaps mean SCRIPT_FILENAME? The WSGI specification requires
SCRIPT_NAME.

As to this whole discussion, as much as it is interesting there is
nothing I can do about it. It really needs to be brought up on the
Python WEB-SIG where I originally raised the issue of Python 3.0
support for WSGI. I can only implement what consensus comes out of
discussion on Python WEB-SIG in lieu of them not wanting to come out
with an official revised specification for WSGI.

Graham

Toshio Kuratomi

Sep 30, 2008, 3:11:10 AM9/30/08
to modwsgi


On Sep 29, 3:24 pm, "Brian Smith" <br...@briansmith.org> wrote:
> Toshio Kuratomi wrote:
> > "Graham Dumpleton" wrote:
> > > As to the HTTP request headers, the RFCs say they are effectively
> > > latin-1. Thus, all HTTP_? variables in WSGI environ can only be
> > > processed as latin-1 when converting to Unicode.
>
> > Converting these headers to unicode will lead to mangled data
> > at times. Let's say that some web app needs to keep track of
> > the referer information for some reason. If the app is
> > referred to from http://localhost/€.html ("Euro symbol".html
> > ) and it is encoded as
> > utf-8 on the server then the server will send a header with
> > this sequence of bytes::
>
> >   Referer  http://localhost/%e2%82%ac.html
>
> > If mod_wsgi assumes latin-1 and converts that into unicode
> > before it hits the app, the app will see this::
>
> >   Referer http://localhost/â%82¬.html
>
> No, it will leave it as http://localhost/%e2%82%ac.html. It does (or should do) the Latin-1-to-Unicode conversion before it decodes URL encoding.
>
uhm... you're wrong here. url encoding and decoding operates on
bytes. unicode is not bytes. so you can't go from byte string to
unicode and then pass it through url decode. Or I suppose you can,
but it isn't by any means the opposite of what you did to get the url
escaped bytes so it's pretty senseless.

> > Unlike wsgi.input where the application *must* decide how to
> > decode the data, you are trying to do automatic encoding of
> > data in the wsgi server here.  This will cause tracebacks on
> > some unicode string input but not others (which is one of the
> > reasons that people hate unicode handling in python-2). The
> > tracebacks occur because latin-1 characters are a subset of
> > Unicode characters (note that we're not dealing with
> > code-point to byte mapping here, we're dealing with character
> > mapping). So you can always convert latin-1 to unicode.
> > But you can't always convert Unicode to latin-1 (which is
> > what this automatic conversion would attempt). It's much
> > better for the application layer to always hand mod_wsgi byte
> > types, never unicode.
>
> The HTTP standards mandate Latin-1. Python 3.0 says all strings are Unicode. The encoding/decoding is needed to bridge the gap. Treating the HTTP headers as raw sequences of bytes and requiring Python applications to do their own manual decoding/encoding would not be Pythonic and the Python community wouldn't accept it.
>
I disagree. You are dealing with byte sequences here so you need to
call them bytes. This *is* pythonic (as much as you can define that
for a type that hasn't existed before :-). Look at the WSGI
specification for python-2. It specifies storing the values in str
type and not in unicode type and that's accepted by the Python
community as Pythonic.

> > This takes care of the problem but is somewhat silly.  We're
> > basically using latin-1 as a marshalling format for passing
> > bytes over the wire. So we have to convert the unicode to
> > bytes as the first step in changing unicode characters
> > outside the latin-1 range into bytes that can go over the
> > wire. At that point converting the bytes back to unicode
> > pretending they're latin-1 instead of utf-8 is just an extra
> > step for no reason.
>
> Again, I think you are misunderstanding the interaction between URL encoding and character encoding conversion. Mod_wsgi will (should) never do or undo URL-encoding itself for non-ASCII (%80-%FF) sequences.
>
I think that you are misunderstanding the interaction. And I think
that % sequences should definitely be done by mod_wsgi. Ending up
with a unicode string containing %encoded sequences is even worse than
the other scenarios I described as the application then has to convert
from unicode to byte string, unquote the url quoting, and then convert
back to unicode. (Although this is alleviated in python3 by the fact
that urllib.parse.quote()/unquote() take an encoding argument. So the
extra steps are taken care of by the function).
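As a rough illustration of those encoding arguments, here is a small sketch using the escaped name from the earlier example. Only the standard library functions urllib.parse.quote() and unquote() are used; the choice of encodings is the point being shown:

```python
from urllib.parse import quote, unquote

escaped = '%e2%82%ac.html'

# Interpreting the escaped bytes as UTF-8 gives back the real name.
as_utf8 = unquote(escaped, encoding='utf-8')

# Interpreting the very same bytes as latin-1 gives mojibake instead.
as_latin1 = unquote(escaped, encoding='latin-1')

# Going the other way, the caller again has to pick the encoding.
requoted = quote('€.html', encoding='utf-8')

print(as_utf8)
print(repr(as_latin1))
print(requoted)
```

The %-unescaping step itself is mechanical; what cannot be automated is the choice of encoding passed in.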

It would be much better for mod_wsgi to do the url quoting for the
user as converting between bytes and %escape sequences is 100%
automatable. This is unlike converting between unicode and a sequence
of bytes where something has to decide what the character encoding
is. So -- WSGI should take care of %encoding because that's a job for
a computer anyway. WSGI should not take care of the byte => unicode
conversion because it doesn't know what encoding the bytes are in.

> > I have two files there.  Both are named  ½ñ.html. (one-half
> > tilde- lowercase-n .html).  However one of the filenames is
> > encoded with
> > latin-1 and the other with utf-8.  If you switch between
> > character encodings for the web page (firefox3:
> > View::Character Encoding::UTF-8 vs View::Character
> > Encoding::Western (iso 8859-1) ) you'll see that you can make
> > one or the other show its name correctly.  Why isn't apache
> > able to display both correctly at the same time?  It's
> > because apache doesn't know what the encoding of the
> > filenames are.  The filesystem is just handing it as a
> > sequence of bytes.
>
> That issue doesn't really affect mod_wsgi though. All mod_wsgi can do is try to decode filenames given the information the OS gives it. If the users are using multiple encodings for file names then  they deserve whatever bad behavior they get. Actually, I would go further and say that if you are using any encoding for filenames other than NFC UTF-8 on Linux then you are asking for trouble.

It affects the apps. mod_wsgi can either make it as easy as possible
for the apps to get to the filenames or make it hard. Trying to
decode the filenames with a wrong encoding makes it hard. Leaving the
user with bytes is much easier for the apps to handle.

I'd love for the whole world to be utf-8. Unfortunately, that's just
not the case. So you can either fail miserably or write code that
deals with it. Look at the technical difficulties with utf-8 locales
and Asian scripts for one reason that utf-8 everywhere is a hope but
not a reality.

> > Now let's look at the reverse case:  Let's say that the
> > application wants to redirect the user to €.html (Euro
> > symbol.html).  For that, they have to enter this into the
> > location header::
> >   real_url = '€.html'
> >   byte_sequence = real_url.encode('utf-8')
> >   marshalled_form = str(byte_sequence, 'latin-1')
> >   headers = [('location', marshalled_form)]
>
> No, they have to URL-encode marshalled_form into ASCII first, because the Location header holds a URI, and URIs are always ASCII-only.
>
Well... between marshalled_form and HTTP HEADER, there needs to be a
url escaping sequence. but whether that needs to happen outside of
mod_wsgi or inside is part of what you and I are debating. You do see
from your example above why your initial sequence for decoding at the
top of the post is wrong, though? Your decoding sequence at the top
placed the ASCII escaping between byte_sequence and real_url instead
of between marshalled_form and headers.

-Toshio

Toshio Kuratomi

Sep 30, 2008, 3:22:47 AM9/30/08
to modwsgi


On Sep 29, 4:33 pm, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:
> 2008/9/30 Toshio Kuratomi <a.bad...@gmail.com>:
>
>
>
> >> For response headers and content, the application can either generate
> >> bytes and thus control the encoding, or it will fallback to trying to
> >> convert it as latin-1 if Unicode supplied, so like wsgi.input, no
> >> problem there.
>
> > Unlike wsgi.input where the application *must* decide how to decode
> > the data, you are trying to do automatic encoding of data in the wsgi
> > server here. This will cause tracebacks on some unicode string input
> > but not others (which is one of the reasons that people hate unicode
> > handling in python-2). The tracebacks occur because latin-1
> > characters are a subset of Unicode characters (note that we're not
> > dealing with code-point to byte mapping here, we're dealing with
> > character mapping). So you can always convert latin-1 to unicode.
> > But you can't always convert Unicode to latin-1 (which is what this
> > automatic conversion would attempt). It's much better for the
> > application layer to always hand mod_wsgi byte types, never unicode.
>
> The amendment page says:
>
>   When running under Python 3, applications SHOULD produce bytes
> output and headers
>
> So, the ideal situation is that the application would always produce
> bytes and so it is the application which is supposed to deal with it.
>
> That mod_wsgi falls back to converting any Unicode strings to bytes is
> a fail safe as dictated by:
>
>   When running under Python 3, servers and gateways MUST accept
>   strings as application output or headers, under the existing rules (i.e.,
>   s.encode('latin-1') must convert the string to bytes without an
>   exception)
>
>  and is more to protect lazy programmers, plus make it easier to port
> WSGI applications for Python 2.X.
>
So there's two things here:
1) Maybe I'm misunderstanding some code but I thought mod_wsgi was
decoding bytes going out to the app. If that's not the case and
mod_wsgi is only handing byte strings to the apps then that's fine.
(I note that this interaction isn't specified in the Amendment which
goes along with your general feeling on the problems with the WSGI-
spec writing process.)

2) pje said that accepting unicode str here would make it easier to
port WSGI applications but that's actually not true. In python-2.x,
you are only supposed to pass byte strings (py-2.x str) so there's no
problems. When those str's are converted to unicode str in py3.x, you
have to rewrite your code so you aren't passing non-latin-1
characters. At that point, there's zero incentive to pass a sanitized
unicode string to the wsgi server as you had to go through the byte
type in order to get there (unless you misunderstand the WSGI spec and
think it wants you to send py-3.x str type.)

As for protecting lazy programmers... I'd argue that it's much better
to throw an exception immediately upon receiving a unicode type rather
than waiting until your app starts getting popular and you suddenly
have transient errors due to people occasionally submitting data with
non-latin-1 characters.
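To make the "works for some input, fails for other input" problem concrete, here is a small sketch. encode_header_value() is a hypothetical stand-in for a server's str-to-bytes fallback, not an actual mod_wsgi function:

```python
def encode_header_value(value):
    """Hypothetical stand-in for a server falling back to latin-1."""
    return value.encode('latin-1')

results = {}
for name in ('½ñ.html', '€.html'):
    try:
        # Succeeds only when every character exists in latin-1.
        results[name] = encode_header_value(name)
    except UnicodeEncodeError as exc:
        # The Euro sign has no latin-1 code point, so this path is hit.
        results[name] = exc

for name, outcome in results.items():
    print(name, '->', repr(outcome))
```

The first name goes through silently and the second one raises, which is exactly the kind of data-dependent failure that tends to show up only after deployment.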

> In other words, your application is the one who should be dealing with
> it in the first place if you want to be sure about what is being
> produced.

+100

> It only becomes an issue where the WSGI application hasn't
> done what it really should have done.
>
As long as mod_wsgi is only converting unicode to bytes and not
converting bytes to unicode, this is true.

-Toshio

Toshio Kuratomi

Sep 30, 2008, 3:26:16 AM9/30/08
to modwsgi


On Sep 29, 4:38 pm, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:

> As to this whole discussion, as much as it is interesting there is
> nothing I can do about it. It really needs to be brought up on the
> Python WEB-SIG where I originally raised the issue of Python 3.0
> support for WSGI. I can only implement what consensus comes out of
> discussion on Python WEB-SIG in lieu of them not wanting to come out
> with an official revised specification for WSGI.
>
So I have a couple questions:

Do you agree with or disagree with my analysis that byte type is the
ideal going in and out of WSGI?

Do you agree that pje's argument as to why unicode strings should be
accepted is specious?

If you agree on those, I'll start a new argument on python-web-sig and
see if I can get this changed. There's a high probability that it'll
just end with pje and I disagreeing with each other but I'll try my
hand as long as someone else who's been implementing WSGI servers
thinks that it's the correct approach.

Thanks!
-Toshio

Graham Dumpleton

Sep 30, 2008, 3:51:22 AM9/30/08
to mod...@googlegroups.com
2008/9/30 Toshio Kuratomi <a.ba...@gmail.com>:

I thought I had made it clear enough and that the proposed amendments
were also clear on this.

The wsgi.input stream which contains the request content is 'bytes'.
Thus it is not touched by mod_wsgi. The amendments say:

When running under Python 3, servers MUST make wsgi.input a
binary (byte) stream

The amendments do, though, also say:

When running under Python 3, servers MUST provide CGI HTTP variables
as strings, decoded from the headers using HTTP standard encodings
(i.e. latin-1 + RFC 2047) (Open question: are there any CGI or WSGI
variables that should NOT be strings?)

Thus, mod_wsgi does however convert the CGI variables (ie., translated
HTTP headers) in WSGI environment dictionary, into Unicode strings
using latin-1 encoding.

As I pointed out there were only a few variables in there which were
of concern. Brian has pointed out that request URI has to be ascii
characters but there possibly still is an open question there on how
encoding of non ascii characters works in practice. We just need to do
some actual tests to see what happens and whether there is a problem.

Thus we are possibly down to SCRIPT_FILENAME given that it is
reflecting a file system path. Again, we just need to do some actual
tests to see what happens. Remembering that Apache is going to dictate
in the main how things work.

> 2) pje said that accepting unicode str here would make it easier to
> port WSGI applications but that's actually not true. In python-2.x,
> you are only supposed to pass byte strings (py-2.x str) so there's no
> problems. When those str's are converted to unicode str in py3.x, you
> have to rewrite your code so you aren't passing non-latin-1
> characters. At that point, there's zero incentive to pass a sanitized
> unicode string to the wsgi server as you had to go through the byte
> type in order to get there (unless you misunderstand the WSGI spec and
> think it wants you to send py-3.x str type.)
>
> As for protecting lazy programmers... I'd argue that it's much better
> to throw an exception immediately upon receiving a unicode type rather
> than waiting until your app starts getting popular and you suddenly
> have transient errors due to people occasionally submitting data with
> non-latin-1 characters.

My feeling was that fallback to converting to bytes using latin-1 was
so that simple applications would still work. For example, the hello
world application:

def application(environ, start_response):
    status = '200 OK'
    output = 'Hello World!'

    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]

works in both Python 2.X and 3.0 without change.
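For contrast, a sketch of the same application choosing the encoding of the body itself, in line with the amendment's preference for bytes output. Keeping the header values as str is an assumption made here for brevity, and fake_start_response() is a hypothetical stand-in for a server so the example can be run on its own:

```python
def application(environ, start_response):
    status = '200 OK'
    # The application, not the server, decides how text becomes bytes.
    output = 'Hello World!'.encode('utf-8')

    response_headers = [('Content-type', 'text/plain; charset=utf-8'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)

    return [output]


captured = {}

def fake_start_response(status, headers):
    # Hypothetical server callback that just records what it was given.
    captured['status'] = status
    captured['headers'] = headers

body = application({}, fake_start_response)
print(captured['status'], body)
```

With this shape the latin-1 fallback never comes into play for the response body.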

Larger applications such as Django already internally deal with all
response content as Unicode and convert it to string objects at last
minute. The 2to3 converter would presumably pick that up automatically
and make it produce bytes instead.

Request headers in Django are a bit different and more interesting. At the
moment, it will do things like:

path_info = force_unicode(environ.get('PATH_INFO', u'/'))

where force_unicode is:

def force_unicode(s, encoding='utf-8', strings_only=False,
errors='strict'): ...

Thus, Django was converting Python 2.X string objects to Unicode but
as UTF-8, which technically may not be correct.

In Python 3.0 because this conversion will likely still be applied
when 2to3 conversion done, they may well be converting Unicode string
created as latin-1 to Unicode string as UTF-8, albeit possibly by
going back through bytes type to do it if I read code correctly.

So, the issue there is whether treating them as UTF-8 is right, given
that the amendment suggests CGI variables are supposed to be handled
as latin-1.

Anyway, that is getting a bit off topic.

>> In other words, your application is the one who should be dealing with
>> it in the first place if you want to be sure about what is being
>> produced.
>
> +100
>
>> It only becomes an issue where the WSGI application hasn't
>> done what it really should have done.
>>
> As long as mod_wsgi is only converting unicode to bytes and not
> converting bytes to unicode, this is true.

I have already explained that for CGI variables (translated HTTP
headers) in the WSGI environment dictionary, that mod_wsgi does
convert bytes to Unicode.

Graham

Graham Dumpleton

Sep 30, 2008, 4:02:57 AM9/30/08
to mod...@googlegroups.com
Can we stop with the mod_wsgi should do this or mod_wsgi should do
that. The Apache/mod_wsgi module is just one implementation of the
WSGI specification. You need when talking about this to look at the
bigger picture and what other implementations exist, plus how they all
work and interact with the web server they use.

Take CGI for example. If you are using a CGI-WSGI adapter, the WSGI
environment will come in through os.environ. If you run Python 3.0 and
look at os.environ you will get:

Python 3.0rc1 (r30rc1:66499, Sep 18 2008, 21:39:06)
[GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['PATH']
'/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/ose/bin:/usr/local/bin:/Users/grahamd/bin'
>>> type(os.environ['PATH'])
<class 'str'>

So, os.environ already holds values as Unicode string objects and not
bytes. Thus there is no chance of them being passed to application as
bytes.

How they get to become Unicode strings depend on the platform. For
Windows it uses:

PyUnicode_FromWideChar()

So, input is Unicode to begin with.

On UNIX boxes it uses:

PyUnicode_FromString()

which presumably means it uses default system encoding whatever that might be.

Anyway, already you are stopped from communicating bytes to WSGI
application. One could say that proposed amendments to specification
for Python 3.0 don't even consider this case where conversion already
done for you.

Anyway, I have to leave off for now as have to go home. As I sort of
suggest above, keep in mind that the proposed amendments are trying to
find a compromise that works for many hosting environments. Thus
although you ideally may want bytes everywhere, that may not work in
practice.

Graham


2008/9/30 Toshio Kuratomi <a.ba...@gmail.com>:

Graham Dumpleton

Sep 30, 2008, 6:30:22 AM9/30/08
to mod...@googlegroups.com
The BaseHTTPRequestHandler in http.server of Python 3.0 also only
makes headers available as Unicode (latin-1).

headers = []
while True:
    line = self.rfile.readline()
    headers.append(line)
    if line in (b'\r\n', b'\n', b''):
        break
hfile = io.StringIO(b''.join(headers).decode('iso-8859-1'))
self.headers = email.parser.Parser(
    _class=self.MessageClass).parse(hfile)

Thus, any WSGI server based on that would have no chance of getting
access to headers in byte form.

Graham

2008/9/30 Graham Dumpleton <graham.d...@gmail.com>:

Clodoaldo Pinto Neto

Sep 30, 2008, 7:32:58 AM9/30/08
to mod...@googlegroups.com
2008/9/30 Toshio Kuratomi <a.ba...@gmail.com>:

>
>
>
> On Sep 29, 3:24 pm, "Brian Smith" <br...@briansmith.org> wrote:
>> Toshio Kuratomi wrote:
>> > "Graham Dumpleton" wrote:
>> > > As to the HTTP request headers, the RFCs say they are effectively
>> > > latin-1. Thus, all HTTP_? variables in WSGI environ can only be
>> > > processed as latin-1 when converting to Unicode.
>>
>> > Converting these headers to unicode will lead to mangled data
>> > at times. Let's say that some web app needs to keep track of
>> > the referer information for some reason. If the app is
>> > referred to from http://localhost/€.html ("Euro symbol".html
>> > ) and it is encoded as
>> > utf-8 on the server then the server will send a header with
>> > this sequence of bytes::
>>
>> > Referer http://localhost/%e2%82%ac.html
>>
>> > If mod_wsgi assumes latin-1 and converts that into unicode
>> > before it hits the app, the app will see this::
>>
>> > Referer http://localhost/â%82¬.html
>>
>> No, it will leave it as http://localhost/%e2%82%ac.html. It does (or should do) the Latin-1-to-Unicode conversion before it decodes URL encoding.
>>
> uhm... you're wrong here. url encoding and decoding operates on
> bytes. unicode is not bytes. so you can't go from byte string to
> unicode and then pass it through url decode. Or I suppose you can,
> but it isn't by any means the opposite of what you did to get the url
> escaped bytes so it's pretty senseless.

I tested that url with Firefox and Opera in Linux utf-8 and what
happens is that Firefox does what Brian says. But Firefox on
Windows XP replaces € with %80, while IE6 changes € to %e2%82%ac.
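One possible explanation for the %80, which is an assumption on my part rather than something verified against the browser, is that the Windows build escaped the character using the Windows-1252 code page instead of UTF-8. The two results can be reproduced like this:

```python
from urllib.parse import quote

# Escaping the Euro sign via its UTF-8 bytes (three bytes).
utf8_form = quote('€', encoding='utf-8')

# Escaping the Euro sign via its Windows-1252 byte (a single byte, 0x80).
cp1252_form = quote('€', encoding='cp1252')

print(utf8_form)
print(cp1252_form)
```

Either way the server just sees %-escaped bytes and has no way to tell which encoding the client picked.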

Brian Smith

Sep 30, 2008, 10:13:10 AM9/30/08
to mod...@googlegroups.com
Toshio Kuratomi wrote:
> On Sep 29, 3:24 pm, "Brian Smith" <br...@briansmith.org> wrote:
> > Toshio Kuratomi wrote:
> > > > If mod_wsgi assumes latin-1 and converts that into unicode
> > > > before it hits the app, the app will see this::
> > >
> > > > Referer http://localhost/â%82¬.html
> > >
> > > No, it will leave it as http://localhost/%e2%82%ac.html. It
> > > does (or should do) the Latin-1-to-Unicode conversion before
> > it decodes URL encoding.
> >
> uhm... you're wrong here. url encoding and decoding operates
> on bytes. unicode is not bytes. so you can't go from byte
> string to unicode and then pass it through url decode.

Original string in Latin-1: http://localhost/%e2%82%ac.html
Latin-1 to Unicode: http://localhost/%e2%82%ac.html

Since the original Latin-1 string did not contain any non-Latin characters, no codepoint conversions are performed.
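A quick way to check this, sketched with the header value from the example (the variable names are only for illustration): because the escaped URI consists of ASCII bytes only, decoding it as ASCII, Latin-1, or UTF-8 yields the same string.

```python
# The Referer value as it appears on the wire: ASCII bytes only.
wire_bytes = b'http://localhost/%e2%82%ac.html'

as_ascii = wire_bytes.decode('ascii')
as_latin1 = wire_bytes.decode('latin-1')
as_utf8 = wire_bytes.decode('utf-8')

# All three decodings agree, so the %XX escapes survive untouched.
print(as_ascii == as_latin1 == as_utf8)
print(as_latin1)
```

The choice of encoding only starts to matter once the %XX sequences are unescaped.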

> Or I suppose you can, but it isn't by any means the opposite of
> what you did to get the url escaped bytes so it's pretty senseless.

I made a mistake about the *encoding* (not decoding) order in my previous email. I will correct it below.

> > Again, I think you are misunderstanding the interaction
> > between URL encoding and character encoding conversion.
> > Mod_wsgi will (should) never do or undo URL-encoding itself
> > for non-ASCII (%80-%FF) sequences.

> I think that you are misunderstanding the interaction. And I
> think that % sequences should definitely be done by mod_wsgi.

> Ending up with a unicode string containing %encoded
> sequences is even worse than the other scenarios I described
> as the application then has to convert from unicode to byte
> string, unquote the url quoting, and then convert back to
> unicode.

mod_wsgi cannot decode all the % sequences in headers because it doesn't know which headers contain URIs and which ones don't; many headers can contain % sequences that don't mean the same thing they mean in URIs. Plus, sometimes (many times) the application needs the encoded URI instead of the IRI form. If you are talking about things like PATH_INFO, SCRIPT_NAME, and REQUEST_URI, doing URI->IRI conversion on them will break applications like mine that already do their own URI->IRI conversion. I should test to see what WSGI gateways actually do there.

> It would be much better for mod_wsgi to do the url quoting
> for the user as converting between bytes and %escape
> sequences is 100% automatable. This is unlike converting
> between unicode and a sequence of bytes where something has
> to decide what the character encoding is. So -- WSGI should
> take care of %encoding because that's a job for a computer
> anyway. WSGI should not take care of the byte => unicode
> conversion because it doesn't know what encoding the bytes are in.

mod_wsgi already mangles the URI components too much in SCRIPT_NAME and PATH_INFO (in its defense, it does so because CGI/WSGI require it to for the most part, except for "//" munging). That is why I fall back to parsing REQUEST_URI myself.


> > > Now let's look at the reverse case: Let's say that the application
> > > wants to redirect the user to €.html (Euro symbol.html). For that,
> > > they have to enter this into the location header::
> > > real_url = '€.html'
> > > byte_sequence = real_url.encode('utf-8')
> > > marshalled_form = str(byte_sequence, 'latin-1')
> > > headers = [('location', marshalled_form)]
> >
> > No, they have to URL-encode marshalled_form into ASCII first,
> > because the Location header holds a URI, and URIs are always
> > ASCII-only.
> >
> Well... between marshalled_form and HTTP HEADER, there needs
> to be a url escaping sequence, but whether that needs to
> happen outside of mod_wsgi or inside is part of what you and
> I are debating. You do see from your example above why your
> initial sequence for decoding at the top of the post is
> wrong, though? Your decoding sequence at the top placed the
> ASCII escaping between byte_sequence and real_url instead of
> between marshalled_form and headers.

Right, I made two mistakes here. First, it doesn't make sense to URL-encode the string AFTER converting it to Latin-1. Instead, you need to URL-encode the string BEFORE converting it to Latin-1. Then, the string will only have ASCII characters. Secondly, you can encode/decode it using whatever encodings you please before you URL-encode it, because the URI and IRI specifications do not require every %XX sequence to decode to a valid UTF-8 sequence. mod_wsgi's own view of the filesystem encoding doesn't matter in this case.
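
As a minimal sketch of the corrected order for the redirect example quoted above (Python 3 syntax; the variable names follow the quoted snippet, and `ascii_uri` is a name introduced here purely for illustration): percent-encode the UTF-8 bytes first, so that the Latin-1 step only ever sees ASCII.

```python
from urllib.parse import quote

real_url = '€.html'
byte_sequence = real_url.encode('utf-8')

# URL-encode BEFORE any Latin-1 marshalling: the result is pure ASCII.
ascii_uri = quote(byte_sequence)

# ASCII is a subset of Latin-1, so this step cannot change any character.
marshalled_form = ascii_uri.encode('ascii').decode('latin-1')
headers = [('Location', marshalled_form)]
```

The choice of UTF-8 here is the application's; any other encoding could be used before the quoting step, as noted above.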

Regards,
Brian


Graham Dumpleton

Sep 30, 2008, 7:19:38 PM9/30/08
to mod...@googlegroups.com
2008/10/1 Brian Smith <br...@briansmith.org>:

> mod_wsgi already mangles the URI components too much in SCRIPT_NAME and PATH_INFO (in its defense, it does so because CGI/WSGI require it to for the most part, except for "//" munging). That is why I fall back to parsing REQUEST_URI myself.

In my defence I do the leading duplicate slash removal in SCRIPT_NAME
because otherwise different major versions of Apache would behave
differently. Any duplicate slashes otherwise within the path of
SCRIPT_NAME and PATH_INFO are from memory eliminated by Apache itself
and not by mod_wsgi.

Graham

Brian Smith

Sep 30, 2008, 10:10:57 PM9/30/08
to mod...@googlegroups.com

I understand that you do that for compatibility reasons. I didn't realize
that Apache does it too. Is that done in the CGI environment building
functions or in the core of Apache?

Thanks,
Brian

Toshio Kuratomi

Oct 1, 2008, 2:23:33 AM10/1/08
to modwsgi
Okay, this sequence would work:
(app) unicode => byte string (app specifies encoding) => url encoded
byte string => (encode to unicode -- optional because wsgi says either
bytes or unicode with preference for bytes) or (unicode and then
latin-1 -- optional because url encoding should take care of non-ASCII
chars and latin-1 is a superset of ASCII) => wsgi server.

Now does this mean that the right thing for the wsgi server/gateway to
hand to the application is the urlencoded string? ie:
'http://localhost/%e2%82%ac.html'? And that value should propagate to
SCRIPT_NAME and PATH_INFO? And that should be true whether you're
dealing with python-2.x or python-3.x? I'm fine if that's the case
because it's still a sequence of bytes... just url encoded and stored
in a unicode string (on py3.x).

-Toshio

Toshio Kuratomi

Oct 1, 2008, 2:26:26 AM10/1/08
to modwsgi


On Sep 30, 4:32 am, "Clodoaldo Pinto Neto" <clodoaldo.pi...@gmail.com>
wrote:
>
> I tested that url with Firefox and Opera in Linux utf-8 and what
> happens is that Firefox does what Brian says. But testing Firefox in
> Windows XP it substitutes € for %80 and IE6 changes € to %e2%82%ac.
>
You have to look at what's going on on the server, I'm afraid, because
the various clients you use are going to perform various
transformations that may or may not have anything to do with what's
being sent between the wsgi server and the wsgi app.

-Toshio

Graham Dumpleton

Oct 1, 2008, 2:45:49 AM10/1/08
to mod...@googlegroups.com
2008/10/1 Toshio Kuratomi <a.ba...@gmail.com>:

Can someone clearly just tell me what you want me to test.

For Python 3.0, if I use a URL:

/wsgi/scripts/echo3000.py/%E2%82%AC.html

in Safari, where:

/wsgi/scripts/echo3000.py

just echoes back WSGI environment, the following happens.

1. Once submit request Safari changes that symbol in URL bar to a Euro symbol.

2. In Apache access logs I get:

::1 - - [01/Oct/2008:16:34:57 +1000] "GET
/wsgi/scripts/echo3000.py/%E2%82%AC.html HTTP/1.1" 200 1858

3. In response to browser, relevant values from WSGI environment are:

PATH_INFO: '/â\x82¬.html'
PATH_TRANSLATED: '/usr/local/apache-2.2.4/htdocs/â\x82¬.html'
QUERY_STRING: ''
REQUEST_METHOD: 'GET'
REQUEST_URI: '/wsgi/scripts/echo3000.py/%E2%82%AC.html'
SCRIPT_FILENAME: '/usr/local/wsgi/scripts/echo3000.py'
SCRIPT_NAME: '/wsgi/scripts/echo3000.py'

Remember that this is what Apache passes and all mod_wsgi is doing is
converting them to Unicode string as latin-1.
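
To illustrate how that PATH_INFO value comes about, here is a small sketch (assuming the path bytes are UTF-8, as in this request; the variable names are made up for illustration and are not mod_wsgi internals):

```python
# Bytes for the path once the %XX escapes have been undone.
raw_path = b'/\xe2\x82\xac.html'

# Decoding as Latin-1 maps each byte to one code point, giving the value shown.
path_info = raw_path.decode('latin-1')

# An application that expects UTF-8 can reverse the step and decode again.
recovered = path_info.encode('latin-1').decode('utf-8')
```

The Latin-1 step never fails and never loses information, which is why the application can always get the original bytes back.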

For Python 2.3 get:

1. Safari does same obviously.

2. In Apache access logs, also same:

::1 - - [01/Oct/2008:16:41:29 +1000] "GET
/wsgi/scripts/echo.py/%E2%82%AC.html HTTP/1.1" 200 7118

3. In echoed response get:

PATH_INFO: '/\xe2\x82\xac.html'
PATH_TRANSLATED: '/usr/local/apache-2.2.4/htdocs/\xe2\x82\xac.html'
QUERY_STRING: ''
REQUEST_METHOD: 'GET'
REQUEST_URI: '/wsgi/scripts/echo.py/%E2%82%AC.html'
SCRIPT_FILENAME: '/usr/local/wsgi/scripts/echo.py'
SCRIPT_NAME: '/wsgi/scripts/echo.py'

The difference here obviously being that in Python 2.3 they aren't
Unicode strings but Python 2.X byte strings (ie. conventional
string).

I'll try and update my script to put a link onto itself so can check
what referrer says on click through.

Graham

Graham Dumpleton

Oct 1, 2008, 2:53:52 AM10/1/08
to mod...@googlegroups.com
2008/10/1 Graham Dumpleton <graham.d...@gmail.com>:

Better still, someone construct a small WSGI application etc which
does what you want to test various cases and I'll run it under Python
2.3 and 3.0. If script is for Python 2.X, I can convert it to Python
3.0 if need be.
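
Something along these lines might do as a starting point. It is only a sketch, written for Python 3; the list of keys is an arbitrary choice, and it is not the echo script used for the output above.

```python
# Path-related variables worth comparing across servers and Python versions.
KEYS = ('PATH_INFO', 'QUERY_STRING', 'REQUEST_URI', 'SCRIPT_NAME')

def application(environ, start_response):
    # repr() makes the string type and any escapes visible in the output.
    lines = ['%s: %r' % (key, environ.get(key)) for key in KEYS]
    body = '\n'.join(lines).encode('utf-8')
    start_response('200 OK', [
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ])
    return [body]
```

Keys a server does not provide show up as None, which makes a missing REQUEST_URI easy to spot.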

Graham

Toshio Kuratomi

Oct 1, 2008, 3:21:09 AM10/1/08
to modwsgi
On Sep 30, 1:02 am, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:
> Can we stop with the mod_wsgi should do this or mod_wsgi should do
> that. The Apache/mod_wsgi module is just one implementation of the
> WSGI specification. You need when talking about this to look at the
> bigger picture and what other implementations exist, plus how they all
> work and interact with the web server they use.
>
Sorry. I mean what you're saying, I really do. I'll try to be more
exact in my nouns from now on.

> Take CGI for example. If you are using a CGI-WSGI adapter, the WSGI
> environment will come in through os.environ. If you run Python 3.0 and
> look at os.environ you will get:
>
> Python 3.0rc1 (r30rc1:66499, Sep 18 2008, 21:39:06)
> [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.environ['PATH']
> '/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/ose/bin:/usr/local/bin:/Users/grahamd/bin'
> >>> type(os.environ['PATH'])
> <class 'str'>
>
> So, os.environ already holds values as Unicode string objects and not
> bytes. Thus there is no chance of them being passed to application as
> bytes.
>
> How they get to become Unicode strings depends on the platform. For
> Windows it uses:
>
>   PyUnicode_FromWideChar()
>
> So, input is Unicode to begin with.
>
> On UNIX boxes it uses:
>
>   PyUnicode_FromString()
>
> which presumably means it uses default system encoding whatever that might be.
>
This is actually broken :-( I just tried adding a non-utf-8 directory
to my PATH (perfectly legal even if Brian and I agree that it's not
the sanest thing to do :-) ) and then looking for it in os.environ. The
result is a silent failure inside of python... There were no
tracebacks, warnings, or errors but os.environ['PATH'] gave a
KeyError... putting the non-utf8 directory in there made python not
load PATH from the environment.

The correct thing to do is debatable. os.listdir() handles it a bit
better:
>>> os.listdir('.')
[b'\xbd\xf1', '½ñ']

Not great since you then have both byte strings and unicode strings in
your returned list but better than silently denying its presence.
There's a bug open on os.listdir(). I think it's working to have
os.listdir() return all bytes in some cases.
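
For what it's worth, here is a sketch of how undecodable names can be carried without loss. It assumes a later Python 3 release that has the 'surrogateescape' error handler, which is not in the 3.0 release candidate discussed here; the variable names are made up for illustration.

```python
import os

raw_name = b'\xbd\xf1'  # the directory name from above; not valid UTF-8

# Undecodable bytes become lone surrogates instead of raising or vanishing.
text_name = raw_name.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode('utf-8', 'surrogateescape')

# Passing a bytes path asks os.listdir() for bytes entries only.
entries = os.listdir(b'.')
```

Either approach avoids the mixed list of bytes and unicode shown above.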

-Toshio

Toshio Kuratomi

Oct 1, 2008, 3:41:16 AM10/1/08
to modwsgi

Thanks Graham!
On Sep 30, 11:45 pm, "Graham Dumpleton" <graham.dumple...@gmail.com>
wrote:
> 2008/10/1 Toshio Kuratomi <a.bad...@gmail.com>:
>
>
> Can someone clearly just tell me what you want me to test.
>
> For Python 3.0, if I use a URL:
>
>   /wsgi/scripts/echo3000.py/%E2%82%AC.html
>
> in Safari, where:
>
>   /wsgi/scripts/echo3000.py
>
> just echoes back WSGI environment, the following happens.
>
> 1. Once submit request Safari changes that symbol in URL bar to a Euro symbol.
>
> 2. In Apache access logs I get:
>
> ::1 - - [01/Oct/2008:16:34:57 +1000] "GET
> /wsgi/scripts/echo3000.py/%E2%82%AC.html HTTP/1.1" 200 1858
>
> 3. In response to browser, relevant values from WSGI environment are:
>
> PATH_INFO: '/â\x82¬.html'

This is what Brian and I need to see. I think that he and I both
think this is incorrect. However, after Brian's last message I'm
unsure of whether it should be '%E2%82%AC.html' or
b'\xe2\x82\xac.html'. I'm pretty sure if it's the latter I need to go
to the python-web-sig and try to get the wsgi spec changed. If it's
the former, I don't know if it's a problem with the spec or just how
it's being interpreted. Although, if paste's httpserver also gives
this value, then it's something that should be clarified at the wsgi
spec level as there's likely a lot of wsgi servers doing it that way.

> PATH_TRANSLATED: '/usr/local/apache-2.2.4/htdocs/â\x82¬.html'
> QUERY_STRING: ''
> REQUEST_METHOD: 'GET'
> REQUEST_URI: '/wsgi/scripts/echo3000.py/%E2%82%AC.html'
> SCRIPT_FILENAME: '/usr/local/wsgi/scripts/echo3000.py'
> SCRIPT_NAME: '/wsgi/scripts/echo3000.py'
>
> Remember that this is what Apache passes and all mod_wsgi is doing is
> converting them to Unicode string as latin-1.
>
> For Python 2.3 get:
>
> 1. Safari does same obviously.
>
> 2. In Apache access logs, also same:
>
> ::1 - - [01/Oct/2008:16:41:29 +1000] "GET
> /wsgi/scripts/echo.py/%E2%82%AC.html HTTP/1.1" 200 7118
>
> 3. In echoed response get:
>
> PATH_INFO: '/\xe2\x82\xac.html'

I think this value is fine. I think Brian considers this value to be
wrong and that '%e2%82%ac.html' is the correct output.

Brian, could you comment on these values and then maybe we can
approach the python-web-sig together?

> PATH_TRANSLATED: '/usr/local/apache-2.2.4/htdocs/\xe2\x82\xac.html'
> QUERY_STRING: ''
> REQUEST_METHOD: 'GET'
> REQUEST_URI: '/wsgi/scripts/echo.py/%E2%82%AC.html'
> SCRIPT_FILENAME: '/usr/local/wsgi/scripts/echo.py'
> SCRIPT_NAME: '/wsgi/scripts/echo.py'
>
> The difference here obviously being that in Python 2.3 they aren't
> Unicode strings but Python 2.X byte strings (ie. conventional
> string).
>
> I'll try and update my script to put a link onto itself so can check
> what referrer says on click though.
>
> Graham

Thanks Graham!

-Toshio

Graham Dumpleton

Oct 1, 2008, 3:47:35 AM10/1/08
to mod...@googlegroups.com
2008/10/1 Toshio Kuratomi <a.ba...@gmail.com>:

Before going off to Python web-sig, I'd like to get the WSGI example
program that demonstrates the issues done first. I would also like to do
the equivalent for normal CGI and show how CGI would work for Python 2.X
and 3.0 on Apache as well. That way one is taking mod_wsgi out of the
picture and possibly showing that it is in part the way that Apache
sets up data that is the issue, although how os.environ works in Python 3.0
may also be an issue for CGI. The only other Python 3.0 WSGI server I know
of is wsgiref in Python 3.0, so should see how that works as well. For
Python 2.X, should also try CherryPy WSGI server and Paste server.

Graham

Toshio Kuratomi

Oct 2, 2008, 9:38:26 PM10/2/08
to modwsgi


On Oct 1, 12:41 am, Toshio Kuratomi <a.bad...@gmail.com> wrote:
> Thanks Graham!
> On Sep 30, 11:45 pm, "Graham Dumpleton" <graham.dumple...@gmail.com>
> wrote:
> > For Python 3.0, if I use a URL:
>
> >   /wsgi/scripts/echo3000.py/%E2%82%AC.html
>
> > in Safari, where:
>
> >   /wsgi/scripts/echo3000.py
>
> > just echoes back WSGI environment, the following happens.
>
> > 1. Once submit request Safari changes that symbol in URL bar to a Euro symbol.
>
> > 2. In Apache access logs I get:
>
> > ::1 - - [01/Oct/2008:16:34:57 +1000] "GET
> > /wsgi/scripts/echo3000.py/%E2%82%AC.html HTTP/1.1" 200 1858
>
> > 3. In response to browser, relevant values from WSGI environment are:
>
> > PATH_INFO: '/â\x82¬.html'
>
> This is what Brian and I need to see.  I think that he and I both
> think this is incorrect.  However, after Brian's last message I'm
> unsure of whether it should be  '%E2%82%AC.html' or
> b'\xe2\x82\xac.html'.  I'm pretty sure if it's the latter I need to go
> to the python-web-sig and try to get the wsgi spec changed.  If it's
> the former, I don't know if it's a problem with the spec or just how
> it's being interpreted.  Although, if paste's httpserver also gives
> this value, then it's something that should be clarified at the wsgi
> spec level as there's likely a lot of wsgi servers doing it that way.
>
>
Brian, ping. Could you comment on this?

-Toshio

Brian Smith

Oct 3, 2008, 9:29:47 AM10/3/08
to mod...@googlegroups.com
Toshio Kuratomi wrote:

> > > PATH_INFO: '/â\x82¬.html'
> >
> > This is what Brian and I need to see.  I think that he and I both
> > think this is incorrect.  However, after Brian's last message I'm
> > unsure of whether it should be  '%E2%82%AC.html' or
> > b'\xe2\x82\xac.html'.  I'm pretty sure if it's the latter I
> > need to go to the python-web-sig and try to get the wsgi spec
> > changed. If it's the former, I don't know if it's a problem with
> > the spec or just how it's being interpreted.  Although, if
> > paste's httpserver also gives this value, then it's something
> > that should be clarified at the wsgi spec level as there's likely
> > a lot of wsgi servers doing it that way.
> >
> Brian, ping. Could you comment on this?

Originally, I was going to say that you can find out what should happen by
running the test program in the reference implementation given in PEP 333,
running under Apache as CGI. That, combined with the URL reconstruction
algorithm given in PEP 333 should be enough information to specify the
behavior.

Unfortunately, the URL reconstruction algorithm is based on the standard
quote() function, and that function's semantics have changed substantially
for Python 3.0. In particular, the quote() and unquote() functions in Python
3 assume that you want URI-IRI conversion semantics by default. Not all URIs
are derived from IRIs so that assumption does not work in general. It looks
like a big mess. I'm starting to agree with you that these path-related
variables need to be byte strings. Then at least everything is clear.
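
A small sketch of the difference, assuming the `urllib.parse` functions as they exist in current Python 3 releases (the behaviour in the 3.0 release candidates may differ):

```python
from urllib.parse import unquote, unquote_to_bytes

# %E2%82%AC is the UTF-8 encoding of the Euro sign, so text decoding works.
as_text = unquote('/%E2%82%AC.html')

# %80 alone is not valid UTF-8; by default the byte is replaced with U+FFFD.
lossy = unquote('/%80.html')

# Decoding to bytes keeps the original octet, whatever encoding was meant.
as_bytes = unquote_to_bytes('/%80.html')
```

The second case is the one that matters for URIs that were never derived from an IRI: the text-oriented default silently discards the original byte.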

In my application I have avoided this issue, and other issues, by using
REQUEST_URI when it is available. In fact, my application really only works
100% correctly when REQUEST_URI is available. Probably the biggest
improvement to WSGI, for applications that are sensitive to these issues,
would be to require REQUEST_URI for all gateways.
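
Roughly, the approach looks like the sketch below. This is Python 3 and is not my actual code; `raw_path` is a hypothetical helper name, and the fallback assumes the gateway marshals path bytes into strings as Latin-1, as mod_wsgi does in the output earlier in this thread.

```python
from urllib.parse import quote, urlsplit

def raw_path(environ):
    """Return the still-percent-encoded request path."""
    request_uri = environ.get('REQUEST_URI')
    if request_uri is not None:
        # Preferred: the gateway passed the request target through untouched.
        # Note: a target starting with '//' would need special handling here.
        return urlsplit(request_uri).path
    # Fallback: rebuild from the decoded variables and re-quote the bytes.
    path = environ.get('SCRIPT_NAME', '') + environ.get('PATH_INFO', '')
    return quote(path.encode('latin-1'))
```

The fallback is lossy by nature: once the gateway has decoded the path, there is no way to tell an original %2F from a literal slash, which is why REQUEST_URI is preferred when present.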

Regards,
Brian

Toshio Kuratomi

Oct 8, 2008, 10:01:44 AM10/8/08
to modwsgi
Here's a simple paste app that displays the environment. Just start it
and then hit it with any URL. I used: http://localhost:8080/ñ/ó?q=©

In python-2.x:
PATH_INFO is a byte string
QUERY_STRING becomes a urlencoded byte string
REQUEST_URI is not present

This is the same behaviour as mod_wsgi on python-2.x.

We really need to see how other wsgi implementations treat PATH_INFO on
python-3.x.

-Toshio

signature.asc
serve-branches