[Web-SIG] WSGI, Python 3 and Unicode

Graham Dumpleton

Dec 6, 2007, 6:13:39 PM
to Web SIG
Has anyone had any thoughts about how WSGI is going to be made to work
with Python 3?

From what I understand about changes in Python 3, the main issue seems
to be the removal of the string type in its current form.

This is an issue because the WSGI specification currently states that
status, header names/values and the items returned by the iterable must
all be string instances. This is done to ensure that the application has
done any conversions from Unicode, where knowledge about the encoding
would be known, before the data is passed to the WSGI adapter.

In Python 3 the default for string type objects will effectively be
Unicode. Is WSGI going to be made to somehow cope with that, or will
applications be required to return byte string objects instead?

We can never seem to get enough momentum going for WSGI 2.0, but with
Python 3 coming along we may not have a choice but to come up with a
revised version of the specification if we want WSGI to continue through
to Python 3.

Comments?

Graham
_______________________________________________
Web-SIG mailing list
Web...@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/python-web-sig-garchive-9074%40googlegroups.com

Phillip J. Eby

Dec 6, 2007, 7:15:26 PM
to Graham Dumpleton, Web SIG
At 10:13 AM 12/7/2007 +1100, Graham Dumpleton wrote:
>Has anyone had any thoughts about how WSGI is going to made to work
>with Python 3?
>
>From what I understand about changes in Python 3, the main issue seems
>to be the removal of string type in its current form.
>
>This is an issue as WSGI specification currently states that status,
>header names/values and the items returned by the iterable must all be
>string instances. This is done to ensure that the application has done
>any conversions from Unicode, where knowledge about encoding would be
>known, before being passed to WSGI adapter.
>
>In Python 3 the default for string type objects will effectively be
>Unicode. Is WSGI going to be made to somehow cope with that, or will
>application instead be required to return byte string objects instead?

WSGI already copes, actually. Note that Jython and IronPython have
this issue today, and see:

http://www.python.org/dev/peps/pep-0333/#unicode-issues

"""On Python platforms where the str or StringType type is in fact
Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all
"strings" referred to in this specification must contain only code
points representable in ISO-8859-1 encoding (\u0000 through \u00FF,
inclusive). It is a fatal error for an application to supply strings
containing any other Unicode character or code point. Similarly,
servers and gateways must not supply strings to an application
containing any other Unicode characters."""

Guido van Rossum

Dec 6, 2007, 7:27:41 PM
to Phillip J. Eby, Web SIG, Graham Dumpleton

That may work for IronPython/Jython, where encoded data is represented
by the str type, but it won't be sufficient for Py3k, where encoded
data is represented using the bytes type. IOW, in IronPython/Jython,
u"\u1234".encode('utf-8') returns a str instance: '\xe1\x88\xb4'; but
in Py3k, it returns a bytes instance: b'\xe1\x88\xb4'.
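
A minimal sketch of that difference, assuming a Python 3 interpreter (the variable names are illustrative only, not part of any WSGI API):

```python
# Python 3: encoding a str always produces a separate bytes object.
text = "\u1234"
encoded = text.encode("utf-8")

print(type(encoded).__name__)  # bytes
print(encoded)                 # b'\xe1\x88\xb4'

# A bytes object is not a str, so it cannot satisfy an API that
# insists on "string instances" in the Python 3 sense.
is_native_str = isinstance(encoded, str)
print(is_native_str)           # False
```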

The issue applies to input as well as output -- data read from a
socket is also represented as bytes, unless you're using makefile()
with a text mode and an encoding.

You might want to look at how the unittests for wsgiref manage to pass
in Py3k though. ;-)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

James Y Knight

Dec 6, 2007, 7:23:40 PM
to Phillip J. Eby, Web SIG

On Dec 6, 2007, at 7:15 PM, Phillip J. Eby wrote:
> WSGI already copes, actually. Note that Jython and IronPython have
> this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues
>
> """On Python platforms where the str or StringType type is in fact
> Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all
> "strings" referred to in this specification must contain only code
> points representable in ISO-8859-1 encoding (\u0000 through \u00FF,
> inclusive). It is a fatal error for an application to supply strings
> containing any other Unicode character or code point. Similarly,
> servers and gateways must not supply strings to an application
> containing any other Unicode characters."""

It would seem very odd, however, for WSGI/python3 to use strings-
restricted-to-0xFF for network I/O while everywhere else in python3 is
going to use bytes for the same purpose. You'd have to modify your app
to call write(unicodetext.encode('utf-8').decode('latin-1')) or so....
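
To make that workaround concrete, here is a rough sketch under Python 3. The helper name to_native_string is hypothetical and is not part of WSGI or wsgiref; it only shows how UTF-8 bytes can be smuggled through a str restricted to code points 0x00-0xFF:

```python
def to_native_string(text, charset="utf-8"):
    """Hypothetical helper: encode text to bytes, then map each byte
    to the code point of the same value so the result is a str that
    contains only characters in the range U+0000..U+00FF."""
    return text.encode(charset).decode("latin-1")


body = to_native_string("IFN-\u03b3")

# A server following the PEP 333 rule would recover the original
# bytes by encoding the string as ISO-8859-1.
wire = body.encode("latin-1")
print(wire == "IFN-\u03b3".encode("utf-8"))  # True
```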

James

Adam Atlas

Dec 6, 2007, 8:08:04 PM
to Web SIG

On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> In Python 3 the default for string type objects will effectively be
> Unicode. Is WSGI going to be made to somehow cope with that, or will
> application instead be required to return byte string objects instead?

I'd say it would be best to only accept `bytes` objects; anything else
would require some guesswork. Maybe, at most, it could try to encode
returned Unicode objects as ISO-8859-1, and have it be an error if
that's not possible.

I was going to say that the gateway could accept Unicode objects if
the user-agent sent a comprehensible Accept-Charset header, and
thereby encode application output to the client's preferred character
set on the fly (or to ISO-8859-1 if no Accept-Charset is provided),
but that would complicate things for people writing gateways (and
would be too implicit). It could be useful, but it would make more
sense as a simple decorator for (almost-)WSGI applications. Perhaps it
could go in wsgiref.

Phillip J. Eby

Dec 6, 2007, 8:45:47 PM
to Guido van Rossum, Web SIG, Graham Dumpleton
At 04:27 PM 12/6/2007 -0800, Guido van Rossum wrote:
>You might want to look at how the unittests for wsgiref manage to pass
>in Py3k though. ;-)

Unless they've been changed, I'd assume it's because they work with
strings exclusively, and never do any encoding or decoding (which is
outside WSGI's scope, at least in the current version).

Phillip J. Eby

Dec 6, 2007, 8:48:55 PM
to Adam Atlas, Web SIG
At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:

>On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
> > In Python 3 the default for string type objects will effectively be
> > Unicode. Is WSGI going to be made to somehow cope with that, or will
> > application instead be required to return byte string objects instead?
>
>I'd say it would be best to only accept `bytes` objects; anything else
>would require some guesswork. Maybe, at most, it could try to encode
>returned Unicode objects as ISO-8859-1, and have it be an error if
>that's not possible.

Actually, I'd prefer to look at it the other way around: a Python 3
WSGI server or middleware *may* accept bytes objects instead of str.

This is relatively easy for the response side of things, but the
request side is rather more difficult, since wsgi.input may need to
be binary rather than text mode. (I think we can reasonably assume
that wsgi.errors is a text mode stream, and should support a
reasonable encoding.)

James Bennett

Dec 6, 2007, 9:23:23 PM
to Web SIG
On Dec 6, 2007 6:15 PM, Phillip J. Eby <p...@telecommunity.com> wrote:
> WSGI already copes, actually. Note that Jython and IronPython have
> this issue today, and see:
>
> http://www.python.org/dev/peps/pep-0333/#unicode-issues

I'm glad you brought that up, because it's been bugging me lately.

That section is somewhat ambiguous as-is, because in one sentence
applications are permitted to return strings encoded in a charset
other than ISO-8859-1, but in another they are unequivocally forbidden
to do so (with the "must not" in bold, even). And that's problematic
not only because of the ambiguity, but because the increasing
popularity of "AJAX" and web-based APIs is making it much more common
for WSGI applications to generate responses of types which do not
default to ISO-8859-1 -- e.g., XML and JSON, both of which default to
UTF-8.

Depending on how draconian one wishes to be when reading the relevant
section of WSGI, it's possible to conclude that XML and JSON must
always be transcoded/escaped to ISO-8859-1 -- with all the headaches
that entails -- before being passed to a WSGI-compliant piece of
software.

And the slightly less strict reading of the spec -- that such
gymnastics are required only when the string type of the Python
implementation is Unicode-based -- will grow increasingly troublesome
as/when Py3K enters production use.

So as long as we're talking about this, could the proscriptions with
respect to encoding perhaps be revisited and (hopefully)
clarified/revised?

--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."

Guido van Rossum

Dec 6, 2007, 10:51:37 PM
to Phillip J. Eby, Web SIG, Graham Dumpleton
On Dec 6, 2007 5:45 PM, Phillip J. Eby <p...@telecommunity.com> wrote:
> At 04:27 PM 12/6/2007 -0800, Guido van Rossum wrote:
> >You might want to look at how the unittests for wsgiref manage to pass
> >in Py3k though. ;-)
>
> Unless they've been changed, I'd assume it's because they work with
> strings exclusively, and never do any encoding or decoding (which is
> outside WSGI's scope, at least in the current version).

Indeed, that seems mostly to be the case. But this means that any
application that wants to emit characters outside Latin-1 cannot just
encode() those characters, since the encode() output will be bytes and
those will not be accepted by the WSGI API. OTOH sending non-Latin-1
characters without encoding would violate the standard. So something
needs to give...

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

Ian Bicking

Dec 6, 2007, 11:00:02 PM
to Phillip J. Eby, Web SIG
Phillip J. Eby wrote:
> At 08:08 PM 12/6/2007 -0500, Adam Atlas wrote:
>
>> On 6 Dec 2007, at 18:13, Graham Dumpleton wrote:
>>> In Python 3 the default for string type objects will effectively be
>>> Unicode. Is WSGI going to be made to somehow cope with that, or will
>>> application instead be required to return byte string objects instead?
>> I'd say it would be best to only accept `bytes` objects; anything else
>> would require some guesswork. Maybe, at most, it could try to encode
>> returned Unicode objects as ISO-8859-1, and have it be an error if
>> that's not possible.
>
> Actually, I'd prefer to look at it the other way around: a Python 3
> WSGI server or middleware *may* accept bytes objects instead of str.
>
> This is relatively easy for the response side of things, but the
> request side is rather more difficult, since wsgi.input may need to
> be binary rather than text mode. (I think we can reasonably assume
> that wsgi.errors is a text mode stream, and should support a
> reasonable encoding.)

wsgi.input definitely seems like it should be bytes to me. Unless we
want to put the encoding process into the server. Not entirely
infeasible, but a bit of a strain. And the request body might very well
be binary, e.g., on a PUT.

The CGI keys in the environment don't feel at all like bytes to me, but
then they aren't unicode either. They can be unicode, again given a bit
of work on the server side. Though unfortunately browsers are very poor
at indicating their encoding for requests, and it ends up being policy
and configuration as much as anything that determines the encoding of
stuff like wsgi.input. I believe all request paths are UTF8 (?), but
I'm not sure about QUERY_STRING. I'm a little fuzzy on some of the
details there.

The actual response body should also be bytes. Unless again we want to
introduce upstream encoding.

This does make everything feel more complicated.

--
Ian Bicking : ia...@colorstudy.com : http://blog.ianbicking.org

Guido van Rossum

Dec 7, 2007, 12:06:26 AM
to Ian Bicking, Web SIG

It's the same level of complexity you run into as soon as you want to
handle Unicode with WSGI in 2.x though, as it is caused by something
outside our control (HTTP and browsers).

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

Alan Kennedy

Dec 7, 2007, 4:39:09 AM
to James Y Knight, Web SIG
[Phillip]

>> WSGI already copes, actually. Note that Jython and IronPython have
>> this issue today, and see:
>>
>> http://www.python.org/dev/peps/pep-0333/#unicode-issues

[James]


> It would seem very odd, however, for WSGI/python3 to use strings-
> restricted-to-0xFF for network I/O while everywhere else in python3 is
> going to use bytes for the same purpose.

I think it's worth pointing out that the reason for the current
restriction to iso-8859-1 is *because* Python did not have a bytes type
at the time the WSGI spec was drawn up. IIRC, the bytes type had not yet
even been proposed for Py3K. CPython effectively held all byte sequences
as strings, a paradigm which is (still) followed by Jython (not sure
about IronPython).

The restriction to iso-8859-1 is really a distraction; iso-8859-1 is
used simply as an identity encoding that also enforces that all
"bytes" in the string have a value from 0x00 to 0xff, so that they are
suitable for byte-oriented IO. So, in output terms at least, WSGI *is*
a byte-oriented protocol. The problem is that python-the-language
didn't have support for bytes at the time WSGI was designed.

[James]


> You'd have to modify your app
> to call write(unicodetext.encode('utf-8').decode('latin-1')) or so....

Did you mean: write(unicodetext.encode('utf-8').encode('latin-1'))?

Either way, the second encode is not required;
write(unicodetext.encode('utf-8')) is sufficient, since it will
generate a byte-sequence(string) which will (actually "should": see
(*) note below) pass the following test.

try:
    wsgi_response_data.encode('iso-8859-1')
except UnicodeError:
    # Illegal WSGI response data!
    raise

On a side note, it's worth noting that Philip Jenvey's excellent
rework of the jython IO subsystem to use java.nio is fundamentally
byte oriented.

http://www.nabble.com/fileno-support-is-not-in-jython.-Reason--t4750734.html
http://fisheye3.cenqua.com/browse/jython/trunk/jython/src/org/python/core/io

That is because it is based on the new IO design for Python 3K, as described in PEP 3116

http://www.python.org/dev/peps/pep-3116/

Regards,

Alan.

[*] Although I notice that cpython 2.5, for a reason I don't fully
understand, fails this particular encoding sequence. (Maybe it's to do
with the possibility that the result of an encode operation is no
longer an encodable string?)

Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
>>> response.encode('utf-8').encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 22: ordinal not in range(128)
>>>

Meaning that to enforce the WSGI iso-8859-1 convention on cpython 2.5,
you would have to carry out this rigmarole

>>> response.encode('utf-8').decode('latin-1').encode('latin-1')
'interferon-gamma (IFN-\xce\xb3) responses in cattle'
>>>

Perhaps this behaviour is an artifact of the cpython implementation?

Whereas jython passes it just fine (and correctly, IMHO)

Jython 2.2.1 on java1.4.2_15
Type "copyright", "credits" or "license" for more information.
>>> response = u"interferon-gamma (IFN-\u03b3) responses in cattle"
>>> response.encode('utf-8')
'interferon-gamma (IFN-\xCE\xB3) responses in cattle'
>>> response.encode('utf-8').encode('latin-1')
'interferon-gamma (IFN-\xCE\xB3) responses in cattle'

Thomas Broyer

Dec 7, 2007, 5:16:13 AM
to web...@python.org
I wasn't there when PEP 333 was written, nor am I involved in any
Python development, but here are my thoughts:

2007/12/7, Alan Kennedy:


>
> I think it's worth pointing out the reason for the current restriction
> to iso-8859-1 is *because* python did not have a bytes type at the
> time the WSGI spec was drawn up. IIRC, the bytes type had not yet even
> been proposed for Py3K. Cpython effectively held all byte sequences as
> strings, a paradigm which is (still) followed by jython (not sure
> about ironpython).
>
> The restriction to iso-8859-1 is really a distraction; iso-8859-1 is
> used simply as an identity encoding that also enforces that all
> "bytes" in the string have a value from 0x00 to 0xff, so that they are
> suitable for byte-oriented IO. So, in output terms at least, WSGI *is*
> a byte-oriented protocol. The problem is the python-the-language
> didn't have support for bytes at the time WSGI was designed.

If you're talking about the "output stream", then yes, it's all about
bytes (or should be). But at the status and headers level, HTTP/1.1 is
fundamentally ISO-8859-1-encoded.

See:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2 (the note
about *TEXT)
http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2
(field-content is *TEXT, among other things)
http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6.1
(Reason-Phrase is *TEXT)

--
Thomas Broyer

Alan Kennedy

Dec 7, 2007, 6:24:03 AM
to Thomas Broyer, web...@python.org
[Alan]

>> The restriction to iso-8859-1 is really a distraction; iso-8859-1 is
>> used simply as an identity encoding that also enforces that all
>> "bytes" in the string have a value from 0x00 to 0xff, so that they are
>> suitable for byte-oriented IO. So, in output terms at least, WSGI *is*
>> a byte-oriented protocol. The problem is the python-the-language
>> didn't have support for bytes at the time WSGI was designed.

[Thomas]


> If you're talking about the "output stream", then yes, it's all about
> bytes (or should be).

Indeed, I was only talking about output, specifically the response body.

> But at the status and headers level, HTTP/1.1 is
> fundamentally ISO-8859-1-encoded.

Agreed.

That is why the WSGI spec also states

"""
Note also that strings passed to start_response() as a status or as
response headers must follow RFC 2616 with respect to encoding. That
is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME
encoding.
"""

So in order to use non-ISO-8859-1 characters in response status
strings or headers, you must use RFC 2047.
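
As a rough illustration of what RFC 2047 encoding looks like, the sketch below uses the stdlib email.header module under Python 3. It only shows the encoding itself; whether HTTP clients or servers actually honor encoded words in header values is a separate question that this example does not settle. The sample text is an assumption chosen for illustration.

```python
from email.header import Header, decode_header, make_header

# Text containing a character outside ISO-8859-1 (Greek small gamma).
reason = "IFN-\u03b3"

# Produce an RFC 2047 "encoded word"; the result is plain ASCII.
encoded = Header(reason, charset="utf-8").encode()

# Because it is ASCII, it also passes the ISO-8859-1 restriction.
encoded.encode("latin-1")

# Round trip back to the original text.
decoded = str(make_header(decode_header(encoded)))
print(decoded == reason)  # True
```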

As confirmed by the links you posted, this is a HTTP restriction, not
a WSGI restriction.

Regards,

Alan.

Andrew Clover

Dec 7, 2007, 2:11:12 PM
to web...@python.org
Adam Atlas <ad...@atlas.st> wrote:

> I'd say it would be best to only accept `bytes` objects

+1. HTTP is inherently byte-based. Any translation between bytes and
unicode characters should be done at a higher level, by whatever web
framework is living above WSGI.

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/

Phillip J. Eby

Dec 7, 2007, 2:55:47 PM
to Guido van Rossum, Ian Bicking, Web SIG
So here are my recommendations so far for the addendum to WSGI *1.0*
for Python 3.0 (I expect we can be more strict for WSGI 2.0):

* When running under Python 3, applications SHOULD produce bytes
output and headers

* When running under Python 3, servers and gateways MUST accept
strings as application output or headers, under the existing rules
(i.e., s.encode('latin-1') must convert the string to bytes without
an exception)

* When running under Python 3, servers MUST provide CGI HTTP
variables as strings, decoded from the headers using HTTP standard
encodings (i.e. latin-1 + RFC 2047) (Open question: are there any
CGI or WSGI variables that should NOT be strings?)

* When running under Python 3, servers MUST make wsgi.input a binary
(byte) stream

* When running under Python 3, servers MUST provide a text stream for
wsgi.errors

These rules are intended to simplify the porting of existing
code. Notice, for example, that these rules allow middleware to pass
strings through unchanged, since they are not required to produce
bytes output or headers.
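
One way a server might implement the "accept bytes or latin-1 strings" rule for output is sketched below. This is only an illustration of the proposal under Python 3; the function name to_wire_bytes is hypothetical and does not come from PEP 333 or wsgiref.

```python
def to_wire_bytes(value):
    """Hypothetical server-side helper: normalize application output
    (or a header value) to bytes under the proposed rules."""
    if isinstance(value, bytes):
        # Preferred form under the proposal: pass through untouched.
        return value
    if isinstance(value, str):
        try:
            # Existing PEP 333 rule: only code points 0x00-0xFF allowed.
            return value.encode("latin-1")
        except UnicodeEncodeError as exc:
            raise ValueError("string is not representable in ISO-8859-1") from exc
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


print(to_wire_bytes(b"hello"))      # b'hello'
print(to_wire_bytes("caf\u00e9"))   # b'caf\xe9'
```

Middleware that passes strings through unchanged keeps working, because the conversion happens only at the point where the server writes to the socket.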

Unfortunately, wsgi.input can't be coded around, but for most
frameworks this should be a single point of pain. In fact, if the
'cgi' stdlib module is made compatible with bytes, only the rare
framework that rolls its own multipart parser or otherwise directly
manipulates put/post data will be affected. Code that just takes the
input and writes it to a file won't be bothered, either.

Comments or questions?

Ian Bicking

Dec 7, 2007, 3:24:56 PM
to Phillip J. Eby, Web SIG
Phillip J. Eby wrote:
> So here are my recommendations so far for the addendum to WSGI *1.0* for
> Python 3.0 (I expect we can be more strict for WSGI 2.0):
>
> * When running under Python 3, applications SHOULD produce bytes output
> and headers
>
> * When running under Python 3, servers and gateways MUST accept strings
> as application output or headers, under the existing rules (i.e.,
> s.encode('latin-1') must convert the string to bytes without an exception)
>
> * When running under Python 3, servers MUST provide CGI HTTP variables
> as strings, decoded from the headers using HTTP standard encodings (i.e.
> latin-1 + RFC 2047) (Open question: are there any CGI or WSGI variables
> that should NOT be strings?)

I believe that SCRIPT_NAME/PATH_INFO would be UTF-8 encoded, not latin-1.
That is, after you urldecode the values (as WSGI asks you to do), the
proper conversion to text is to decode them as UTF-8.

I'm a bit confused on how HTTP_COOKIE gets encoded. And QUERY_STRING
also confuses me.

Is this all compatible with os.environ in py3k? I don't care that much
if it does, but as the starting point for CGI it would be interesting if
it stays in sync.

James Y Knight

Dec 7, 2007, 3:53:03 PM
to Phillip J. Eby, Web SIG

On Dec 7, 2007, at 2:55 PM, Phillip J. Eby wrote:

> * When running under Python 3, servers MUST provide CGI HTTP
> variables as strings, decoded from the headers using HTTP standard
> encodings (i.e. latin-1 + RFC 2047) (Open question: are there any
> CGI or WSGI variables that should NOT be strings?)

A WSGI gateway should *not* decode headers using RFC 2047. It actually
*cannot*, without knowing the structure of that particular header,
because only TEXT tokens are encoded that way. In addition, I know of
nobody who actually implements RFC 2047 decoding of http header
values...nothing really uses it. (of course I don't know of all
implementations out there.)


On Dec 7, 2007, at 3:24 PM, Ian Bicking wrote:

> I believe that SCRIPT_NAME/PATH_INFO would be UTF8 encoded, not
> latin1.
> That is, after you urldecode the values (as WSGI asks you to do)
> proper conversion to text is to decode it as UTF8.

Surely not! URLs aren't always utf-8 encoded, only often.
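
A small sketch of why a fixed choice is awkward, assuming Python 3 and the stdlib urllib.parse module. The helper decode_path and its fallback order are hypothetical, a policy decision made up for this example rather than anything WSGI prescribes:

```python
from urllib.parse import unquote_to_bytes


def decode_path(raw_path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: percent-decode a path to bytes, then try
    each candidate encoding in order until one succeeds."""
    raw = unquote_to_bytes(raw_path)
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("path is not decodable with any candidate encoding")


# The same visible path, sent by clients using different encodings.
print(decode_path("/caf%C3%A9") == "/caf\u00e9")  # True (UTF-8 bytes)
print(decode_path("/caf%E9") == "/caf\u00e9")     # True (latin-1 fallback)
```

Note that the fallback is a guess: bytes that happen to be valid UTF-8 are always treated as UTF-8, even if the client meant something else.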

James

Andrew Clover

Dec 7, 2007, 5:46:32 PM
to web...@python.org
James Y Knight wrote:

> In addition, I know of nobody who actually implements RFC 2047
> decoding of http header values...nothing really uses it. (of
> course I don't know of all implementations out there.)

Certainly no browser supports it, which makes the point moot for WSGI.
Most browsers, when quoting a header parameter, simply encode using the
previous page's charset and put quotes around it... even if the
parameter has a quote or control codes in it.

Ian wrote:

> Is this all compatible with os.environ in py3k?

In 3.0a2 os.environ has Unicode strings for both keys and values. This
is correct for Windows where environment variables are explicitly
Unicode, but questionable (IMO) for Unix where they're really bytes that
may or may not represent decodeable Unicode strings.

>> SCRIPT_NAME/PATH_INFO

This already causes problems in Windows CGI applications! Because these
are passed in environment variables, IIS* has to decode the submitted
bytes to Unicode first. It seems always to choose UTF-8 for this job,
which I suppose is the least bad guess, but hardly infallible.

(* - haven't tested this with Apache for Windows yet.)

In Python 2.x, os.environ being byte strings, Python/the C library then
has to encode them back to bytes, which I believe ends up using the
system codepage. Since the system codepage is never UTF-8 on Windows
this means not only that the bytes read back from eg. PATH_INFO are not
the same as the original bytes submitted to the web server, but that if
there are characters outside the system codepage submitted, they'll be
unrecoverable.

If os.environ remains Unicode in Unix and WSGI follows it (as it must if
CGI-invoked WSGI is to continue working smoothly), webapps that try to
allow for non-ASCII characters in URLs are likely to get some nasty
deployment problems that depend on the system encoding setting,
something that will be particularly troublesome for end-users to debug
and fix.

OTOH making the dictionaries reflect the underlying OS's conception of
environment variables means users of os.environ and WSGI will have to be
able to cope with both bytes and unicode, which would also be a big
annoyance.

In summary: urgh, this is all messy and 'orrible.

James Y Knight

Dec 7, 2007, 6:18:58 PM
to Andrew Clover, web...@python.org

On Dec 7, 2007, at 5:46 PM, Andrew Clover wrote:
> OTOH making the dictionaries reflect the underlying OS's conception of
> environment variables means users of os.environ and WSGI will have
> to be
> able to cope with both bytes and unicode, which would also be a big
> annoyance.
>
> In summary: urgh, this is all messy and 'orrible.

I suppose this is more a question for python-dev, but, it'd be really
nice if Python on Windows made it look like the windows system
encoding was always UTF-8. That is, bytestrings used for open/
os.environ/argv/etc. are always encoded/decoded in utf-8, not the
broken-platform-encoding. Then the same code would work just as well
on unix as it does on windows.

Actually, I bet I could implement that today, just by wrapping some
stuff....hmmm...

James

Graham Dumpleton

Dec 8, 2007, 3:37:59 AM
to Phillip J. Eby, Web SIG
On 08/12/2007, Phillip J. Eby <p...@telecommunity.com> wrote:
> * When running under Python 3, servers MUST provide a text stream for
> wsgi.errors

In Python 3, what happens if user code attempts to write a byte string
to a text stream? I.e., what would be displayed?

Also, if wsgi.errors is a text stream, I presume that if a WSGI adapter
has to internally map this to a C char*-like API for logging, it would
need to apply a standard Python encoding to yield a usable char*
string for output.

Graham

Guido van Rossum

Dec 8, 2007, 9:48:56 PM
to Graham Dumpleton, Web SIG
On Dec 8, 2007 12:37 AM, Graham Dumpleton <graham.d...@gmail.com> wrote:
> On 08/12/2007, Phillip J. Eby <p...@telecommunity.com> wrote:
> > * When running under Python 3, servers MUST provide a text stream for
> > wsgi.errors
>
> In Python 3, what happens if user code attempts to output to a text
> stream a byte string? Ie., what would be displayed?

Nothing. You get a TypeError.

> Also, if wsgi.errors is a text stream, presume that if a WSGI adapter
> has to internally map this to a C char* like API for logging that it
> would need to apply standard Python encoding to yield usable char*
> string for output.

The encoding can/must be specified per text stream.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

Graham Dumpleton

Dec 9, 2007, 10:56:55 PM
to Guido van Rossum, Web SIG
On 09/12/2007, Guido van Rossum <gu...@python.org> wrote:
> On Dec 8, 2007 12:37 AM, Graham Dumpleton <graham.d...@gmail.com> wrote:
> > On 08/12/2007, Phillip J. Eby <p...@telecommunity.com> wrote:
> > > * When running under Python 3, servers MUST provide a text stream for
> > > wsgi.errors
> >
> > In Python 3, what happens if user code attempts to output to a text
> > stream a byte string? Ie., what would be displayed?
>
> Nothing. You get a TypeError.

Hmmm, this in itself could be quite a pain for existing code where
people have added debug code to print out details from request headers
(if now to be passed as bytes), or part of the request content.

What is the suggested way to dump bytes for debugging purposes so that
one does not have to worry about encoding issues? Just use repr()?

> > Also, if wsgi.errors is a text stream, presume that if a WSGI adapter
> > has to internally map this to a C char* like API for logging that it
> > would need to apply standard Python encoding to yield usable char*
> > string for output.
>
> The encoding can/must be specified per text stream.

But what should the encoding associated with the wsgi.errors stream be?

If code which outputs text to wsgi.errors can use any valid Unicode
character and the stream is set to US-ASCII encoding, then there is a
chance that logging output will fail because of characters that are not
valid in that character set. If one instead uses UTF-8, then there are
potential issues where the byte string coming out of the other end of
the text stream is passed to C API functions. Issues might arise here
if the C API is not expecting a variable-width character encoding.

I'll freely admit I am not across all this Unicode encode/decode stuff,
as I don't generally have to deal with foreign languages, but there seem
to be a few missing details in this area which need to be filled out for
a modified WSGI specification.

Graham

Guido van Rossum

Dec 10, 2007, 1:31:20 PM
to Graham Dumpleton, Web SIG
On Dec 9, 2007 7:56 PM, Graham Dumpleton <graham.d...@gmail.com> wrote:
> On 09/12/2007, Guido van Rossum <gu...@python.org> wrote:
> > On Dec 8, 2007 12:37 AM, Graham Dumpleton <graham.d...@gmail.com> wrote:
> > > On 08/12/2007, Phillip J. Eby <p...@telecommunity.com> wrote:
> > > > * When running under Python 3, servers MUST provide a text stream for
> > > > wsgi.errors
> > >
> > > In Python 3, what happens if user code attempts to output to a text
> > > stream a byte string? Ie., what would be displayed?
> >
> > Nothing. You get a TypeError.
>
> Hmmm, this in itself could be quite a pain for existing code where
> people have added debug code to print out details from request headers
> (if now to be passed as bytes), or part of the request content.

Sorry, I was just talking about the write() method on a text stream.
The print() function in 3.0 will print the repr() of the bytes.
Example:

Python 3.0a2 (py3k, Dec 10 2007, 09:38:42)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = b"xyz"
>>> print(a)
b'xyz'
>>> b = b"abc\377def"
>>> print(b)
b'abc\xffdef'
>>>

(Note that this works because print() always calls str() on the
argument and bytes.__str__ is defined to be the same as bytes.__repr__.)

> What is the suggested way of best dumping out bytes for debugging
> purposes so one does not have to worry about encoding issues, just use
> repr()?

Just use print().

> > > Also, if wsgi.errors is a text stream, presume that if a WSGI adapter
> > > has to internally map this to a C char* like API for logging that it
> > > would need to apply standard Python encoding to yield usable char*
> > > string for output.
> >
> > The encoding can/must be specified per text stream.
>
> But what should the encoding associated with the wsgi.errors stream be?

Depends on the platform and your requirements.

> If code which outputs text to wsgi.errors can use any valid Unicode
> character, if one sets it to US-ASCII encoding, then chance that
> logging output will fail because of characters not being valid in that
> character set. If one instead uses UTF-8, then potentially have issues
> where that byte string coming out other end of text stream is passed
> to C API functions. Issues might arise here where C API not expecting
> variable width character encoding.
>
> I'll freely admit I am not across all this Unicode encode/decode stuff
> as I don't generally have to deal with foreign languages, but seems to
> be a few missing details in this area which need to be filled out for
> a modified WSGI specification.

The goal of this part of Py3k is to make it more obvious when you
haven't thought through your encoding issues enough by failing as soon
as (encoded) bytes meet (decoded) characters.

Of course, you can still run into delayed trouble by using an
inappropriate encoding, which only shows up when there is an actual
encoding or decoding error; but at least you will have carefully
distinguished between encoded and decoded text throughout your
program, so the fix is now to change the encoding rather than having
to restructure your code to properly separate encoded and decoded
text.
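
A short sketch of both behaviours, assuming a current Python 3 and the stdlib io module. The stream below merely stands in for wsgi.errors; the choice of ASCII with the backslashreplace error handler is an assumption for illustration, not something the WSGI spec mandates:

```python
import io

# Stand-in for wsgi.errors: a text stream with an explicit encoding.
raw = io.BytesIO()
errors_stream = io.TextIOWrapper(
    raw, encoding="ascii", errors="backslashreplace", newline="\n"
)

# write() on a text stream refuses bytes outright.
try:
    errors_stream.write(b"abc\xffdef")
    wrote_bytes = True
except TypeError:
    wrote_bytes = False

# print() goes through str(), so bytes show up as their repr().
print(b"abc\xffdef", file=errors_stream)

# Non-ASCII text is escaped rather than raising, because of the
# error handler chosen above.
errors_stream.write("caf\u00e9\n")
errors_stream.flush()

logged = raw.getvalue()
print(logged)
```

With a strict error handler instead, the last write would raise UnicodeEncodeError, which is the "delayed trouble" case: the fix is then to pick a different encoding or error handler for the stream, not to restructure the calling code.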

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
