The definitions as they stand are clear enough to understand and implement, but they are not currently in spec-worthy language (e.g. the text says "should" and "may" colloquially, but actually means MUST in some places and SHOULD in others, as defined by RFC 2119).
Thus, I'd like to suggest that Graham (if he's willing?) should reformat the "Definition"/"Amendments" as an actual diff against the current PEP 333. Then, I will recommend adopting that document as an actual standard WSGI 1.1, to replace PEP 333.
This discussion has gone on long enough, and it doesn't really matter as much to have the perfect API, as it does to have a standard.
James
[1] http://code.google.com/p/modwsgi/wiki/SupportForPython3X
[2] http://www.wsgi.org/wsgi/Amendments_1.0
_______________________________________________
Web-SIG mailing list
Web...@python.org
Web SIG: http://www.python.org/sigs/web-sig
Unsubscribe: http://mail.python.org/mailman/options/web-sig/python-web-sig-garchive-9074%40googlegroups.com
http://listtree.appspot.com/wsgi2/ICvaujouPxb2gfEhDS_aiw
-- Aaron Watters
--- On Thu, 11/26/09, James Y Knight <fo...@fuhm.net> wrote:
I'm +1, with a few caveats. First, as you mention, it needs to be
spec'd properly. In particular, it should be clarified that the main
changes are to *allow byte strings* in certain places where WSGI 1.0
demands a unicode string w/latin-1 encoding.
Second, I do not think that the "additional guarantees/requirements"
can be safely added to a 1.x version, as they make it impossible for
an app to tell whether it's *really* running under 1.1 or under a
broken piece of middleware that's passing through wsgi.version but
not actually providing 1.1-level guarantees. I would therefore
suggest that these additional guarantees and requirements be deferred
to WSGI 2.0.
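To make the concern concrete, here is a minimal sketch of a piece of middleware that is valid under WSGI 1.0 but silently drops a 1.1-level guarantee. All names (OldStyleInput, replacing_middleware, app) are hypothetical and not taken from any real package; the environ keys are the standard WSGI ones.

```python
import io


class OldStyleInput:
    """Hypothetical stream written against WSGI 1.0: readline() has no size hint."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size):
        return self._buf.read(size)

    def readline(self):
        return self._buf.readline()


def replacing_middleware(app):
    """Valid 1.0 middleware: swaps wsgi.input, passes wsgi.version through."""
    def wrapper(environ, start_response):
        length = int(environ.get('CONTENT_LENGTH') or 0)
        body = environ['wsgi.input'].read(length)
        environ['wsgi.input'] = OldStyleInput(body)
        return app(environ, start_response)
    return wrapper


def app(environ, start_response):
    """An app that trusts wsgi.version == (1, 1) and uses a size hint."""
    start_response('200 OK', [('Content-Type', 'text/plain')])
    try:
        environ['wsgi.input'].readline(10)
    except TypeError:
        # The version tuple said 1.1, but the stream only offers 1.0 behaviour.
        return [b'claimed 1.1, got a 1.0 stream']
    return [b'size hint accepted']
```

The app has no way of telling the two situations apart from wsgi.version alone, since the middleware never touched that key.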
Okay, let's look at these additional requirements in more detail. I see 4 that should be kept, 1 that can be dispensed with, and 1 I'm not sure about.
> 1. The 'readline()' function of 'wsgi.input' may optionally take a size hint.
Already de-facto required. Leaving it out helps no-one. KEEP.
> 2. The 'wsgi.input' must provide an empty string as end of input stream marker.
I don't think this will be a problem. What would WSGI middleware do to break this requirement? It was only put in in the first place so that CGI adapters could pass through their input stream (which may not ever provide an EOF) without having to wrap it. I agree that was a mistake, and should be corrected. KEEP.
> 3. The size argument to 'read()' function of 'wsgi.input' would be optional and if not supplied the function would return all available request content. Thus would make 'wsgi.input' more file like as the WSGI specification suggests it is, but isn't really per original definition.
This one could be a problem with middleware, and that feature shouldn't ever be used, in any case: reading into memory an arbitrary amount of data from a client is not a good thing to encourage. OMIT.
> 4. The 'wsgi.file_wrapper' supplied by the WSGI adapter must honour the Content-Length response header and must only return from the file that amount of content. This would guarantee that using wsgi.file_wrapper to return part of a file for byte range requests would work.
Given item #6, I suppose this is actually just a matter of efficiency, in case the file wrapper is sent to a middleware rather than directly to the wsgi gateway? If it goes directly to the gateway, that can of course stop reading by itself. ?undecided?
> 5. Any WSGI application or middleware should not return more data than specified by the Content-Length response header if defined.
As long as this is meant as "SHOULD", that's fine. It's not actually a requirement, but rather a suggestion of best practices. KEEP.
> 6. The WSGI adapter must not pass on to the server any data above what the Content-Length response header defines if supplied.
This is already required by HTTP. If the WSGI gateway doesn't make this happen somehow, it's generating invalid HTTP and that's a bug. Okay to clarify in the spec to ensure people don't miss the requirement when implementing. KEEP.
James
I agree with 2 of your keeps, and remain -0.5 to -1 on the
others. See below...
> > 1. The 'readline()' function of 'wsgi.input' may optionally take
> a size hint.
>
>Already de-facto required. Leaving it out helps no-one. KEEP.
Fair enough, since it's a MAY. On the other hand, because it's a
MAY, it actually *helps* no-one, from a spec compatibility
POV. (That is, you have to test whether it's available, so it's no
different than it not being in the spec to begin with.)
So, putting it in doesn't *hurt*, but neither does it *help*... so I
lean towards leaving it to 2.x, where it can actually help.
> > 2. The 'wsgi.input' must provide an empty string as end of input
> stream marker.
>
>I don't think this will be a problem. What would WSGI middleware do
>to break this requirement?
It could be reading the original input stream, and replacing it with
another one. Not very common I would guess, but it's still possible
for a piece of perfectly valid 1.0 middleware to fail this
requirement for 1.1, leading to the condition where you really can't
tell if you're running valid 1.1 or not.
>It was only put in in the first place so that CGI adapters could
>pass through their input stream (which may not ever provide an EOF)
>without having to wrap it. I agree that was a mistake, and should be
>corrected.
I agree... but only in 2.x.
> > 3. The size argument to 'read()' function of 'wsgi.input' would
> be optional and if not supplied the function would return all
> available request content. Thus would make 'wsgi.input' more file
> like as the WSGI specification suggests it is, but isn't really per
> original definition.
>
>This one could be a problem with middleware, and that feature
>shouldn't ever be used, in any case: reading into memory an
>arbitrary amount of data from a client is not a good thing to encourage. OMIT.
Agreed -- even in 2.x it's questionable if not harmful.
> > 4. The 'wsgi.file_wrapper' supplied by the WSGI adapter must
> honour the Content-Length response header and must only return from
> the file that amount of content. This would guarantee that using
> wsgi.file_wrapper to return part of a file for byte range requests would work.
>
>Given item #6, I suppose this is actually just a matter of
>efficiency, in case the file wrapper is sent to a middleware rather
>than directly to the wsgi gateway? If it goes directly to the
>gateway, that can of course stop reading by itself. ?undecided?
I don't really see how this one helps anything in 1.x, and so lean
towards leaving it out.
> > 5. Any WSGI application or middleware should not return more data
> than specified by the Content-Length response header if defined.
>
>As long as this is meant as "SHOULD", that's fine. It's not actually
>a requirement, but rather a suggestion of best practices. KEEP.
>
> > 6. The WSGI adapter must not pass on to the server any data above
> what the Content-Length response header defines if supplied.
>
>This is already required by HTTP. If the WSGI gateway doesn't make
>this happen somehow, it's generating invalid HTTP and that's a bug.
>Okay to clarify in the spec to ensure people don't miss the
>requirement when implementing. KEEP.
Good points - I agree with these two, and they can be considered 1.0
clarifications as well. After the first four (which I see no reason
to include) I was probably a little over-inclined to throw these two
out (especially since I was reading the "should" above as a "must",
per your proposal).
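As an illustration of item 6, a gateway could cap the response body roughly as follows. This is only a sketch under simple assumptions: limit_to_content_length is a made-up helper name, headers are the usual list of (name, value) pairs, and closing the underlying iterable is left out for brevity.

```python
def limit_to_content_length(headers, body_iter):
    """Yield body chunks, never exceeding a declared Content-Length."""
    remaining = None
    for name, value in headers:
        if name.lower() == 'content-length':
            remaining = int(value)
    for chunk in body_iter:
        if remaining is None:
            # No Content-Length declared: pass everything through.
            yield chunk
            continue
        if remaining <= 0:
            break
        chunk = chunk[:remaining]
        remaining -= len(chunk)
        yield chunk
```

For example, with a declared length of 5 and chunks b'abc' and b'defg', only b'abc' and b'de' would be passed on to the server.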
http://blog.dscpl.com.au/2009/10/details-on-wsgi-10-amendmentsclarificat.html
I will post again later in detail when I have some time, to explain a
few more points not mentioned in that post and where people aren't
quite understanding the reasoning for doing things.
One very quick comment about read().
Allowing read() with no argument is no different to a user saying
read(int(environ['CONTENT_LENGTH'])). Because a WSGI adapter/middleware
is going to have to track bytes read to ensure it can return an empty
string as end sentinel, it will know the length remaining and, for
read() with no argument, would internally do read(remaining_bytes). As
such, there is no real difference in memory use from implementing
read() with no argument, given the need to implement the end sentinel.
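A rough sketch of the bookkeeping described here, assuming the adapter knows the request's Content-Length up front. LengthLimitedInput is a hypothetical name, not mod_wsgi's actual implementation, and bytes are used for the end sentinel as they would be on Python 3.

```python
import io


class LengthLimitedInput:
    """Wrap a raw input stream and stop at Content-Length bytes."""

    def __init__(self, stream, content_length):
        self._stream = stream
        self._remaining = content_length

    def read(self, size=-1):
        # read() with no argument becomes read(remaining_bytes).
        if size is None or size < 0 or size > self._remaining:
            size = self._remaining
        if size == 0:
            return b''  # end-of-input sentinel
        data = self._stream.read(size)
        self._remaining -= len(data)
        return data
```

Because the wrapper already has to count bytes to produce the sentinel, supporting the no-argument form costs one extra branch.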
Also, you have concerns about read() with no argument, but frankly
readline() with no argument, which is already required, is much worse,
because an implementation can't really just track bytes read and read
to the end of input. The caller only wants to read to the end of the
line, so reading all input would blow out memory use unreasonably, as
you speculate for read(). As such, a readline() implementation is
likely to read in blocks and buffer internally, where read() doesn't
necessarily have to.
It may also be pertinent to read:
http://blog.dscpl.com.au/2009/10/wsgi-issues-with-http-head-requests.html
as from memory it talks about issues with not paying attention to
Content-Length on output filtering middleware as well.
As I said, I will reply later when I have some time to focus. Right now I
have a 2 year old to keep amused.
Graham
2009/11/27 James Y Knight <fo...@fuhm.net>:
I still believe, though, that there are underlying problems in the WSGI
specification and that right now various stuff is working more by luck
than by design. In some cases such as readline(), the majority of WSGI
applications/frameworks are in violation of the WSGI 1.0 specification
due to their reliance on cgi.FieldStorage which makes calls to
readline() with an argument.
Either way, since there seemed to be objections at some level on every
point, and since I really really have no enthusiasm for this stuff any
more or of fighting for any change, I retract my personal interest in
having any of the amendments as part of a WSGI 1.1 specification and
will remove all that detail from mod_wsgi documentation. I will
instead replace it with a separate page describing mod_wsgi compliance
with WSGI 1.0 specification and highlighting those specific features
which are in common, or not so common use, via mod_wsgi and which
actually mean that people are writing applications incompatible with
the WSGI 1.0 specification.
To ensure compliance I could well raise Python exceptions for any use
which isn't WSGI 1.0 compliant, but I have already learnt, from when I
tried to get people to write portable WSGI applications by giving errors
on certain use of stdin and stdout, that it is a pointless battle. All
it got was a long list of users who believe mod_wsgi is broken even
though if they read the actual documentation they would find it was
their own software which was suspect or at least wasn't portable to
all WSGI hosting mechanisms. This would only get worse if exceptions
were raised for use of readline() with an argument and use of read()
with no argument or argument of -1. Short story is that there are a
fair few people who are just lazy; they will always write stuff the
way they want to and not how it should be written. They will always
blame other people's code for being wrong before acknowledging they
themselves are wrong.
The only answer I therefore need out of WEB-SIG is whether the
qualifications about how Python 3.X is to be supported are going to be
an amendment to WSGI 1.0 or a separate WSGI 1.1 update and, if the
latter, whether the WSGI 1.1 tag will also have meaning for
Python 2.X.
I need an answer to this so I know whether to withdraw mod_wsgi 3.0
from download and replace it with a mod_wsgi 4.0 which changes the
wsgi.version tuple being passed, for both Python 2.X and Python 3.X,
from (1, 1) back to original (1, 0), given that some opinion seems to
be that any interface changes can only really be performed as part of
WSGI 2.0 and so I would be wrong in using (1, 1).
If I don't see an answer, then I guess I will just have to revert it
back to (1, 0) to be safe and to avoid any accusations that I am
hijacking the process.
An answer sooner rather than later would be appreciated on the
wsgi.version issue.
Graham
2009/11/28 Graham Dumpleton <graham.d...@gmail.com>:
Answering my own question, it is actually obvious that it has to be
called (1, 0). This is because wsgiref in Python 3.X already calls it
(1, 0) and I don't have much choice but to be in agreement with that.
I will therefore replace mod_wsgi 3.0 with a 4.0 release that reverts
it to (1, 0) from (1, 1) and all the other stuff about amendments can
be ignored.
[...]
> If don't see an answer, then guess I will just have to revert it back
> to (1, 0) to be safe and to avoid any accusations that am highjacking
> the process.
>
> An answer sooner rather than later would be appreciated on the
> wsgi.version issue.
I'd rather appreciate it if you held off on making such changes until this discussion either peters out or is resolved. You sound somewhat negative, but it seems to me that we're actually quite close to a consensus on adopting most of your proposal. Changing the proposal out from under us doesn't really help things.
The next step here is clearly for someone to redraft the changes as a diff against PEP 333. If you do not have any interest in being that person, please make that clear, so someone else can step up to do so.
James
No, I do not want a part in drafting any changes; I just want to move
on from all this stuff and start working on other projects. Since
some don't seem to understand the reasons for the changes, though,
you will find it hard to find someone who is in a position to be able
to do them.
You probably really are just better off worrying about Python 3.X
support and accepting that tinkering at the edges of WSGI 1.0 on other
issues is not going to solve all the WSGI issues. As PJE suggests, leave
that to an interface-incompatible update so that you don't have this
whole problem of what version existing components support.
Graham
> Answering my own question, it is actually obvious that it has to be
> called (1, 0). This is because wsgiref in Python 3.X already calls it
> (1, 0) and don't have much choice to be in agreement with that.
wsgiref.simple_server in Python 3 to date is not something that anyone
should worry about being compatible with. It is a 2to3 hack that cannot
meaningfully claim to represent wsgi version anything.
Careless use of urllib.parse.unquote causes 3.0's simple_server not to
work at all, and 3.1's to mangle the path by treating it as UTF-8
instead of ISO-8859-1, as 'WSGI 1.1' proposed and mod_wsgi (and even
mod_cgi via wsgiref.CGIHandler) delivered.
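The difference between the two decodings can be seen with a small sketch using urllib.parse.unquote as it behaves in current CPython (default encoding 'utf-8', errors 'replace'). The sample path is made up for illustration.

```python
from urllib.parse import unquote

# The single byte 0xE9 ('e' with acute accent in ISO-8859-1), percent-encoded.
raw_path = '/caf%E9'

# Default decoding: UTF-8 with errors='replace', so the lone byte is lost.
as_utf8 = unquote(raw_path)

# ISO-8859-1 decoding: one code point per byte, nothing is lost.
as_latin1 = unquote(raw_path, encoding='latin-1')

# The application can always get the real bytes back from the latin-1 form.
original_bytes = as_latin1.encode('latin-1')
```

With the default call the 0xE9 byte becomes U+FFFD and cannot be recovered, whereas the latin-1 form encodes back to the exact bytes the client sent.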
Yes, I'm always going on about Unicode paths. I'm fed up of shipping
apps with a page-long deployment note about fixing them. It pains me
that in so many years both this and "What do we do about Python 3?"
still haven't been addressed.
mod_wsgi 3.0 already has more traction than wsgiref 3.1 and I would
prefer not to see more farcical reverse-progress at this point.
For what it's worth my responses on the issues of this thread. But at
this point I really just want a BDFL to just come and do it, whatever it
is. A new WSGI, whatever the version number, is massively overdue.
>> 1. The 'readline()' function of 'wsgi.input' may optionally take a
size hint.
Yes. Obviously. Bad practice but unavoidable now. Should have been a 1.0
amendment a long time ago.
>> 2. The 'wsgi.input' must provide an empty string as end of input
stream marker.
>> 3. The size argument to 'read()' function of 'wsgi.input' would be
optional and if not supplied the function would return all available
request content.
>> 4. The 'wsgi.file_wrapper' supplied by the WSGI adapter must honour
the Content-Length response header and must only return from the file
that amount of content.
+0. Seems reasonable but don't massively care. Presumably an application
must refuse to run on 1.0 if it requires these behaviours?
>> 5. Any WSGI application or middleware should not return more data
than specified by the Content-Length response header if defined.
>> 6. The WSGI adapter must not pass on to the server any data above
what the Content-Length response header defines if supplied.
Yes.
--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
Okay, not sensing any other volunteers here...I guess it's all me.
The intention of this spec update is to be compatible with existing middleware/applications when running on Python 2.X. Apps/middleware running on python 3.X require changes in any case, and this specification will tell them exactly what to expect. That Python 3.X middleware and WSGI adapters will have to deal with both bytestrings and unicode strings in many parts of the API (output status code, output headers, output response iterable/write callback) will add some complexity, but that's life.
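As a rough idea of what "dealing with both" could mean for an adapter on Python 3.X, here is a sketch that accepts either type for the status and headers. The helper names are hypothetical, and the ISO-8859-1 rule for unicode strings is the one under discussion in this thread, not settled spec text.

```python
def to_wire_bytes(value):
    """Return status/header text as bytes, accepting either bytes or str."""
    if isinstance(value, bytes):
        return value
    # Assumption: str values are restricted to ISO-8859-1 code points.
    return value.encode('latin-1')


def serialize_headers(status, headers):
    """Build the raw response head from mixed bytes/str inputs."""
    lines = [b'HTTP/1.1 ' + to_wire_bytes(status)]
    for name, value in headers:
        lines.append(to_wire_bytes(name) + b': ' + to_wire_bytes(value))
    return b'\r\n'.join(lines) + b'\r\n\r\n'
```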
Any WSGI implementations on Python 3.X claiming compliance with WSGI 1.0 are most likely broken, and their behavior cannot be relied upon. Too bad about wsgiref.
As self-appointed author, I am going to take a stand and say that both the python3-related string-type specifications, and the additional requirements except #3 (read() with no-args) and #4 (file_wrapper looking at Content-Length), will be included.
And it will be called WSGI 1.1.
Back to the list of "extra requirements":
#1: (readline with an arg) must be included, despite the potential for breakage. That ship has already sailed, the breakage has already occurred, it's already required. Disagreement here really is of no consequence.
#2: (wsgi.input must return an empty string at EOF): I do not believe this will break any middleware. It will require some changes in some WSGI adapter implementations, but that's acceptable. If you have a real-life example of middleware that would break here, show it. So this will be included.
#3 is not actually required for anything; at best it's an extra convenience; repeatedly reading until EOF will work just as well. Furthermore, the API change has the potential to break some middleware in Python 2.X, so I'll take the safe road and not make the change.
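The "repeatedly reading until EOF" alternative might look like the sketch below. It relies on requirement #2 (an empty string at end of input); iter_input and the default chunk size are illustrative choices, not part of any spec.

```python
import io


def iter_input(stream, chunk_size=8192):
    """Yield successive chunks from a wsgi.input-like stream until EOF."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Empty string: end of input, as requirement #2 guarantees.
            return
        yield chunk
```

An application can process each chunk as it arrives, which avoids holding an arbitrarily large request body in memory.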
The purpose behind #4 is essentially included in #6, and so is not needed as a separate requirement.
#5 and #6 are uncontroversial and of no impact to an already-correct implementation. They will be included.
I'll send a diff of the actual wording changes once I've written it.
Hi.
Just a few questions.
It is true that HTTP headers can be encoded assuming latin-1; and they
can be encoded using PEP 383.
However, what about URIs (that is, PATH_INFO and the like)?
For URIs (if I remember correctly) the suggested encoding is UTF-8, so
URLs should be decoded using
url.decode('utf-8', 'surrogateescape')
Is this correct?
Now another question.
Let's consider the `wsgiref.util.application_uri` function
    def application_uri(environ):
        url = environ['wsgi.url_scheme']+'://'
        from urllib.parse import quote
        if environ.get('HTTP_HOST'):
            url += environ['HTTP_HOST']
        else:
            url += environ['SERVER_NAME']
            if environ['wsgi.url_scheme'] == 'https':
                if environ['SERVER_PORT'] != '443':
                    url += ':' + environ['SERVER_PORT']
            else:
                if environ['SERVER_PORT'] != '80':
                    url += ':' + environ['SERVER_PORT']
        url += quote(environ.get('SCRIPT_NAME') or '/')
        return url
There is a potential problem, here, with the quote function.
This function does the following:
    def quote(string, safe='/', encoding=None, errors=None):
        if isinstance(string, str):
            if encoding is None:
                encoding = 'utf-8'
            if errors is None:
                errors = 'strict'
            string = string.encode(encoding, errors)
This means that if we use surrogateescape, the information about the
original bytes is lost here.
This can be easily fixed by changing the application_uri function, but
this also means that a WSGI application will not work with Python 3.1.x.
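The interaction can be reproduced with a short sketch. The sample path is made up, and the behaviour shown is that of urllib.parse.quote in current CPython: with the default errors='strict' the surrogate-escaped string cannot be encoded at all, while passing errors='surrogateescape' explicitly restores the original byte.

```python
from urllib.parse import quote

# A path holding the raw byte 0xE9, decoded with UTF-8 + surrogateescape.
script_name = b'/caf\xe9'.decode('utf-8', 'surrogateescape')

try:
    quote(script_name)  # errors defaults to 'strict'
    strict_result = 'ok'
except UnicodeEncodeError:
    strict_result = 'UnicodeEncodeError'

# Passing the error handler through lets quote() see the original byte.
escaped = quote(script_name, errors='surrogateescape')
```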
Finally, a question about cookies.
Cookie data SHOULD be transparent to the server/gateway; however WSGI is
going to assume that data is encoded in latin-1.
I don't know what the HTTP/Cookie spec says about this.
However, from a WSGI application point of view, the cookie data can, as
an example, contain some text encoded in UTF-8; this means that the
application must first encode the data:
cookie_bytes = cookie.encode('latin-1', 'surrogateescape')
and then decode it using UTF-8:
my_cookie_data = cookie_bytes.decode('utf-8')
This is a bit unreasonable, but I don't know if this is a common
practice (I do this, just to make an example).
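Spelled out end to end, the round trip looks like this sketch. The cookie value is invented for illustration; note that after a plain latin-1 decode no surrogates can occur, so the 'surrogateescape' handler on the encode step has nothing to do and is omitted here.

```python
# Hypothetical cookie value as the client sent it: UTF-8 encoded bytes.
wire_bytes = 'na\u00efve'.encode('utf-8')

# What the application would receive if the gateway decodes as latin-1.
cookie = wire_bytes.decode('latin-1')

# The application undoes the gateway's decoding, then applies its own.
cookie_bytes = cookie.encode('latin-1')
my_cookie_data = cookie_bytes.decode('utf-8')
```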
Manlio Perillo
> However what about URI (that is, for PATH_INFO and the like)?
> For URI (if I remember correctly) the suggested encoding is UTF-8, so
> URLS should be decoded using
> url.decode('utf-8', 'surrogateescape')
> Is this correct?
The currently-discussed proposal is ISO-8859-1, allowing the real bytes
to be trivially extracted. This is consistent with the other headers and
would be my preferred approach.
Python 3.1's wsgiref.simple_server, on the other hand, blindly uses
urllib.parse.unquote, which defaults to UTF-8 without surrogateescape,
mangling any non-UTF-8 input.
I don't really care whether UTF-8+surrogateescape or ISO-8859-1 encoding
is blessed. But *something* needs to be blessed. An encoding, an
alternative undecoded path_info, both, something else... just *something*.
> Let's consider the `wsgiref.util.application_uri` function
> There is a potential problem, here, with the quote function.
Yes. wsgiref is broken in Python 3.1. Not quite as broken as it was in
3.0, but still broken. Until we can come to a Pronouncement on what WSGI
*is* in Python 3, it is meaningless anyway.
> Cookie data SHOULD be transparent to the server/gateway; however WSGI is
> going to assume that data is encoded in latin-1.
Yeah. This is no big deal because non-ASCII characters in cookies are
already broken everywhere(*). Given this and other limitations on what
characters can go in cookies, they are habitually encoded using ad-hoc
mechanisms handled by the application (typically a round of URL-encoding).
*: in particular:
- Opera and Chrome send non-ASCII cookie characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8),
mangling any characters that don't fit in the codepage through the
traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
gets through but everything else is mangled)
- Safari refuses to send any cookie containing non-ASCII characters.
> I don't know what the HTTP/Cookie spec says about this.
The traditional interpretation of RFC2616 is that headers are ISO-8859-1.
You will notice that no browser correctly follows this.
...sigh.
--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
Thanks for this summary.
I think it should go in a wiki or in a separate document (like
rationale) to the WSGI spec.
However, this should never happen with cookies, since cookie data is
opaque to the browser, and it MUST send it "as is".
What you describe happens with other headers containing TEXT.
And now I understand that strange behaviour of Firefox with non-latin-1
strings in the username, in HTTP Basic Authentication.
> [...]
Regards Manlio
Right, for WSGI 1.1 on Python 3.x, 8859-1 strings is the plan. Other, more ideologically pure options can be discussed for an incompatible revision of WSGI (e.g. the hypothetical 2.0).
BTW: I hope to have a first draft of the changes by Monday. (But don't beat up on me if it's delayed; I am working on it.)
James
RFCs 2109 and 2965 say that a cookie's value can be anything:
> The VALUE is opaque to the user agent and may be anything the origin
> server chooses to send, possibly in a server-selected printable ASCII
> encoding.
Theoretically you could put something like 'foo\n\0bar' in a cookie.
Also, a cookie can include comments, which have to be encoded using ...
UTF-8:
> Comment=value
> OPTIONAL. Because cookies can be used to derive or store
> private information about a user, the value of the Comment
> attribute allows an origin server to document how it intends to
> use the cookie. The user can inspect the information to decide
> whether to initiate or continue a session with this cookie.
> Characters in value MUST be in UTF-8 encoding.
--
Henry Prêcheur
There is something that I don't understand.
Some HTTP headers, like Accept-Language, contain data described as
`token`, where:
token = 1*<any CHAR except CTLs or separators>
So a token, IMHO, is an opaque string, and it SHOULD NOT be decoded.
In Python 3.x it SHOULD be a byte string.
Text content is described as `TEXT`, where:
The TEXT rule is only used for descriptive field contents and values
that are not intended to be interpreted by the message parser. Words
of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047
[14].
TEXT = <any OCTET except CTLs,
but including LWS>
The only type of data where TEXT can be used is `quoted-string`.
A `quoted-string` only appears in well-specified portions of a header.
So, IMHO, it is *not* correct for WSGI middleware to return all HTTP
headers as Unicode strings.
This is up to the application/framework, which must parse each header,
split it into components and handle each as appropriate (as byte
string, Unicode string or instance of some other data type).
> [...]
Regards Manlio
I think this is more an issue that frameworks should deal with. By
decoding every header's value as latin-1:
* It keeps WSGI simple. Simple is good.
* WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
says. WSGI is about HTTP, but that doesn't necessarily include all
other standards extending HTTP.
* It's possible to convert latin-1 strings to bytes without losing data.
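The last point can be checked directly: ISO-8859-1 maps each of the 256 byte values to the code point with the same number, so the conversion is reversible for any input. A minimal sketch:

```python
# Every possible byte value, decoded as latin-1 and encoded back again.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('latin-1')
round_tripped = as_text.encode('latin-1')

# Each byte maps to the code point with the same numeric value.
code_points = [ord(ch) for ch in as_text]
```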
--
Henry Prêcheur
> Words of *TEXT MAY contain characters from character sets other than
> ISO-8859-1 [22] only when encoded according to the rules of RFC 2047
Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to
RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself
specifically denies that an encoded-word can go in a quoted-string.
RFC2047 encoded-words are not on-topic in an HTTP header(*); this has
been confirmed by newer development work on HTTPbis by Reschke et al.
(http://tools.ietf.org/wg/httpbis/).
The "correct" way of escaping header parameters in an RFC*822-family
protocol would be RFC2231's complex encoding scheme, but HTTP is
explicitly not an 822-family protocol despite sharing many of the same
constructs. See
http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a
strategy for how 2231 should interact with HTTP, but note that for now
RFC2231-in-HTTP simply does not exist in any deployed tools.
So for now there is basically nothing useful WSGI can do other than
provide direct, byte-oriented (even if wrapped in 8859-1 unicode
strings) access to headers.
--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
It is just as simple as using byte strings, IMHO.
It is not simple, it is convenient because of (if I understand
correctly) how code is converted by 2to3.
> * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
> says. WSGI is about HTTP, but that doesn't necessarily includes all
> other standards extending HTTP.
>
HTTP never says to consider whole headers as latin-1 text, IMHO.
> * It's possible to convert latin-1 strings to bytes without losing data.
>
Yes, but it is quite stupid to first convert to Unicode and then convert
again to byte string.
It is true, however, that this does not happen often; only for:
- WSGI applications that implement an HTTP proxy
- WSGI applications that need to support HTTP Digest Authentication
- WSGI applications that store encoded data in cookies
Regards Manlio
Thanks.
HTTPbis seems to fix all these problems:
"Historically, HTTP has allowed field content with text in the ISO-
8859-1 [ISO-8859-1] character encoding and supported other character
sets only through use of [RFC2047] encoding. In practice, most HTTP
header field values use only a subset of the US-ASCII character
encoding [USASCII]. Newly defined header fields SHOULD limit their
field values to US-ASCII characters. Recipients SHOULD treat other
(obs-text) octets in field content as opaque data."
This is the new rule for `quoted-string`:
quoted-string = DQUOTE *( qdtext / quoted-pair ) DQUOTE
qdtext = OWS / %x21 / %x23-5B / %x5D-7E / obs-text
; OWS / <VCHAR except DQUOTE and "\"> / obs-text
obs-text = %x80-FF
quoted-pair = "\" ( WSP / VCHAR / obs-text )
> The "correct" way of escaping header parameters in an RFC*822-family
> protocol would be RFC2231's complex encoding scheme, but HTTP is
> explicitly not an 822-family protocol despite sharing many of the same
> constructs. See
> http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a
> strategy for how 2231 should interact with HTTP, but note that for now
> RFC2231-in-HTTP simply does not exist in any deployed tools.
>
It seems reasonable.
> So for now there is basically nothing useful WSGI can do other than
> provide direct, byte-oriented (even if wrapped in 8859-1 unicode
> strings) access to headers.
>
Yes, this is what I think.
I have some doubts about wrapping the headers in 8859-1 unicode strings,
but luckily there is surrogateescape.
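For reference, a minimal sketch of how the surrogateescape error handler (PEP 383, Python 3.1 and later) preserves undecodable bytes. The header value is made up for illustration.

```python
# A latin-1 encoded value that is not valid UTF-8.
raw = b"caf\xe9"

# With surrogateescape, the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9 instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
recovered = text.encode("utf-8", "surrogateescape")
```

Valid UTF-8 input decodes to ordinary text, while anything else survives the round trip unchanged.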
Regards Manlio
No, it's not. There were lots of discussions regarding this on the
mailing list. One of the main issues is that the standard library
supports bytes poorly. urllib, for example, expects strings, not bytes.
> > * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
> > says. WSGI is about HTTP, but that doesn't necessarily include all
> > other standards extending HTTP.
> >
>
> HTTP never says to consider whole headers as latin-1 text, IMHO.
It does:
When no explicit charset parameter is provided by the sender, media
subtypes of the "text" type are defined to have a default charset value
of "ISO-8859-1" when received via HTTP.
http://tools.ietf.org/html/rfc2616#section-3.7.1
> Yes, but it is quite stupid to first convert to Unicode and then convert
> again to byte string.
99% of the time latin-1 will work. And converting from Unicode to bytes
is not costly.
6 months ago I was a big fan of bytes, but bytes create more problems
than they solve.
--
Henry Prêcheur
I read last month's discussions 3 days ago!
The quote function supports byte strings, as an example.
Which functions do not work with byte strings?
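The point about quote can be checked with a short sketch, assuming Python 3's urllib.parse; the sample value is made up.

```python
from urllib.parse import quote, unquote_to_bytes

# quote() accepts bytes as well as str and always returns a str.
raw = "café".encode("utf-8")
escaped = quote(raw)

# unquote_to_bytes() goes the other way and returns bytes.
restored = unquote_to_bytes(escaped)
```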
>>> * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
>>> says. WSGI is about HTTP, but that doesn't necessarily include all
>>> other standards extending HTTP.
>>>
>> HTTP never says to consider whole headers as latin-1 text, IMHO.
>
> It does:
>
> When no explicit charset parameter is provided by the sender, media
> subtypes of the "text" type are defined to have a default charset value
> of "ISO-8859-1" when received via HTTP.
>
> http://tools.ietf.org/html/rfc2616#section-3.7.1
>
This is not correct.
First of all, HTTP never says that whole headers are of type TEXT.
Only specific components are of type TEXT.
Moreover, HTTPbis has finally clarified this; TEXT is no longer used,
and non-ASCII characters are instead to be considered opaque.
Do you really want to define the new WSGI specification to be "against"
the new (possible) HTTP spec?
Of course it will work; but since some code in the standard library
needs to be fixed (the wsgiref.util.application_uri, as an example),
maybe it is better to fix it to work with byte strings.
Just my two cents.
> [...]
Regards Manlio
Just to make things clear, I was talking about Python 3.
All the functions I tried not ending with _from_bytes raise an exception
with bytes. This includes urllib.parse.parse_qs & urllib.parse.urlparse
which are rather critical ...
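One way to sidestep the problem described above is to decode the query string before parsing it. This is only a sketch with a made-up query string; note also that, as far as I know, Python 3.2 and later accept ASCII-encoded bytes in these functions, so the exception applies to the release discussed here.

```python
from urllib.parse import parse_qs

# Raw query string as bytes; percent-encoding keeps it pure ASCII.
raw_query = b"name=caf%C3%A9&lang=fr"

# Decode to str first, then parse. parse_qs percent-decodes the values
# as UTF-8 by default.
params = parse_qs(raw_query.decode("ascii"))
```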
> First of all, HTTP never says that whole headers are of type TEXT.
> Only specific components are of type TEXT.
If parts of a header contain latin-1 characters, that means its
encoding is latin-1 (at least partially).
> Moreover, HTTPbis has finally clarified this; TEXT is no longer used,
> and non-ASCII characters are instead to be considered opaque.
Yes, but the HTTPbis draft also says:
Historically, HTTP has allowed field content with text in the
ISO-8859-1 character encoding.
And WSGI is not about HTTP in a distant future, it's about HTTP right
now.
> Do you really want to define the new WSGI specification to be "against"
> the new (possible) HTTP spec?
I don't know why it would be "against" it. WSGI aims to handle HTTP in
the real world. Releasing the HTTPbis spec won't take all the garbage
out of the web. There will still be latin-1 strings in headers passed
around for the next 10 years.
--
Henry Prêcheur
I know.
Unfortunately I don't have Python 3 installed; I'm just reading the code.
> All the functions I tried not ending with _from_bytes raise an exception
> with bytes. This includes urllib.parse.parse_qs & urllib.parse.urlparse
> which are rather critical ...
>
Ah, ok.
Can you show me the traceback of parse_qs? Thanks.
>> First of all, HTTP never says that whole headers are of type TEXT.
>> Only specific components are of type TEXT.
>
> If parts of a header contain latin-1 characters, that means its
> encoding is latin-1 (at least partially).
>
This is not completely true.
> [...]
> And WSGI is not about HTTP in a distant future, it's about HTTP right
> now.
>
>> Do you really want to define the new WSGI specification to be "against"
>> the new (possible) HTTP spec?
>
> I don't know why it would be "against" it.
Well, I have quoted it for this reason.
What I mean is that, IMHO:
- Using Unicode strings in WSGI is an abuse of Unicode strings
- This abuse is not justified by the HTTP spec
> [...]
Regards Manlio
On 12/4/09 12:50 AM, And Clover wrote:
> So for now there is basically nothing useful WSGI can do other than
> provide direct, byte-oriented (even if wrapped in 8859-1 unicode
> strings) access to headers.
You could argue that this is perhaps a good reason to replace
``environ`` with something that interprets the headers according to how
HTTP is actually used in the real world.
It may be that WSGI should use bytes everywhere and the recommended
usage would be via a decorator (which could cache computations on the
environ dictionary):
e.g. the raw application handler versus one decorated with an imaginary
``webob`` function.
    def app(environ, start_response):
        ...

    @webob
    def app(request):
        ...
It is often said that WSGI should be practical, but in actual usage, I
think most developers use a request/response abstraction layer.
Middlewares are usually shrink-wrapped library code that could handle a
bytes-based environ dict (they'd have to explicitly decode the headers
of interest).
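To make the decorator idea concrete, here is a minimal sketch. It assumes a hypothetical environ whose header values are bytes, and both the `Request` class and the `webob` decorator are invented for illustration; they are not the API of any real library.

```python
import functools


class Request:
    """Hypothetical wrapper around a bytes-valued environ dict."""

    def __init__(self, environ):
        self.environ = environ
        self._decoded = {}  # cache of decoded header values

    def header(self, name, encoding="latin-1"):
        # Map "User-Agent" to the CGI-style key "HTTP_USER_AGENT".
        key = "HTTP_" + name.upper().replace("-", "_")
        if key not in self._decoded:
            raw = self.environ.get(key)
            self._decoded[key] = None if raw is None else raw.decode(encoding)
        return self._decoded[key]


def webob(handler):
    """Hypothetical decorator turning a request handler into a WSGI app."""

    @functools.wraps(handler)
    def wsgi_app(environ, start_response):
        status, headers, body = handler(Request(environ))
        start_response(status, headers)
        return [body]

    return wsgi_app


@webob
def app(request):
    agent = request.header("User-Agent") or "unknown"
    body = ("Hello, " + agent).encode("utf-8")
    return "200 OK", [("Content-Type", "text/plain; charset=utf-8")], body
```

Only the headers the handler actually asks for are decoded, and each is decoded once; middleware that does not need text can keep working on the raw bytes.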
\malthe