deprecations in webob 1.2

Sergey Schetinin

unread,

Sep 20, 2011, 9:04:33 AM9/20/11

to Paste Users

Please reply to this email if any of the following is true:
* if you ever set req.charset to anything but UTF-8
* if you ever need to process requests that have Content-Type:
...;charset= anything but UTF-8
* if you use Request / Response.unicode_errors
* if you ever used Response.request or Response.environ

-Sergey

--
http://self.maluke.com/

Sergey Schetinin

unread,

Sep 20, 2011, 3:23:21 PM9/20/11

to Paste Users

forgot the mailing list

---------- Forwarded message ----------
From: Sergey Schetinin <mal...@gmail.com>
Date: 20 September 2011 22:22
Subject: Re: [Paste] deprecations in webob 1.2
To: Clay Gerrard <clay.g...@gmail.com>

On 20 September 2011 22:01, Clay Gerrard <clay.g...@gmail.com> wrote:

> On Tue, Sep 20, 2011 at 8:04 AM, Sergey Schetinin <mal...@gmail.com> wrote:
>>
>> Please reply to this email if any of the following is true:
>> * if you ever set req.charset to anything but UTF-8
>

> What about Response.charset ?

No intention to change anything about it.

>>
>> * if you ever need to process requests that have Content-Type:
>> ...;charset= anything but UTF-8
>

> Can you elaborate here? What would happen if a client sent "text/html;
> charset=ISO-8859-4" - Can I still read Request.environ['wsgi.input']
> unmodified?

The situation is this: Chris is working on a py3 port of webob and it
touches pretty much everything (this is in the master branch). I am
working on a different branch that drops py25 support but makes no
effort to be py3 compatible (this is in the sergey-1.2-py2-only
branch). I'm just doing the deprecations already scheduled for 1.2 and
just generally refactoring some things. I'm doing it in a different
branch because I use in-development version for some deployments and
keep changes in this branch mostly safe. None of this is intended for
immediate release.

One of the things I wanted to try and see how it works is to hardcode
the request charset to utf-8 (no passing of charset around etc), so I
did that in my branch. I've also added a webob.transcode module with
middleware to transcode form requests in other encoding to utf-8. Most
people will never need it.

As one of the changes the request object bails when someone tries to
set it's charset to anything but utf-8, it also checks the environ on
creation and in the example you've provided it would not even allow
you instantiate it.

So I wanted to see if this experiment can go into the trunk or not.
And if not, I'd like to have a good idea why do we still support
non-utf-8 encodings as they need plenty of arguments passed around and
even though the request object is generally a view on the environ this
is not entirely true wrt the charset. If one accesses req.POST and
changes charset later POST still thinks it has the old charset and
other similar cases.

So, I see a good case for keeping charset is to allow routing of
requests like you've mentioned. So we can't assume that charset is
utf8, but we can, say, do that for form submissions and leave the rest
of the requests alone. That would require the charset check in
Request.text but nowhere else, I think.

>>
>> * if you use Request / Response.unicode_errors
>> * if you ever used Response.request or Response.environ
>>
>

> Openstack swift uses Response.eviron to make some asserts on a piece of
> middleware.

Can you explain please? What sets the Response.environ? What asserts
do you make?

I really want to deprecate those two attrs for a lot of reasons: they
aren't used in webob (get_response sets it, but for no good reason),
they are prone to creating reference cycles (@wsgify does that for
every request, ugh..) etc.

--
http://self.maluke.com/

Sergey Schetinin

unread,

Sep 20, 2011, 3:36:18 PM9/20/11

to Paste Users

> So, I see a good case for keeping charset is to allow routing of
> requests like you've mentioned. So we can't assume that charset is
> utf8, but we can, say, do that for form submissions and leave the rest
> of the requests alone. That would require the charset check in
> Request.text but nowhere else, I think.

Oh wait! There's no Request.text. Neat!

Clay Gerrard

unread,

Sep 20, 2011, 4:02:15 PM9/20/11

to Sergey Schetinin, paste...@googlegroups.com

On Tue, Sep 20, 2011 at 2:22 PM, Sergey Schetinin <mal...@gmail.com> wrote:

On 20 September 2011 22:01, Clay Gerrard <clay.g...@gmail.com> wrote:
> On Tue, Sep 20, 2011 at 8:04 AM, Sergey Schetinin <mal...@gmail.com> wrote:

As one of the changes the request object bails when someone tries to
set it's charset to anything but utf-8, it also checks the environ on
creation and in the example you've provided it would not even allow
you instantiate it.

So I wanted to see if this experiment can go in

to the trunk or not.

That sounds awful. So I guess I would rather it not make it into trunk.

So, I see a good case for keeping charset is to allow routing of

requests like you've mentioned. So we can't assume that charset is

utf8, but we can, say, do that for form submissions and leave the rest

of the requests alone. That would require the charset check in

Request.text but nowhere else, I think.

Sounds reasonable.

I really want to deprecate those two attrs for a lot of reasons: they
aren't used in webob (get_response sets it, but for no good reason),
they are prone to creating reference cycles (@wsgify does that for
every request, ugh..) etc.

Right, in get_response. The test where I saw this is just asserting the middleware under test correctly set some expected adhoc attrs on the request on it's way into the app... I'd have to think about use case that would _require_ a resp having a copy of the request's environ... but you asked if any one was using them...

in this case - "testing wsgi apps" - it just seems kinda handy to be able to peak into and verify potential modification's of the original request's environ if you get a copy on the way out...

I can see the potential for reference cycles...

-clayg

Sergey Schetinin

unread,

Sep 20, 2011, 4:16:38 PM9/20/11

to Clay Gerrard, paste...@googlegroups.com

>> As one of the changes the request object bails when someone tries to
>> set it's charset to anything but utf-8, it also checks the environ on
>> creation and in the example you've provided it would not even allow
>> you instantiate it.
>>
>> So I wanted to see if this experiment can go in
>>
>> to the trunk or not.
>
> That sounds awful. So I guess I would rather it not make it into trunk.

Yeah, but if we only do that check for form submissions, maybe even
not in constructor but in POST, that could work.

>> I really want to deprecate those two attrs for a lot of reasons: they
>> aren't used in webob (get_response sets it, but for no good reason),
>> they are prone to creating reference cycles (@wsgify does that for
>> every request, ugh..) etc.
>
> Right, in get_response. The test where I saw this is just asserting the
> middleware under test correctly set some expected adhoc attrs on the request
> on it's way into the app... I'd have to think about use case that would
> _require_ a resp having a copy of the request's environ... but you asked if
> any one was using them...

So it is used in tests only?

So if it sets the ad-hoc attrs (as in env['webob.adhoc'] or whatever
that key is) you can check for them by creating a new Response(env)
instance. In fact that would make more sense, as if the check goes to
response.request it checks if the attrs were set but not if the
wrapped app can see them. I would argue that maybe the best way to
write a test like that is to do the checks in the wrapped app, maybe
through another, deeper layer of middleware.

> in this case - "testing wsgi apps" - it just seems kinda handy to be able to
> peak into and verify potential modification's of the original request's
> environ if you get a copy on the way out...
> I can see the potential for reference cycles...

But consider this: get_response calls the app with environ, not
request object, so the app itself never sees the exact request object,
yet the response object generated gets the request attached. Also, I
don't see why would you have access to the return value of
req.get_response, but not the req itself.

I understand that you just replied to the question if anybody is using
that attr and my point is that this deprecation is probably good for
the code that uses it.

Anyway. Seems like the content-type check should be relaxed somewhat /
deferred to .POST but overall the deprecation don't look too bad so
far.

If anyone here feels that they will be affected by them, please speak up.

-Sergey

--
http://self.maluke.com/

Clay Gerrard

unread,

Sep 20, 2011, 5:04:59 PM9/20/11

to Sergey Schetinin, paste...@googlegroups.com

On Tue, Sep 20, 2011 at 3:16 PM, Sergey Schetinin <mal...@gmail.com> wrote:

> That sounds awful. So I guess I would rather it not make it into trunk.

Yeah, but if we only do that check for form submissions, maybe even
not in constructor but in POST, that could work.

Awesome, thanks.

I would argue that maybe the best way to
write a test like that is to do the checks in the wrapped app, maybe
through another, deeper layer of middleware.

I had the same thought.

I think it was just getting used because it was there... Not to throw anyone under the bus, but I didn't personally write those tests - I was just doing some greps ;)

Also, I
don't see why would you have access to the return value of
req.get_response, but not the req itself.

Yes, I would hope you would. Unless, idk, maybe something "deeper" in the wsgi pipeline was using req.copy()? So now your reference to req doesn't contain any of the changes made to the copy? Seems like in that case you might be hozed anyway... I grasping here :P

Anyway. Seems like the content-type check should be relaxed somewhat /
deferred to .POST but overall the deprecation don't look too bad so
far.

If anyone here feels that they will be affected by them, please speak up.

Someone else will have to raise a flag if that's not good enough - I'm convinced this is probably ok.

Thanks for the heads up!

-clayg

Chris McDonough

unread,

Sep 20, 2011, 5:32:34 PM9/20/11

to Sergey Schetinin, Paste Users

On Tue, 2011-09-20 at 16:04 +0300, Sergey Schetinin wrote:
> Please reply to this email if any of the following is true:
> * if you ever set req.charset to anything but UTF-8
> * if you ever need to process requests that have Content-Type:
> ...;charset= anything but UTF-8
> * if you use Request / Response.unicode_errors
> * if you ever used Response.request or Response.environ

My concern with hardcoding charset to UTF-8 is not so much with requests
that are generated as a result of clicking on a form post button from
the same application that uses WebOb, but with a WebOb-based system
handling a form post from some other system that generates a non-UTF8
form encoding. I don't think adding middleware to handle the
transcoding is the right solution here, because the folks that do this
just won't know what's wrong until it's too late.

- C

Sergey Schetinin

unread,

Sep 21, 2011, 2:09:11 AM9/21/11

to Chris McDonough, Paste Users

But consider this: browsers never set the charset parameter in the
content-type header and that means unless the user explicitly sets
req.charset himself, all forms are treated as UTF-8. So if someone
depends on the non-utf-8 charsets they almost certainly know it. So
it's a change from one explicit solution to a different one.

And there are important benefits to this. One is that currently
charset is another dimension of variability for every test we have
(but we just assume the common case) and after the change it becomes
an invariant for every test and things are bright again.

--
http://self.maluke.com/

Chris McDonough

unread,

Sep 22, 2011, 4:35:37 AM9/22/11

to Sergey Schetinin, Paste Users

On Wed, 2011-09-21 at 09:09 +0300, Sergey Schetinin wrote:
> On 21 September 2011 00:32, Chris McDonough <chr...@plope.com> wrote:
> > On Tue, 2011-09-20 at 16:04 +0300, Sergey Schetinin wrote:
> >> Please reply to this email if any of the following is true:
> >> * if you ever set req.charset to anything but UTF-8
> >> * if you ever need to process requests that have Content-Type:
> >> ...;charset= anything but UTF-8
> >> * if you use Request / Response.unicode_errors
> >> * if you ever used Response.request or Response.environ
> >
> > My concern with hardcoding charset to UTF-8 is not so much with requests
> > that are generated as a result of clicking on a form post button from
> > the same application that uses WebOb, but with a WebOb-based system
> > handling a form post from some other system that generates a non-UTF8
> > form encoding. I don't think adding middleware to handle the
> > transcoding is the right solution here, because the folks that do this
> > just won't know what's wrong until it's too late.
>
> But consider this: browsers never set the charset parameter in the
> content-type header and that means unless the user explicitly sets
> req.charset himself, all forms are treated as UTF-8. So if someone
> depends on the non-utf-8 charsets they almost certainly know it. So
> it's a change from one explicit solution to a different one.

Right now, if someone has an third-party-system integration requirement,
he can decide, on a case-by-case basis, whether to set req.charset to
the charset that form values are actually encoded in when the third
party system renders a form.

For example, let's say a third-party system renders a form that looks
like this:

When that form is POSTed to the "aview" handler in a webob-using system,
its form values will be sent encoded using iso-8859-1. Today, if you
know this, you can do this in the webob-using system:

def aview(request):
request.charset = 'latin-1'
vals = request.POST

.. and everything works.

OTOH, if they set up middleware like you've created at
https://github.com/Pylons/webob/blob/sergey-1.2-py2-only/webob/transcode.py, it won't actually work, because, like you've said, browsers don't actually set the charset in the content-type header of requests they generate. So even with that middleware in place, it will be considered a UTF-8 request.

So where would someone inject logic to change that? By adding another
piece of middleware that rewrites the content-type header based on the
PATH_INFO? In my experience, every time I need to tell someone to "just
use middleware", I end up in a multi-hour conversation with them about
what middleware is, how it works, where it goes, how to configure it,
how to install it.. it's not much fun. I'd rather just tell them to set
request.charset.

> And there are important benefits to this. One is that currently
> charset is another dimension of variability for every test we have
> (but we just assume the common case) and after the change it becomes
> an invariant for every test and things are bright again.

I agree it does simplify things, but I think it simplifies them at the
expense of abandoning useful functionality.

- C

Sergey Schetinin

unread,

Sep 22, 2011, 5:03:46 AM9/22/11

to Chris McDonough, Paste Users

It usually does, but it relies on request.POST not being accessed
before that. If the caller already touched request.POST setting
charset like that will not change the the charset used by the POST
object (because it's cached). But of course this is minor and can be
fixed by invalidating that cache.

> OTOH, if they set up middleware like you've created at
> https://github.com/Pylons/webob/blob/sergey-1.2-py2-only/webob/transcode.py, it won't actually work, because, like you've said, browsers don't actually set the charset in the content-type header of requests they generate. So even with that middleware in place, it will be considered a UTF-8 request.
>
> So where would someone inject logic to change that? By adding another
> piece of middleware that rewrites the content-type header based on the
> PATH_INFO? In my experience, every time I need to tell someone to "just
> use middleware", I end up in a multi-hour conversation with them about
> what middleware is, how it works, where it goes, how to configure it,
> how to install it.. it's not much fun. I'd rather just tell them to set
> request.charset.
>
>> And there are important benefits to this. One is that currently
>> charset is another dimension of variability for every test we have
>> (but we just assume the common case) and after the change it becomes
>> an invariant for every test and things are bright again.
>
> I agree it does simplify things, but I think it simplifies them at the
> expense of abandoning useful functionality.

That's not final code, it's more of a draft to see what it takes to
transcode the request. There can be plenty of ways to do this, like
transcode_mw(app, charset='iso-8859-1') around the app that was
setting the charset. Or like this:

req2 = Request(transcode_env(req.environ, 'latin-1'))

Or maybe even this:

req.decode_request('latin-1')

That one can transcode the request in place or return a new one. I
think we can figure out a way to do this so that the API is still
simple (no need for middlewares) and the internals are simplified as
well.