charset encoding and unicode

457 views
Skip to first unread message

smcoll

unread,
Feb 17, 2012, 11:04:22 AM2/17/12
to django-rest-framework
Returning to the issue at
http://groups.google.com/group/django-rest-framework/browse_thread/thread/f1aa2a7e7575f2de/a188fd1967498d2c
- some recap:

Given a value with special characters like 'someñame', the current
implementation of DocumentingTemplateRenderers will all display this
instead for the whole object:

[%d bytes of binary content]

This is because the check used to validate the object is a test
whether it is printable or not:
https://github.com/tomchristie/django-rest-framework/blob/master/djangorestframework/renderers.py#L223

It was suggested to modify the DocumentingTemplateRenderer class (see
https://gist.github.com/1849267) and that has worked fine. i'd like
to see that adopted in the codebase, so i don't have to maintain my
own copy.

Also, the JSONRenderer will display that value as 'somei\u00f1ame'.
It was suggested to make a JSON rendered where ensure_ascii is false
(to make this easy to override, i offer the following: https://gist.github.com/1849103)
but even then, the value will render as 'someñame', which is still
incorrect. i'm not sure what to do there.

If you think these issues warrant a fork and pull request, let me
know.

shannon

Tom Christie

unread,
Feb 17, 2012, 12:29:32 PM2/17/12
to django-res...@googlegroups.com
If you think these issues warrant a fork and pull request, let me know.

Yes. Absolutely they do.
I've looked into it, and it's slightly complicated, tho, there's several different things that ought to happen.

I think there's up to four seperate pull reqs needed here:

Firstly:

The escaping in DocumentingTemplateRenderer should definitely be refactored into an `escape_binary` method.
That'll let it be overridden as needed.  I don't think it needs a separate boolean attribute, just refactoring out the method would be fine.

Secondly:

We need to differentiate between binary and string content in a more intelligent way.
Renderers that return string content should *always* return unicode.
Renderers that return binary content should return bytes. (Or str, which is the same thing in Python 2.x)
I think the template render will already do that.  The json, jsonp, and yaml don't.  I'm not sure about the xml, I don't think it does.
So...

Fix json: Add `ensure_ascii=False` in `dumps()`, wrap the entire result of dumps() up in 'unicode(dumps(...))'
Fix jsonp: Add 'u' prefix on the returned string, and fix json as above.
Fix yaml: Add encoding=None in `safe_dump()`.
Fix xml: Not totally sure.

Thirdly:

After that we can fix up the escape_binary test, so that it does not escape if `isinstance(obj, unicode)`, and it does otherwise.

Finally:

When we render the response, we ought to be setting the charset if it's a unicode object (and not otherwise).


Something like:

    def get_content_type(content, mimetype):
        if isinstance(content, unicode):
            return "%s; %s" % (response.media_type, settings.DEFAULT_CHARSET)
        else:
            return response.media_type

    And generating the final response with:

    content_type = self.get_content_type(content, response.media_type)
    HttpResponse(content, status=response.status, content_type=content_type)

I would love a set of pull requests addressing those!  (Or a single pull req I guess)

smcoll

unread,
Feb 28, 2012, 11:55:20 AM2/28/12
to django-rest-framework
Finally getting to this. i'd like to start with some tests, and
wanted to ask your preference for how i might integrate them.

i figure i could add a test in djangorestframework/tests/renderers.py
that compared an object to the output, such as you have with
_flat_repr for the JSON tests. i'm not sure where to put it- should i
add one to the test for each renderer? i could do something like
change _flat_repr to '{"foo": ["bar", "baz", "niña"]}' but i'm not
sure if i like that because character encoding seems like a distinct
issue.

Aside from testing the one character (ñ) that gave me trouble, i
wonder if there's a more standard way to test unicode rendering.
> https://github.com/tomchristie/django-rest-framework/blob/master/djan...

Tom Christie

unread,
Feb 28, 2012, 1:40:18 PM2/28/12
to django-res...@googlegroups.com
Wouldn't sweat it too much. Testing against one character, just against the JSONRenderer would be a good start.

smcoll

unread,
Feb 28, 2012, 2:32:21 PM2/28/12
to django-rest-framework
OK.

> Thirdly:
>
> After that we can fix up the escape_binary test, so that it does not escape
> if `isinstance(obj, unicode)`, and it does otherwise.

Are you saying we could change the _escape_binary method to something
like this?

    def _escape_binary(self, content):
        if isinstance(content, unicode):
            return content
        else:
            return '[%d bytes of binary content]' % len(content)

smcoll

unread,
Feb 28, 2012, 3:14:38 PM2/28/12
to django-rest-framework
Not familiar enough with unicode to manage any proper tests. But how
does this look to you?

https://github.com/smcoll/django-rest-framework/commit/fa8efcb15cd2993f817f9d68d6468e575ddafe48

Tom Christie

unread,
Mar 15, 2012, 9:28:29 AM3/15/12
to django-res...@googlegroups.com
Looks good to me.

poswald

unread,
Mar 17, 2012, 11:48:39 PM3/17/12
to django-rest-framework
Does this fix the issue with non-latin characters being formatted as
'somei\u00f1ame' in the browsable UI? Since most of our api users are
using non-latin chars, I would love to see this come out decoded for
them at least in the web UI and I'd be willing to test it out as well.



On Mar 15, 10:28 pm, Tom Christie <christie....@gmail.com> wrote:
> Looks good to me.
>
>
>
>
>
>
>
> On Tuesday, 28 February 2012 20:14:38 UTC, smcoll wrote:
>
> > Not familiar enough with unicode to manage any proper tests.  But how
> > does this look to you?
>
> >https://github.com/smcoll/django-rest-framework/commit/fa8efcb15cd299...

Tom Christie

unread,
Mar 18, 2012, 1:09:16 PM3/18/12
to django-res...@googlegroups.com
Yup it's intended to deal with properly encoding outputs both with the browse-able HTML and with other representations.

Tom Christie

unread,
Mar 18, 2012, 1:10:32 PM3/18/12
to django-res...@googlegroups.com
Last thing really is that it needs one or two simple tests.  (Eg check that a unicode snowman character gets properly rendered both in HTML and in JSON.) 

Max Arnold

unread,
Mar 29, 2012, 12:54:45 PM3/29/12
to django-rest-framework, arnold...@gmail.com
On 29 фев, 03:14, smcoll <smc...@gmail.com> wrote:
> Not familiar enough with unicode to manage any proper tests.  But how
> does this look to you?
>
> https://github.com/smcoll/django-rest-framework/commit/fa8efcb15cd299...

I believe these two lines must be removed:

if not all(char in string.printable for char in content):
return '[%d bytes of binary content]'

(see https://github.com/max-arnold/django-rest-framework/commit/a86c547c960c73daad401f0218f1e42f41349c5a)

Max Arnold

unread,
Mar 29, 2012, 1:46:21 PM3/29/12
to django-rest-framework
On Thu, Mar 29, 2012 at 09:54:45AM -0700, Max Arnold wrote:
> On 29 фев, 03:14, smcoll <smc...@gmail.com> wrote:
> > Not familiar enough with unicode to manage any proper tests.  But how
> > does this look to you?

Also these two commits actually fix content-type charset:

https://github.com/max-arnold/django-rest-framework/commit/7171129eb58246b308247122c03a0d3b3cb65e7f
https://github.com/max-arnold/django-rest-framework/commit/54b5104f715e4ef44911c7e8f3c5b0dca8d98c6c

I'm not sure if "else" branch of get_content_type is actually required:

def get_content_type(self, content, mimetype):
if isinstance(content, unicode):
return "%s; charset=%s" % (mimetype, settings.DEFAULT_CHARSET)
else:
return mimetype

Reply all
Reply to author
Forward
0 new messages