The ./manage.py dumpdata --format=json command currently produces
unreadable output for non-ASCII characters (they are encoded as
\uxxxx).
Such encoding is not required according to
http://www.ietf.org/rfc/rfc4627.txt
(section 2.5):
"All Unicode characters may be placed within the quotation marks
except for the characters that must be escaped: quotation mark,
reverse solidus, and the control characters (U+0000 through U+001F)."
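To make the difference concrete, here is a minimal sketch using the standard library json module (an assumption on my side: it stands in for django.utils.simplejson, and the sample record is made up). Both outputs should be valid JSON that decodes to the same data; only the readability differs.

```python
import json

# Hypothetical sample record with a non-ASCII value.
record = {"name": u"Привет"}

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
```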
There is a snippet (
http://djangosnippets.org/snippets/2258/ ) with a
quick fix but it is incorrect.
The encoding is performed by django.core.serializers.json.Serializer.
The output is not nice because the ensure_ascii argument is not set
(and it is True by default):

    def end_serialization(self):
        simplejson.dump(self.objects, self.stream,
                        cls=DjangoJSONEncoder, **self.options)

It seems there is no way to pass additional options to the serializer
with the existing dumpdata command:

    return serializers.serialize(format, objects, indent=indent,
                                 use_natural_keys=use_natural_keys)

But wait, we can define custom serializers! This is undocumented but
quite easy:

    # serializers/pretty_json.py
    from django.core.serializers.json import Serializer as JSONSerializer
    from django.core.serializers.json import Deserializer as JSONDeserializer
    from django.core.serializers.json import DjangoJSONEncoder
    from django.utils import simplejson

    class Serializer(JSONSerializer):
        def end_serialization(self):
            simplejson.dump(self.objects, self.stream,
                            cls=DjangoJSONEncoder,
                            ensure_ascii=False, **self.options)

    Deserializer = JSONDeserializer

Then it can be added to SERIALIZATION_MODULES (by the way, this option
is not mentioned in serializer docs):

    SERIALIZATION_MODULES = {'json-pretty': 'serializers.pretty_json'}

After that, I expect this to work:

    ./manage.py dumpdata --format=json-pretty <app_name>

But it doesn't work and fails with a UnicodeEncodeError:

    Traceback (most recent call last):
      File "./manage.py", line 11, in <module>
        execute_manager(settings)
      File "/Users/kmike/envs/planor/src/django/django/core/management/__init__.py", line 438, in execute_manager
        utility.execute()
      File "/Users/kmike/envs/planor/src/django/django/core/management/__init__.py", line 379, in execute
        self.fetch_command(subcommand).run_from_argv(self.argv)
      File "/Users/kmike/envs/planor/src/django/django/core/management/base.py", line 191, in run_from_argv
        self.execute(*args, **options.__dict__)
      File "/Users/kmike/envs/planor/src/django/django/core/management/base.py", line 229, in execute
        self.stdout.write(output)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 184-189: ordinal not in range(128)

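The last line of the traceback can be reproduced in isolation. This is only a sketch with a made-up sample string: it encodes non-ASCII text with the 'ascii' codec explicitly, which is effectively what happens when unicode output reaches a stream that falls back to ASCII.

```python
# Hypothetical sample text; any non-ASCII characters would do.
text = u"Привет"

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Keep the details so they can be inspected after the failure.
    failed_encoding = exc.encoding
    failure_reason = exc.reason
    print("%s: %s" % (failed_encoding, failure_reason))
```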
Management commands write to stdout, and Python's stdout doesn't
perform any meaningful Unicode encoding. Argh.
So self.stdout should be changed to something that understands UTF-8,
e.g.:

    import sys
    import codecs
    from django.core.management.commands.dumpdata import Command as Dumpdata

    class Command(Dumpdata):
        def execute(self, *args, **options):
            stdout = options.get('stdout', sys.stdout)
            options['stdout'] = codecs.getwriter('utf8')(stdout)
            return super(Command, self).execute(*args, **options)

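For anyone who wants to see what the codecs.getwriter wrapper does without running Django, here is a small standalone sketch. The io.BytesIO object is my stand-in for a byte-oriented stdout, and the JSON text is made up.

```python
import codecs
import io

# Stand-in for a byte-oriented stdout (an assumption for this sketch).
raw = io.BytesIO()

# codecs.getwriter returns a StreamWriter class for the given codec;
# instantiating it wraps the byte stream.
writer = codecs.getwriter("utf8")(raw)

# Unicode text goes in, UTF-8 encoded bytes reach the underlying stream.
writer.write(u'{"name": "Привет"}')
encoded = raw.getvalue()
print(repr(encoded))
```

The wrapped stream accepts unicode directly, so the code that calls write() does not have to know about the encoding.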
After these changes the initial problem is solved.
But I think my adventure reveals something that can be improved in
Django itself.
1) The Django docs (
http://docs.djangoproject.com/en/1.2/topics/serialization/#notes-for-specific-serialization-formats
) even mention that "If you're using UTF-8 (or any other non-ASCII
encoding) data with the JSON serializer, you must pass
ensure_ascii=False as a parameter to the serialize() call. Otherwise,
the output won't be encoded correctly."
If I understand correctly, with ensure_ascii=True the serializer
doesn't work for non-ASCII bytestrings and produces ugly output for
unicode strings; with ensure_ascii=False the serializer works for all
bytestrings and produces nice output for unicode strings. So why is
ensure_ascii=True the default? Setting it to False is technically
backwards-incompatible, but I think it is backwards-incompatible in a
good way.
2) If stdout doesn't handle Unicode, wouldn't it be better to wrap
'output' with smart_str or to wrap stdout itself to be UTF-8-capable by
default? django.core.management.base.BaseCommand.execute wraps data
written to self.stderr with smart_str, but this is not the case for
self.stdout.
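To illustrate the first option in 2), here is a rough sketch of encoding the output before it is written. The to_utf8_bytes helper is hypothetical and only imitates the idea behind smart_str; it is not Django's implementation, and io.BytesIO again stands in for stdout.

```python
import io

def to_utf8_bytes(value, encoding="utf-8"):
    """Hypothetical smart_str-like helper: return bytes for any input."""
    if isinstance(value, bytes):
        # Already encoded, pass through untouched.
        return value
    # Text (or any other object) is converted to text, then encoded.
    return str(value).encode(encoding)

# Stand-in for a byte-oriented stdout (an assumption for this sketch).
stream = io.BytesIO()
stream.write(to_utf8_bytes(u"Привет\n"))
stream.write(to_utf8_bytes(b"already bytes\n"))
written = stream.getvalue()
```

With something like this applied to 'output' before self.stdout.write(), the caller would never hand unencoded unicode to the stream.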
After 1) and 2) the initial problem with unreadable fixtures will be
solved automatically, without defining custom serializers or hacking
management commands.
I'm not opening separate issues in the bug tracker because they all
seem to be related (dumpdata improvements, serializer improvements,
BaseCommand unicode handling), the changes are not trivial (encodings
can be hard and stdout is often mysterious), and I want to make sure
I'm not missing a point.