We have a bit of chaos here... Tickets #3370, #1356 and probably #952 are all about this problem, all are accepted, and #3370 and #1356 have very similar patches. I ask everybody to continue the discussion here in django-developers, and I ask the authors of these three tickets to work together to find out how to proceed.
I'm posting a notice to django-users and will put a reference in the tickets.
@core: Please don't close these tickets as duplicates of the general unicodification at this time; let's see whether we can find a good solution short of total unicodification, which would take a long time.
Michael
-- noris network AG - Deutschherrnstraße 15-19 - D-90429 Nürnberg - Tel +49-911-9352-0 - Fax +49-911-9352-100
Oh my, I should have called the subject "character encoding issues", it's not really about unicode. Sorry, but I don't want to rename the thread with the danger of splitting the discussions.
Sorry,
Michael Radziej:
> we have a bit of chaos here ... Tickets 3370, 1356 and probably 952 all are about this problem, all are accepted, and #3370 and #1356 have very similar patches. I ask everybody to continue discussion here in django-developers, and I ask the authors of these three tickets to work together to find out how to proceed.
Right :-). I'll generalize my comment in #3370 here.
There are, in fact, two separate issues.
1. First one (that #952 was intended to fix) is that we don't have a notion of a database internal encoding at all. This is bad because DB is as external to Django as the web and it can be in any encoding.
Then there are two ways of dealing with it:
- let Django encode data into the charset that the database expects
- tell the database which encoding Django uses and let it convert the data to its internal encoding
#952 is implemented as the second variant and it looks like it works (in fact its author is Julian Tarkhanov, a well-known unicode expert and advocate in the Russian blogosphere... just giving credit :-) )
We really should have this thing regardless of Django's unicode or byte-string internals.
2. The second issue is the automatic conversion of unicode data for db backends that don't understand unicode. It has become relevant recently because people started to use newforms. If we accept #952 as it is, then this should be resolved by encoding things into 'utf-8' inside the backends. If we choose to reimplement database encoding support on the Django side, then the backend should encode into whatever encoding is stored in the DATABASE_CHARSET setting.
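A rough sketch of what the second issue asks of a backend. This is only an illustration in modern Python 3 terms (where `str` is the unicode type and `bytes` the bytestring), not code from any of the patches; `encode_params` is a hypothetical helper, and `DATABASE_CHARSET` is the setting name proposed above, not an existing one.

```python
# Hypothetical setting: the charset the database connection expects.
DATABASE_CHARSET = "utf-8"


def encode_params(params, charset=DATABASE_CHARSET):
    """Encode every text parameter into `charset`; leave other values untouched."""
    return tuple(
        p.encode(charset) if isinstance(p, str) else p
        for p in params
    )


# The same unicode input becomes different bytes depending on the charset.
utf8_params = encode_params(("N\u00fcrnberg", 1), "utf-8")
cp1251_params = encode_params(("\u041f\u0440\u0438\u0432\u0435\u0442",), "cp1251")
```

With a helper of this kind the choice between "always utf-8" and "whatever DATABASE_CHARSET says" is a single argument, which is why the two variants above differ in policy more than in code.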
The problem is simple, but it was born a very long time ago. For MySQL 4.1 and higher, this is hardcoded in django/db/backends/mysql/base.py:
    cursor.execute("SET NAMES 'utf8'")
There were lots of tickets and messages in django-users complaining about this, but in fact they were all ignored. Personally, my company used to run a patched Django installation where this line was replaced with:
    cursor.execute("SET NAMES 'cp1251'")
because all our templates were (and in the production environment still are) in the windows-1251 encoding, so we had to use cp1251 to talk to the db.
Ticket http://code.djangoproject.com/ticket/952 contains a complete solution to this problem and I don't know why it was not merged into the code, but at the moment it does not matter, and here is the reason why: since the newforms library was born and the decision to use unicode for clean_data was made, all these patches became unnecessary, because now developers must either use only unicode everywhere (templates, db etc.) or manually recode all forms based on newforms from unicode to the native encoding and back. Of course this is stupid and no one will do it, because it's easier to migrate to utf-8 and forget about the problem.
So, for me the question sounds like this: either newforms doesn't use unicode to store clean_data and we can keep using 'legacy' character sets, or Django needs to drop support for all charsets except unicode. Or it should convert strings back and forth everywhere LOL
Here's a summary of what the different tickets are about:
# 952 adds a database client encoding setting, DATABASE_CLIENT_CHARSET, for mysql and postgresql backends. For mysql, it uses the given charset in 'SET NAMES' to build the connection, except for mysql < 4.1. For postgresql, it does a 'SET CLIENT_ENCODING TO'.
# 1356 sets the charset attribute of the mysql backend connection to 'utf8' for mysql version >= 4.1
# 3370 starts by explaining a traceback within newforms when you use utf8-encoded values with a form created by form_for_instance and has a patch that adds 'charset':'utf8' to the kwargs used in Database.connect() within DatabaseWrapper.cursor()
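To make the #952 behaviour described above concrete, here is a minimal sketch written from this summary only, not from the patch itself. `client_charset_sql` and its parameters are hypothetical names, and the charset validation is an addition of this sketch.

```python
import re

# Conservative pattern for charset names, so the value can be interpolated safely.
_CHARSET_RE = re.compile(r"^[A-Za-z0-9_\-]+$")


def client_charset_sql(backend, charset, mysql_version=(5, 0)):
    """Return the SQL that announces the client encoding, or None if nothing is sent."""
    if not charset:
        return None
    if not _CHARSET_RE.match(charset):
        raise ValueError("suspicious charset name: %r" % (charset,))
    if backend == "mysql":
        # Per the summary above, nothing is sent for MySQL older than 4.1.
        if mysql_version < (4, 1):
            return None
        return "SET NAMES '%s'" % charset
    if backend == "postgresql":
        return "SET CLIENT_ENCODING TO '%s'" % charset
    raise ValueError("unsupported backend: %r" % (backend,))
```

A backend would run the returned statement once, right after opening the connection, and skip it when the function returns None.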
Michael Radziej
ak wrote:
> Ticket http://code.djangoproject.com/ticket/952 contain a complete solution of this problem and I don't know why it was not merged into the code but at the moment it is not matter and here is the reason why: Since newforms library was born and the decision about using unicode for clean_data was made, all these patches became unnecessary
Not at all. Anton, read the summary that I posted as a reply to Michael's first post. Specifying the database encoding and keeping internals in unicode are two separate issues. #952 is still necessary, but not enough to fix your bug.
> because now developers must use only unicode everywhere (templates, db etc)
Actually they shouldn't :-). Newforms is now the only part of Django that works with unicode. I/O with the web (requests and templates) is now hotfixed to work with it in a way. Databases aren't.
> or manually recode all forms based on newforms from unicode to native encoding and back. Ofcourse this is stupid
Maybe it is. But it's a temporary inconvenience of newforms. Later the database backend should do this automatically, using either 'utf-8' or DATABASE_CHARSET as I described in my earlier message.
BTW, there were ideas here about really really forcing users to migrate all data into unicode/utf-8 and be the first guy on the block that would lead the trend. This is noble but hard and if I remember correctly this was decided against...
> So, for me the quesion sounds like this: either newforms don't use unicode to store clean_data and we can keep using 'legacy' character sets, or django needs to drop all charsets support except of unicode. Or it should convert strings back and forth everywhere LOL
Incidentally, your last 'LOL' is the option that Django has chosen :-). I'll try to explain.
'Unicode' is not a charset, or, more specifically, it is not represented with bytes. Python's native unicode strings represent unicode characters in some internal format that just can't be dumped over the wire, be it to a database or to the web. Because of this, if Django worked internally in unicode it would have to encode everything it writes and decode everything it reads from outside. Converting from unicode to utf-8 is also encoding, and it does not happen automatically.
When you say that a db backend supports 'unicode', it actually means that the db library under the Django backend does the encoding itself. But whether it's done in the library or in the Django backend, we still need a setting for the charset. Two settings, actually: one for the web (which we already have) and one for the db (which is implemented in #952).
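The "decode everything it reads, encode everything it writes" rule, with one charset per boundary, might look like this. It is only a sketch in Python 3 terms (`str` is unicode, `bytes` is the bytestring); `DATABASE_CHARSET` and `WEB_CHARSET` are illustrative names standing in for the two settings, and the byte literal is made-up sample data.

```python
DATABASE_CHARSET = "cp1251"  # illustrative: what the db connection speaks
WEB_CHARSET = "utf-8"        # illustrative: what is sent to the browser

# Bytes as they might arrive from a cp1251 database connection.
raw_from_db = b"\xcf\xf0\xe8\xe2\xe5\xf2"

# Reading from outside: decode bytes into unicode text.
text = raw_from_db.decode(DATABASE_CHARSET)

# Writing to outside: encode unicode text into bytes for the target boundary.
response_body = text.encode(WEB_CHARSET)

# Writing back to the db uses the db charset again and round-trips exactly.
back_to_db = text.encode(DATABASE_CHARSET)
```

Between the two boundaries there is only unicode text; each charset name shows up exactly once, at the edge where it applies.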
On Jan 26, 1:07 pm, Ivan Sagalaev <Man...@SoftwareManiacs.Org> wrote:
> BTW, there were ideas here about really really forcing users to migrate all data into unicode/utf-8 and be the first guy on the block that would lead the trend. This is noble but hard and if I remember correctly this was decided against...
Spiteful. Those left behind shall overcome their pain and join.
> > So, for me the quesion sounds like this: either newforms don't use unicode to store clean_data and we can keep using 'legacy' character sets, or django needs to drop all charsets support except of unicode. Or it should convert strings back and forth everywhere LOL
> Incidentally you last 'LOL' is the option that Django have chosen :-).
This is about getting predictable bytestrings from the DB, not about unicodifying Django.
> 'Unicode' is not a charset, or, more specifically, it is not represented with bytes. Python's native unicode string represent unicode characters in some internal format that just can't be dumped over the wire, be it to database or to the web. Because of this if Django would work internally in unicode it must encode everything it writes and decode everything it reads from outside. Converting from unicode to utf-8 is also encoding, and it does not happen automatically.
Python's unicode is actually UTF-16 whereas IO and the databases mostly speak UTF-8, so no, you can't dump it over the wire. We Rubyists are a tad happier because we now have everything in UTF-8, but if I were working with Django now I would actually _mandate_ the following:
- All templates should be UTF-8 (decode on read)
- All code should be native Python Unicode (UTF-16; I don't know how it works with BE/LE, but the idea of UTF-16 is really anti-interop) or UTF-8, but I am no Python expert to say which is better
- All database adapters have to be verified for returning ustrings, and I can assure you that most of them won't
- Mandate UTF-16 or UTF-8 as the client encoding for the database. It does not matter which encoding is used internally because both Postgres and MySQL can now encode/decode on the fly (you will just lose characters if your database is limited)
> for charset. Two settings actually: for the web (that we already have) and for db (that is implemented in #952).
I did #952 when experimenting with Django for my own needs. It has since been abandoned. The solution I made in #952 is the "liberal" one, but I really don't like it; there's a need for a much more radical solution. Part of that solution would be telling users who use old 8-bit crap for code and templates that they are out in the dumps. So feel free to do whatever you find useful with the patch.
it's decided at compile-time, and it's either utf-32 or utf-16.
on linux it's usually utf-32, and on windows it's usually (always?) utf-16.
but you should not care about it. you see, in python, the unicode-strings are a separate data-type, and there's just no way to take a bytestring and tell python: "from now on, you are a unicode-string, because i know that you are encoded in utf-16."
the way it works is that you take a bytestring and ask python to convert it into a unicode-string (and you also have to tell python the bytestring's charset).
so while it might be that the conversion from utf-16 bytestrings to unicode is sometimes faster than converting from utf-8 bytestrings, you can't be sure, because as i wrote above, the internal unicode encoding is not fixed.
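a small illustration of the point above, in python 3 spelling (where `bytes` is the bytestring and `str` the unicode-string). it is just a sketch with a made-up sample string: the conversion always names a charset, and the internal storage format never shows up.

```python
text = "N\u00fcrnberg"  # the unicode-string "Nürnberg"

# the same text as two different bytestrings
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

# converting back: you must say which charset the bytes are in
decoded_utf8 = as_utf8.decode("utf-8")
decoded_utf16 = as_utf16.decode("utf-16-le")

# naming the wrong charset does not fail here, it silently gives other text
mojibake = as_utf8.decode("latin-1")
```

both decoded values compare equal to `text`, whatever the interpreter uses internally; only the bytestrings differ.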
> whereas IO and the databases mostly speak UTF-8 - so no, you can't dump it over the wire. We Rubyists are a tad happier because we now have all in UTF-8
you mean that regexes, and all the methods of the string-class now are unicode-aware in ruby? :)
> on linux it's usually utf-32, and on windows it's usually (always?) utf-16.
sorry I forgot that - it's been a year at least since I last touched Python (actually it was for the Django test drive)
> but you should not care about it. you see, in python, the unicode-strings are a separate data-type, and there's just no way to take a bytestring, and tell python: "from now on, you are an unicode-string, because i know that you are encoded in utf-16."
segregating ustrings and strings is BBD, I've been saying it for years. The latest I heard is that the next major Py will abolish bytestrings for good.
Getting back to the issue that we were on, I am still strongly advocating the "don't go there" approach for anything but Unicode. How it should be handled in relation to source code is unknown to me (AFAIK Python has a pre-amble sort of declaration that you can actually use to tell the interpreter which encoding your source is in). I just know you hit some major pain when you expect ustrings and get bytestrings instead (and in Python, just as in Perl, only about 30% of the libraries actually care about what they give you).
> so while it might be, that the conversion from utf-16-bytestrings to unicode is sometimes faster thatn converting from utf-8-bytestrings to unicode, you can't be sure, because as i wrote above, the internal unicode-encoding is not fixed.
>> whereas IO and the databases mostly speak UTF-8 - so no, you can't dump it over the wire.
>> We Rubyists are a tad happier because we now have all in UTF-8
> you mean that regexes, and all the methods of the string-class now are unicode-aware in ruby? :)
Regexes have been unicode-aware for some time already, except for case sensitivity and the class repertoire (which will be fixed when Oniguruma is there). As for the string methods, we mostly took care of them with AS::Multibyte (without silly subclassing) and that works wonders for me. The greatest advantage is that I never have to check what's coming down the pipe because there's only one String to rule them all.
-- Julian 'Julik' Tarkhanov please send all personal mail to me at julik.nl
On Jan 26, 2007, at 11:12 AM, Michael Radziej wrote:
> I ask everybody to continue discussion here in django-developers, and I ask the authors of these three tickets to work together to find out how to proceed.
#952 is the most liberal of all because it does not assume anything about Django's internals; it just tells the binary DB client to decode/encode behind the scenes so that it returns something meaningful (not something the server admin decided upon two years ago, say).
On Jan 26, 2007, at 11:47 AM, Michael Radziej wrote:
> # 1356 sets the charset attribute of the mysql backend connection to 'utf8' for mysql version >= 4.1
And leaves everyone who wants to operate in 8 bits out in the cold. Where they actually ought to be anyway, but I tried to stay liberal in #952, primarily because it's still unknown how the Django authors want to approach this.