unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
From: Michael Radziej <m...@noris.de>
Date: Fri, 26 Jan 2007 11:12:42 +0100
Local: Fri, Jan 26 2007 5:12 am
Subject: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Hi,

we have a bit of chaos here ... Tickets 3370, 1356 and probably 952
all are about this problem, all are accepted, and #3370 and #1356
have very similar patches. I ask everybody to continue discussion
here in django-developers, and I ask the authors of these three
tickets to work together to find out how to proceed.

I'm posting a notice to django-users and will put a reference in the
tickets.

@core: Please don't close these tickets as duplicates of the
general unicodification at this time; let's see whether we can
find a good solution short of total unicodification, which would take
a long time.

Michael

--
noris network AG - Deutschherrnstraße 15-19 - D-90429 Nürnberg -
Tel +49-911-9352-0 - Fax +49-911-9352-100

http://www.noris.de - The IT-Outsourcing Company


From: Michael Radziej <m...@noris.de>
Date: Fri, 26 Jan 2007 11:15:03 +0100
Local: Fri, Jan 26 2007 5:15 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Oh my, I should have titled this thread "character encoding issues";
it's not really about unicode. I don't want to rename the thread now,
though, since that risks splitting the discussion.

Sorry,

Michael



From: Ivan Sagalaev <Man...@SoftwareManiacs.Org>
Date: Fri, 26 Jan 2007 13:35:59 +0300
Local: Fri, Jan 26 2007 5:35 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

Michael Radziej wrote:
> Hi,

> we have a bit of chaos here ... Tickets 3370, 1356 and probably 952
> all are about this problem, all are accepted, and #3370 and #1356
> have very similar patches. I ask everybody to continue discussion
> here in django-developers, and I ask the authors of these three
> tickets to work together to find out how to proceed.

Right :-). I'll generalize my comment in #3370 here.

There are, in fact, two separate issues.

1. The first one (which #952 was intended to fix) is that we don't have
a notion of a database encoding at all. This is bad because the DB is
as external to Django as the web is, and it can be in any encoding.

     There are two ways of dealing with it:

     - let Django encode data into the charset that the database expects
     - tell the database which encoding Django uses and let it encode
       data into its internal one

     #952 implements the second variant, and it looks like it works
(in fact its author is Julian Tarkhanov, a well-known unicode expert
and advocate in the Russian blogosphere... just giving credit :-) )

     We really should have this regardless of whether Django's internals
use unicode or byte strings.

2. The second issue is the automatic conversion of unicode data for db
backends that don't understand unicode. It has become relevant recently
because people have started to use newforms. If we accept #952 as it is,
then this should be resolved by encoding things into 'utf-8' inside the
backends. If we choose to reimplement database encoding support on the
Django side, then the backend should encode into whatever encoding is
stored in the DATABASE_CHARSET setting.
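
A minimal sketch of that second case, assuming the DATABASE_CHARSET
setting named above (a proposal in this thread, not an existing Django
setting) and two hypothetical helpers, encode_params and decode_row. It
is written for Python 3, where str is the unicode type and bytes is the
byte string:

```python
# DATABASE_CHARSET is the setting proposed in this thread, not a real Django
# setting; both helpers below are hypothetical illustrations.
DATABASE_CHARSET = "cp1251"


def encode_params(params, charset=DATABASE_CHARSET):
    """Encode text parameters to bytes on the way into the database."""
    return tuple(p.encode(charset) if isinstance(p, str) else p for p in params)


def decode_row(row, charset=DATABASE_CHARSET):
    """Decode byte-string columns back to text on the way out."""
    return tuple(c.decode(charset) if isinstance(c, bytes) else c for c in row)


params = encode_params(("Привет", 42))  # text becomes cp1251 bytes, 42 is untouched
row = decode_row(params)                # simulates reading the same values back
```

The point of the sketch is only that the conversion lives in one place at
the backend boundary; the rest of the code never sees byte strings.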

This is what things are like now.


From: "ak" <an...@khalikov.ru>
Date: Fri, 26 Jan 2007 02:42:24 -0800
Local: Fri, Jan 26 2007 5:42 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Guys

The problem is simple, but it was born a very long time ago.
For MySQL 4.1 and higher, this is hardcoded in
django/db/backends/mysql/base.py:
cursor.execute("SET NAMES 'utf8'")
There were lots of tickets and messages in django-users complaining
about this, but in fact they were all ignored.
My company used to run a patched Django installation where
this line was replaced with:
cursor.execute("SET NAMES 'cp1251'")
because all our templates were (and still are in the production
environment) in windows-1251 encoding, so we had to use cp1251 to
talk to the db.
Ticket http://code.djangoproject.com/ticket/952 contains a complete
solution to this problem. I don't know why it was not merged, but
at the moment it does not matter, and here is the reason why:
since the newforms library was born and the decision to use unicode
for clean_data was made, all these patches became unnecessary, because
now developers must either use unicode everywhere (templates, db etc.) or
manually recode all newforms-based forms from unicode to the native
encoding and back. Of course this is stupid and no one will do it, because
it's easier to migrate to utf-8 and forget about the problem.

So, for me the question is this: either newforms doesn't use
unicode to store clean_data and we can keep using 'legacy' character
sets, or Django needs to drop support for all charsets except unicode.
Or it should convert strings back and forth everywhere LOL

Any other opinions ?


From: Michael Radziej <m...@noris.de>
Date: Fri, 26 Jan 2007 11:47:40 +0100
Local: Fri, Jan 26 2007 5:47 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Hi,

here's a summary of what the different tickets are about:

# 952 adds a database client encoding setting,
DATABASE_CLIENT_CHARSET, for mysql and postgresql backends. For
mysql, it uses the given charset in 'SET NAMES' to build the
connection, except for mysql < 4.1. For postgresql, it does a 'SET
CLIENT_ENCODING TO'.

# 1356 sets the charset attribute of the mysql backend connection to
'utf8' for mysql version >= 4.1

# 3370 starts by explaining a traceback within newforms when you use
utf8-encoded values with a form created by form_for_instance and has
a patch that adds 'charset':'utf8' to the kwargs used in
Database.connect() within DatabaseWrapper.cursor()
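
As a rough illustration of what #952 is described as doing, here is a
sketch that builds the per-backend statement from a configurable charset.
The function name client_encoding_sql and its arguments are hypothetical
and are not taken from the patch; only the two SQL statements come from
the summary above.

```python
def client_encoding_sql(backend, charset, server_version=None):
    """Return the statement that sets the connection encoding, or None."""
    # The charset is interpolated into SQL, so accept plain names only.
    if not charset.replace("-", "").replace("_", "").isalnum():
        raise ValueError("unexpected charset name: %r" % charset)
    if backend == "mysql":
        # The summary says SET NAMES is not used for MySQL older than 4.1.
        if server_version is not None and server_version < (4, 1):
            return None
        return "SET NAMES '%s'" % charset
    if backend == "postgresql":
        return "SET CLIENT_ENCODING TO '%s'" % charset
    raise ValueError("unsupported backend: %r" % backend)


print(client_encoding_sql("mysql", "cp1251"))
print(client_encoding_sql("postgresql", "UTF8"))
```

With a setting like DATABASE_CLIENT_CHARSET feeding the charset argument,
the hardcoded 'utf8' discussed earlier in the thread becomes a default
rather than a fixed value.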

Michael Radziej



From: Ivan Sagalaev <Man...@SoftwareManiacs.Org>
Date: Fri, 26 Jan 2007 15:07:10 +0300
Local: Fri, Jan 26 2007 7:07 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

ak wrote:
> Ticket http://code.djangoproject.com/ticket/952 contain a complete
> solution of this problem and I don't know why it was not merged into
> the code but at the moment it is not matter and here is the reason why:
> Since newforms library was born and the decision about using unicode
> for clean_data was made, all these patches became unnecessary

Not at all. Anton, please read the summary I posted in reply to Michael's
first post. Specifying the database encoding and keeping internals in
unicode are two separate issues. #952 is still necessary, but it is not
enough to fix your bug.

> because
> now developers must use only unicode everywhere (templates, db etc)

Actually they shouldn't :-). Newforms is now the only part of Django that
works with unicode. I/O with the web (requests and templates) has been
hotfixed to work with it, in a way. Databases haven't.

> or
> manually recode all forms based on newforms from unicode to native
> encoding and back. Ofcourse this is stupid

Maybe it is. But it's a temporary inconvenience of newforms. Later the
database backend should do this automatically by using either 'utf-8' or
DATABASE_CHARSET, as I described in my earlier message.

BTW, there were ideas here about really really forcing users to migrate
all data into unicode/utf-8 and be the first guy on the block that would
lead the trend. This is noble but hard and if I remember correctly this
was decided against...

> So, for me the quesion sounds like this: either newforms don't use
> unicode to store clean_data and we can keep using 'legacy' character
> sets, or django needs to drop all charsets support except of unicode.
> Or it should convert strings back and forth everywhere LOL

Incidentally, your last 'LOL' is the option that Django has chosen :-).
I'll try to explain.

'Unicode' is not a charset, or, more specifically, it is not represented
with bytes. Python's native unicode strings represent unicode characters
in some internal format that just can't be dumped over the wire, be it
to a database or to the web. Because of this, if Django works
internally in unicode, it must encode everything it writes and decode
everything it reads from outside. Converting from unicode to utf-8 is
also encoding, and it does not happen automatically.

When you say that a db backend supports 'unicode', it actually means that
the db library under the Django backend does the encoding itself. But
whether it's done in the library or in the Django backend, we still need a
setting for the charset. Two settings, actually: one for the web (which we
already have) and one for the db (which is implemented in #952).
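
A small sketch of the two boundaries, assuming DEFAULT_CHARSET for the web
side and a hypothetical DATABASE_CHARSET for the db side (the latter is
only a proposal in this thread). The helper names from_web and to_db are
made up for illustration, and the code is Python 3, where str is the
unicode type:

```python
DEFAULT_CHARSET = "windows-1251"  # charset spoken to the web
DATABASE_CHARSET = "utf-8"        # proposed in this thread; the name is an assumption


def from_web(raw):
    """Decode bytes received from the browser into text."""
    return raw.decode(DEFAULT_CHARSET)


def to_db(text):
    """Encode text into bytes for the database connection."""
    return text.encode(DATABASE_CHARSET)


raw_post = "Цена: 5 €".encode(DEFAULT_CHARSET)  # what the browser would send
text = from_web(raw_post)                        # unicode inside the application
db_bytes = to_db(text)                           # different bytes, same characters
```

Neither step happens on its own: the same nine characters are 9 bytes on
the web side and 15 bytes on the db side, and only the explicit charset
tells each boundary how to convert.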


From: "Julian 'Julik' Tarkhanov" <julian.tarkha...@gmail.com>
Date: Fri, 26 Jan 2007 05:09:31 -0800
Local: Fri, Jan 26 2007 8:09 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

On Jan 26, 1:07 pm, Ivan Sagalaev <Man...@SoftwareManiacs.Org> wrote:

> BTW, there were ideas here about really really forcing users to migrate
> all data into unicode/utf-8 and be the first guy on the block that would
> lead the trend. This is noble but hard and if I remember correctly this
> was decided against...

Spiteful. Those left behind shall overcome their pain and join.

> > So, for me the quesion sounds like this: either newforms don't use
> > unicode to store clean_data and we can keep using 'legacy' character
> > sets, or django needs to drop all charsets support except of unicode.
> > Or it should convert strings back and forth everywhere LOL

> Incidentally you last 'LOL' is the option that Django have chosen :-).

This is about getting expectable bytestrings from the DB, not about
unicodifying Django.

> 'Unicode' is not a charset, or, more specifically, it is not represented
> with bytes. Python's native unicode string represent unicode characters
> in some internal format that just can't be dumped over the wire, be it
> to database or to the web. Because of this if Django would work
> internally in unicode it must encode everything it writes and decode
> everything it reads from outside. Converting from unicode to utf-8 is
> also encoding, and it does not happen automatically.

Python's unicode is actually UTF-16 whereas IO and the databases mostly
speak UTF-8 -
so no, you can't dump it over the wire. We Rubyists are a tad happier
because we now
have all in UTF-8 - but if I was working with Django now I would
actually _mandate_ the following:
- All templates should be UTF-8 (decode on read)
- All code should be native Python Unicode (utf16, I don't know how it
works with BE-LE but the idea of UTF-16 is really anti-interop) or
UTF-8, but I am no Python expert to say whichever is better
- All database adapters have to be verified to return ustrings, and
I can assure you that most of them won't
- Mandate UTF-16 or UTF-8 as client encoding for the database. Does not
matter which encoding is used internally because both Postgres
and MySQL can now encode/decode on the fly (you will just lose
characters if your database is limited)

> for charset. Two settings actually: for the web (that we already have)
> and for db (that is implemented in #952).

I did #952 when experimenting with Django for my own needs. It has
since been abandoned. The solution I made in #952 is the "liberal" one,
but I really don't like it - there's a need for a much more radical
solution. Part of that solution would be telling users who use old 8-bit
crap for code and templates that they are out in the dumps. So feel
free to do whatever you find useful with the patch.

From: Gábor Farkas <ga...@nekomancer.net>
Date: Fri, 26 Jan 2007 14:25:53 +0100
Local: Fri, Jan 26 2007 8:25 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

Julian 'Julik' Tarkhanov wrote:

> Python's unicode is actually UTF-16

sorry, but no. it's not utf-16.

it's decided at compile-time,
and it's either utf-32 or utf-16.

on linux it's usually utf-32, and on windows it's usually (always?) utf-16.

but you should not care about it. you see, in python,
the unicode-strings are a separate data-type, and there's
just no way to take a bytestring, and tell python: "from now on,
you are an unicode-string, because i know that you are encoded in utf-16."

the way it works is that you take a bytestring,
and ask python to convert it into an unicode-string (and you also have
to tell python the bytestring's charset).

so while it might be that the conversion from utf-16-bytestrings to
unicode is sometimes faster than converting from utf-8-bytestrings to
unicode, you can't be sure, because as i wrote above, the internal
unicode-encoding is not fixed.
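
a small sketch of that conversion, written for Python 3 (where the
unicode type is called str and bytestrings are bytes; its internal
storage differs again from the builds discussed here, which does not
change the point). the variable names are only for illustration:

```python
text = "Nürnberg"

# The same text as two different bytestrings.
as_utf8 = text.encode("utf-8")       # 9 bytes: "ü" takes two
as_utf16 = text.encode("utf-16-le")  # 16 bytes: two per character

# A bytestring only becomes text by decoding it with a named charset.
from_utf8 = as_utf8.decode("utf-8")
from_utf16 = as_utf16.decode("utf-16-le")

print(from_utf8 == from_utf16 == text)
```

whichever charset the bytes arrived in, the decoded objects are equal, so
code working on them never needs to know the internal representation.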

> whereas IO and the databases mostly
> speak UTF-8 -
> so no, you can't dump it over the wire.
> We Rubyists are a tad happier
> because we now
> have all in UTF-8

you mean that regexes, and all the methods of the string-class now are
unicode-aware in ruby? :)

gabor


From: Julian 'Julik' Tarkhanov <julian.tarkha...@gmail.com>
Date: Fri, 26 Jan 2007 15:25:51 +0100
Local: Fri, Jan 26 2007 9:25 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

On Jan 26, 2007, at 2:25 PM, Gábor Farkas wrote:

> Julian 'Julik' Tarkhanov wrote:

>> Python's unicode is actually UTF-16

> on linux it's usually utf-32, and on windows it's usually (always?)  
> utf-16.

sorry, I forgot that - it's been a year at least since I last touched
Python (actually it was for the Django test drive)

> but you should not care about it. you see, in python,
> the unicode-strings are a separate data-type, and there's
> just no way to take a bytestring, and tell python: "from now on,
> you are an unicode-string, because i know that you are encoded in  
> utf-16."

segregating ustrings and strings is BBD, I've been saying so for years.
The latest I heard is that the next major Py will abolish bytestrings
for good.

Getting back to the issue we were on, I am still strongly advocating
the "don't go there" approach for anything but Unicode. How it should
be handled in relation to source code is unknown to me (AFAIK Python
has a preamble sort of declaration that you can use to tell the
interpreter which encoding your source is in). I just know you hit
some major pain when you expect ustrings and get bytestrings instead
(and in Python, just as in Perl, only about 30% of the libraries
actually care about what they give you).

> so while it might be, that the conversion from utf-16-bytestrings to
> unicode is sometimes faster thatn converting from utf-8-bytestrings to
> unicode, you can't be sure, because as i wrote above, the internal
> unicode-encoding is not fixed.

>> whereas IO and the databases mostly
>> speak UTF-8 -
>> so no, you can't dump it over the wire.

>> We Rubyists are a tad happier
>> because we now
>> have all in UTF-8

> you mean that regexes, and all the methods of the string-class now are
> unicode-aware in ruby? :)

Regexes have been unicode-aware for some time already, except for case
sensitivity and the class repertoire (which will be fixed when
Oniguruma is there). As for the string methods, we mostly took care of
them with AS::Multibyte (without silly subclassing) and that works
wonders for me. The greatest advantage is that I never have to check
what's coming down the pipe because there's only one String to rule
them all.
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

From: Julian 'Julik' Tarkhanov <julian.tarkha...@gmail.com>
Date: Fri, 26 Jan 2007 15:26:56 +0100
Local: Fri, Jan 26 2007 9:26 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

On Jan 26, 2007, at 11:12 AM, Michael Radziej wrote:

>  I ask everybody to continue discussion
> here in django-developers, and I ask the authors of these three
> tickets to work together to find out how to proceed.

#952 is the most liberal of all because it does not assume anything
about Django's internals; it just tells the binary DB client to
decode/encode behind the scenes so that it returns something meaningful
(not something the server admin decided upon two years ago, say).

From: Julian 'Julik' Tarkhanov <julian.tarkha...@gmail.com>
Date: Fri, 26 Jan 2007 15:28:02 +0100
Local: Fri, Jan 26 2007 9:28 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

On Jan 26, 2007, at 11:47 AM, Michael Radziej wrote:

> # 1356 sets the charset attribute of the mysql backend connection to
> 'utf8' for mysql version >= 4.1

And leaves everyone who wants to operate in 8 bits out in the cold.
Where they actually ought to be anyway, but I tried to stay liberal in
#952, primarily because it's still unknown how the Django authors want
to approach this.



From: Michael Radziej <m...@noris.de>
Date: Sat, 27 Jan 2007 11:22:05 +0100
Local: Sat, Jan 27 2007 5:22 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Hi,

in these tickets, are we talking about the encoding used in the database
connection for the communication between django and the database, the
encoding of how the database stores its data, or the encoding in which
the templates are stored? Or even the encoding used in the http transaction?

I thought it's only the first one. But when I read this thread, you're
touching everything and even ruby, and I'm now completely confused.

Can you help me with a few answers?

1. Are all these tickets really about the connection encoding?

2. If so, what's the problem with using utf8 for the connection for
everybody? I don't see how this would be a problem for anybody who is
using a different encoding for templates, within the database's storage
or elsewhere, since there's no loss in converting anything into utf8. Or
is there?

So long,

Michael


From: "ak" <an...@khalikov.ru>
Date: Sat, 27 Jan 2007 02:23:13 -0800
Local: Sat, Jan 27 2007 5:23 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Guys

Could someone please explain to me what the problem with unicode support
in oldforms was, such that newforms had to be built with unicode inside?
Kick me if I'm wrong, but what is the real reason to convert bytes back and
forth? Religion? I agree with everyone who says that unicode is a
must and 'legacy' charsets are crap, but guys, I already have a BIG
application that was about 80% migrated from other Python frameworks to
Django some time ago, and for legacy reasons it is all in a national
charset, not unicode. Then I found out that oldforms support will be
dropped sooner or later. So we decided to start moving (yes,
moving again!!!) all our code to newforms, and what did we get? We
now have to recode everything to utf-8 and hunt for bugs in
more than 10k lines of our oldforms-based code until everything has moved
to newforms and utf-8. But really, why?
Did anyone who used unicode with oldforms have any problems? I am sure
no one did.
Did anyone who used native encodings with oldforms have any problems
(apart from the one-line patch I described before, or #952)?
No one did.

So guys, please explain to me what the reason was to make me migrate to
unicode.

Django is a web framework for perfectionists with deadlines. I see many
perfectionists here, but what about deadlines?

My opinion is simple: let's decide once whether Django is unicode-only or
supports both unicode and national charsets, and then let's get to work.
If you tell me that from now on there is only the "unicode future", I'll
agree and start searching for bugs and sending patches like #3370.


From: "ak" <an...@khalikov.ru>
Date: Sat, 27 Jan 2007 02:28:07 -0800
Local: Sat, Jan 27 2007 5:28 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Michael, if you read the topic about the euro sign in newforms again, you
will find that this touches everything. Personally I couldn't find a way
to use utf-8 to connect to MySQL and keep using cp1251 in my templates: it
basically doesn't work. With my patch (#3370) and utf8 everywhere, it
does.

From: Ivan Sagalaev <Man...@SoftwareManiacs.Org>
Date: Sat, 27 Jan 2007 17:44:46 +0300
Local: Sat, Jan 27 2007 9:44 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

ak wrote:
> Could someone please explain me what was a problem with unicode support
> in oldforms so newforms have been made with unicode inside ?

I can! The thing is it has absolutely nothing to do with forms, it's
just historical coincidence.

Originally Django was written using byte strings everywhere, and
there was no such thing as a "conversion problem". However, there were
problems with incorrect string operations on byte strings (maxlength
counting, upper/lower casing, etc.). Some time ago there was a decision
to convert Django to work internally with unicode strings and convert
them into byte strings at the boundaries to the web and to the database.
And there was no such thing as newforms at that moment.

And then Adrian started to implement newforms, and he chose to make
its internals unicode, for compatibility with Django's future as I
understand it.

> Kick me if I wrong but what is a real reason to convert bytes back and
> forth ? Religion ?

Reasons are purely technical... I'll list them, but please do read to
the end of the letter before you disagree. I believe you just
misunderstand some things about unicode.

1. Unicode is universal: it can represent all characters.
Without something universal, an app written by a Russian programmer
wouldn't be able to use a library written by a French programmer. This
is why we need unicode.

2. In Python, unicode text can be held either in 'unicode' objects or in
byte strings encoded in utf-8. The problem with utf-8 is that you can't
do string operations on it. For example, you can't cut a month's name to
3 letters just by doing month[0:3], because letters can occupy different
numbers of bytes. This is what unicode objects are for and why Django
should work with unicode internally.
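
To illustrate the month[0:3] example, here is a minimal sketch in
Python 3 (where str is the unicode type and bytes is the byte string).
The Russian month name and the variable names are chosen only for
illustration:

```python
month = "января"                    # text string, 6 characters
month_utf8 = month.encode("utf-8")  # 12 bytes, two per Cyrillic letter

short_text = month[0:3]             # the first three characters
short_bytes = month_utf8[0:3]       # the first three bytes: 1.5 characters

try:
    short_bytes.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    # The slice ended in the middle of a two-byte sequence.
    cut_is_valid = False

print(short_text, cut_is_valid)
```

Slicing the text gives three letters; slicing the utf-8 bytes gives one
letter plus half of the next, which no longer decodes.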

May I recommend my post about unicode and bytes (it's in Russian):
http://softwaremaniacs.org/blog/2006/07/28/unicode-and-bytes/

> I agree with everyone who says that unicode is a
> must and 'legacy' charsets are crap but guys I already have a BIG
> application that was about 80% migrated from other python frameworks to
> django some time ago and for legacy reasons it was all in national
> charset, not unicode.

What gives you the idea that Django won't work with this data? All this
unicode stuff is purely internal. If you want your app to output
windows-1251, set DEFAULT_CHARSET to windows-1251 and data will be
automatically converted from and to it. I believe even newforms already
uses this setting to convert unicode data for templates (if not, it should
just be fixed, and I'm happy to make a patch since I have some free time).

> Then I found that oldforms support will be
> dropped soon or later. So we at here have decided to start moving (yes,
> moving again !!!) all our code to newforms and what we got ? We got
> that we now have to recode everything to utf-8

Surely not :-). I'd say it would be a wise thing to do *eventually*. But
for now you absolutely can keep your templates and Python sources in
windows-1251.

> Did anyone who used unicode with oldform has any problems ? I am sure
> noone did.

In fact nobody used unicode with oldforms. Everything in request.POST,
manipulator.flatten_data and in db models was always byte strings
(except db models with psycopg2).

And there were problems with it. They were just fixed very early (a
couple of them by yours truly).

> So guys please explain me what was a reason to make me to migrate to
> unicode ?

I still think that you're confusing migrating Django internals to
unicode objects and converting your files to utf-8. It's not about the
latter.

> My opinion is simple: let's decide once ether django is for unicode or
> django supports both unicode and national charsets and then let's work.

Sure, Django does and will support national charsets. This is why we have
the DEFAULT_CHARSET setting. Internal unicode just lets Django keep all
the encode/decode stuff localized in two places instead of littered all
over the code.

From: Ivan Sagalaev <Man...@SoftwareManiacs.Org>
Date: Sat, 27 Jan 2007 18:10:48 +0300
Local: Sat, Jan 27 2007 10:10 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

Michael Radziej wrote:
> 1. Are all these tickets really about the connection encoding?

> 2. If so, what's the problem of using utf8 for the connection for
> everybody? I don't see how this would be a problem for anybody who is
> using a different encoding for templates, within the database's storage
> or else, since there's no loss in converting anything into utf8. Or is
> there?

I agree with the 2nd point. You can still run into a theoretical problem
with it in a scenario where the input is richer than the storage:

- a database that internally stores data in a legacy encoding (say
iso-8859-1)
- a web frontend that talks utf-8
- a user enters, say, Russian characters into a form
- data travels as utf-8 right up to the db, which will fail to encode it
in iso-8859-1 because that encoding has no room for Russian characters

But it's indeed a very theoretical case. Most legacy systems use the same
legacy encoding for both backend and frontend, and there would be no
errors in the path: legacy (web) - unicode (newforms) - utf-8 (db
connection) - legacy (db)
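
Both cases can be reproduced with the standard codecs alone. This is only a sketch in current Python (`str` is unicode, `bytes` is a bytestring); the choice of koi8-r as the legacy encoding is an assumption made for the example.

```python
# Failing case: UTF-8 front end, ISO-8859-1 storage.
form_bytes = "Привет".encode("utf-8")   # what the UTF-8 web front end submits
text = form_bytes.decode("utf-8")       # unicode inside the application

try:
    text.encode("iso-8859-1")           # the storage encoding has no Cyrillic
    stored = True
except UnicodeEncodeError:
    stored = False

# Common case: the same legacy encoding on both ends, UTF-8 on the connection.
legacy = "koi8-r"                        # assumed legacy encoding
original = "Привет".encode(legacy)       # legacy (web)
as_unicode = original.decode(legacy)     # unicode (newforms)
on_the_wire = as_unicode.encode("utf-8")  # utf-8 (db connection)
roundtrip = on_the_wire.decode("utf-8").encode(legacy)  # legacy (db)
```

The second path loses nothing because every character that the legacy front end can send is, by construction, representable in the legacy storage.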


philipp.kel...@gmail.com
From: philipp.kel...@gmail.com
Date: Sat, 27 Jan 2007 16:58:28 +0100
Local: Sat, Jan 27 2007 10:58 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Hi Ivan

Thank you very much for making things very clear here.

It seems the whole issue cries out for a unification of the whole django source before the 1.0 release, or do I misinterpret?

Do you know which parts of django still use bytestrings?

greets
Philipp


Ivan Sagalaev
From: Ivan Sagalaev <Man...@SoftwareManiacs.Org>
Date: Sat, 27 Jan 2007 19:53:15 +0300
Local: Sat, Jan 27 2007 11:53 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

philipp.kel...@gmail.com wrote:
> Hi Ivan

> Thank you very much for making things very clear here.

I actually thought I was making everyone angry with my constant bugging
about these things :-)

> Do you know which parts of django still use bytestrings?

A better person to ask is Gábor Farkas, who was about to deal with
unicodification and who actually made patches for newforms to play nice
with templates.

And a better question to ask would be which parts of Django are unicode
already, because it's basically only newforms. The other major parts that
should work internally in unicode -- ORM models and templates -- have
not been converted yet.


ak
From: "ak" <an...@khalikov.ru>
Date: Sat, 27 Jan 2007 09:44:31 -0800
Local: Sat, Jan 27 2007 12:44 pm
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

> Do you know which parts of django still use bytestrings?

As far as I know, only newforms use unicode, and this is why the topic
was born: at the moment newforms work right until you want to put any
non-latin1 character through them to the db. I've done the patch (#3370)
which fixes the issue, but it only works when your templates and your db
are all in utf-8. And Ivan says this is wrong. On the one hand I agree
with him: this is not right. On the other hand, I open the page
http://www.djangoproject.com/documentation/newforms/ and it says: "We
will REMOVE django.oldforms in the release ...". So if I start a new
project based on django, or I extend an existing project, there is a very
strong reason for me to use newforms, BUT they don't work. Confused?
Me too :(

Now I would like someone to explain a few things to me before I start on
the next patches :)
1. newforms use unicode inside
2. the ORM uses str inside
Should we (me or someone else) patch the ORM to make it store everything
as unicode inside too, or should unicode for now stay only inside
newforms, with newforms.model.save() fixed to pass encoded bytestrings
to the models?

And another thing I still don't understand: let's pretend I use
MySQL 4.0 with a national charset and my templates are in the same
charset too. How would this work:

> the path: legacy (web) - unicode (newforms) - utf-8 (db connection) - legacy (db)

especially the part "utf-8 (db connection)"? In this situation we must
convert strings to our app's encoding on the python side because our
legacy db can't do it itself. But we use utf-8 for the connection, so who
should do this conversion, and where?

Ivan Sagalaev
From: Ivan Sagalaev <Man...@SoftwareManiacs.Org>
Date: Sat, 27 Jan 2007 22:09:38 +0300
Local: Sat, Jan 27 2007 2:09 pm
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

ak wrote:
> So if I start a new
> project based on django, or I extend existing project, there is very
> strong reason for me to use newforms, BUT they don't work. Confused ?
> Me too :(

Actually it is exactly like this because newforms are not ready. And
the unicode issue is not the only one. Newforms currently just plain lack
some functionality, for example... And this is not at all strange, since
Django's trunk is a development version and, while stable to run, is not
stable from the API point of view. This is a tough time, like it was just
before magic removal: the old API is going to be obsolete soon and the
new one is not ready. It's up to the developer to make a decision.

ak
From: "ak" <an...@khalikov.ru>
Date: Sat, 27 Jan 2007 12:03:19 -0800
Local: Sat, Jan 27 2007 3:03 pm
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
After some thought I came to the following conclusion: if you guys
want to keep support for legacy charsets, you in fact don't have to
force model objects to be unicode. Firstly, they are passed to
templates and filters, and we can't mix legacy charsets with unicode in
one template. Next, if I don't use unicode, I don't have to write my
python sources (views) in unicode. So I need to be able to pass
string values into my model objects, and my strings are not unicode.

So if everyone agrees, the way is simple:
1. when django loads data from the db and fills in a model object, all
strings have to be encoded according to DEFAULT_CHARSET
2. when django passes data from a form object to a model object, it has
to encode strings according to DEFAULT_CHARSET again

In fact, my patch #3370 is wrong then. Actually, the
newforms.model.save() method should be patched to recode clean_data from
unicode to DEFAULT_CHARSET (if it differs) when passing this data to the
model object. With that we would have everything in place: utf8-based
templates and legacy-charset-based templates would both be correctly
supported, and any national characters would be stored in the db
perfectly, as they are now with oldforms (of course, remember what I
said about #952).

The second required patch is about recoding unicode strings loaded
from the db to DEFAULT_CHARSET (if it differs) when passing them to model
objects, and back from DEFAULT_CHARSET to unicode when we save model
objects to the db. This patch will solve the #952 issue, and again it
will work with both unicode and legacy-charset-based templates.

And even more: if we have a legacy database which doesn't
understand unicode, we can detect this immediately after
connecting to the db and decide on the correct way to decode/encode
strings.

As I see it, this approach fixes all the unicode/charset issues and
answers all the questions. So, if there are no objections, I can write
this patch tomorrow or by Monday.
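
Purely to make the first part of the proposal concrete, here is a rough sketch in current Python (`str` is unicode, `bytes` is a bytestring). `recode_clean_data` is a hypothetical helper, not part of newforms, and koi8-r is an assumed value for `DEFAULT_CHARSET`; it is not taken from any of the patches in the tickets.

```python
DEFAULT_CHARSET = "koi8-r"  # assumed legacy site charset for this example


def recode_clean_data(clean_data, charset=DEFAULT_CHARSET):
    """Encode every unicode value of a form's cleaned data to the site charset.

    Hypothetical helper: non-text values (numbers, dates, ...) pass through.
    """
    return {
        key: value.encode(charset) if isinstance(value, str) else value
        for key, value in clean_data.items()
    }


clean_data = {"title": "Привет", "rating": 5}   # what a form might produce
model_kwargs = recode_clean_data(clean_data)    # what the model would receive
```

Note that after this step the model holds bytestrings, so every consumer of the model has to know which charset they are in.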


Bjørn Stabell
From: "Bjørn Stabell" <bjo...@gmail.com>
Date: Sat, 27 Jan 2007 20:46:20 -0800
Local: Sat, Jan 27 2007 11:46 pm
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
On Jan 28, 4:03 am, "ak" <a...@khalikov.ru> wrote:

> After some thought I came to the following conclusion: if you guys
> want to keep support for legacy charsets, you in fact don't have to
> force model objects to be unicode. Firstly, they are passed to
> templates and filters, and we can't mix legacy charsets with unicode in
> one template. Next, if I don't use unicode, I don't have to write my
> python sources (views) in unicode. So I need to be able to pass
> string values into my model objects, and my strings are not unicode.

> So if everyone agreed, the way is simple:
> 1. when django loads data from db and fills in a model object, all
> strings have to be encoded according to DEFAULT_CHARSET
> 2. when django passes data from form object to model object, it has to
> encode strings according to DEFAULT_CHARSET again

This is quite confusing.  It seems you're advocating decoding/encoding
multiple times.  Being a Norwegian involved in web development in
China, I love Unicode, and I've been fighting with it for 6-7 years.  
This is what I've learned:

1) Unicode != external character encoding.  Programming languages with
Unicode support have an internal unicode representation, and all code
that needs to understand the concept of a "character" deals with this;
e.g., lowercasing, sorting.  You never worry about what this
representation is (you're assuming too much about the programming
language if you do).  Instead you:

  decode from a character encoding (e.g., UTF-8, ISO8859-1, GB18030)
into this representation
  encode this internal representation into a character encoding
UTF-8, UTF-16 are character encodings.  GB18030 is a Chinese character
encoding that is just as capable of representing all the code points
in the Unicode standard, same as UTF-8 and UTF-16.  Older encodings
are usually language/locale specific, so they can only represent a
small subset of the code points (characters) in Unicode.
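
A small sketch of that difference, using only the codecs that ship with current Python (where `str` is the internal unicode representation). The sample text and the `can_encode` helper are made up for the example.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


text = "Blåbær 你好 €"  # Norwegian, Chinese and the Euro sign in one string

# Encodings that cover all of Unicode versus a locale-specific one.
coverage = {
    name: can_encode(text, name)
    for name in ("utf-8", "utf-16", "gb18030", "iso-8859-1")
}

# Encode followed by decode gives back the same internal representation.
roundtrips = all(
    text.encode(name).decode(name) == text
    for name in ("utf-8", "utf-16", "gb18030")
)
```

The three full-coverage encodings produce different bytes for the same text, but decoding any of them yields the same unicode string.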

I'm not sure what "unicoding" and "unicodifying" mean.  Is it decoding
into the internal unicode representation, or the process of making
your code unicode aware and compatible?

Joel has a nicely written intro:
http://www.joelonsoftware.com/articles/Unicode.html

2) Unicode is an all-or-nothing thing (not obvious).  If you try to
use it partly, sometimes, or only somewhere, you'll end up with
UnicodeErrors popping up everywhere and a very inefficient
architecture with multiple encodings/decodings happening during each
request...  Oh, this module doesn't do Unicode, better give it UTF-8,
but then it has to pass something back, which should be of type
unicode, but it doesn't know which character encoding we're using, so
then I have to pass that to it, ... ad nauseam.

3) Doing Unicode is (I think) worthwhile, but it is a tradeoff:
everyone suddenly has to understand and deal with character encoding
issues, and there's a slight performance penalty.  It's practically
impossible to have Unicode without making these tradeoffs.  (That
said, many environments have made these tradeoffs successfully, e.g.,
Java, C#.)  Only doing decoding/encoding at the I/O edges reduces the
pain, however.

Rgds,
Bjorn


ak
From: "ak" <an...@khalikov.ru>
Date: Sat, 27 Jan 2007 22:02:08 -0800
Local: Sun, Jan 28 2007 1:02 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Bjorn, if you read my first messages and especially my patch #3370, you
will find that I suggested that if the guys want to move to unicode,
they had better drop all native encoding support, and so does my patch.
Then people started to answer me that this is wrong. And at the moment
no one is able to explain the whole thing and answer my questions:
1. how do they want to support templates and python code (views/
scripts) in native encodings if django itself is all unicode?
The only way I see is to encode/decode everything at the programmer's
end, and to me this means no native encoding support at all.
2. how do they want to support legacy databases if the db connection
speaks unicode?

Bjørn Stabell
From: "Bjørn Stabell" <bjo...@gmail.com>
Date: Sat, 27 Jan 2007 22:47:25 -0800
Local: Sun, Jan 28 2007 1:47 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
On Jan 28, 2:02 pm, "ak" <a...@khalikov.ru> wrote:

> Bjorn, if you read my first messages and specially my patch #3370, you
> find that I made a suggestion that if the guys want to move to unicode
> they better drop all native encodings support and so does my patch.

You mean require all I/O edge/boundary points to convert to/from
Python unicode strings?  (We'll of course need to support non-UTF
character encodings in databases, files, the web, etc.)

> Then people started to answer me that this is wrong. And at the moment
> no one is able to explain the whole thing and answer my questions:
> 1. how do they want to support templates and python code (views/
> scripts) in native encodings if django itself would be all in unicode.
> The only way i see is to encode/decode everything at programmer's end
> and this means for me no native encodings support at all.

Non-ASCII literals in source code are covered by PEP 263: you declare
the encoding of the source file at the top, e.g.,

  #!/usr/bin/python
  # -*- coding: <encoding name> -*-

With that declaration in place (Python 2.3 and later), unicode literals
are decoded from the declared encoding, so you can write them unescaped:

  s = u"encoded text goes here"     # decoded using the declared encoding

An alternative for literals in code is to surround them with unicode()
and specify the appropriate encoding:

  s = unicode("encoded text goes here", "encoding name")
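
For readers on current Python, where the `unicode` type was renamed `str`, the same conversion is spelled as below. This is only a sketch; the sample bytes and the ISO-8859-1 encoding are assumptions made for the example.

```python
raw = b"Stra\xdfe"  # "Straße" as ISO-8859-1 bytes, e.g. read from a legacy file

# str(bytes, encoding) is the current spelling of unicode(bytes, encoding).
s = str(raw, "iso-8859-1")

# The decode() method on the bytestring does the same thing.
same = raw.decode("iso-8859-1")
```

Either way, the encoding has to be named explicitly because the bytes themselves do not say which encoding they are in.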

An even better way is to externalize all strings in .po files and use
gettext, which has some support for returning unicode strings.

I guess templates could have their character encoding identified
either through a similar mechanism, through a global settings
variable, or just use the system default encoding.

> 2. how do they want to support legacy databases if db connection speaks unicode

I'm not sure I can follow you.  How to configure a database adapter
depends on the database and adapter you're using.  Some can accept
unicode strings; for those that don't I guess you'll need a wrapper of
some sort.

Rgds,
Bjorn


Michael Radziej
From: Michael Radziej <m...@noris.de>
Date: Sun, 28 Jan 2007 08:51:55 +0100
Local: Sun, Jan 28 2007 2:51 am
Subject: Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users
Hi,

ak wrote:

> After some thought I came to the following conclusion: if you guys
> want to keep support for legacy charsets, you in fact don't have to
> force model objects to be unicode. Firstly, they are passed to
> templates and filters, and we can't mix legacy charsets with unicode in
> one template. Next, if I don't use unicode, I don't have to write my
> python sources (views) in unicode. So I need to be able to pass
> string values into my model objects, and my strings are not unicode.

> So if everyone agreed, the way is simple:
> 1. when django loads data from db and fills in a model object, all
> strings have to be encoded according to DEFAULT_CHARSET
> 2. when django passes data from form object to model object, it has to
> encode strings according to DEFAULT_CHARSET again

This thread is moving further and further away from the tickets. I
started it to get some help in deciding how to proceed with these ...

Regarding ak's proposal, this goes against a widely shared agreement
within the python world that applications should internally use unicode
strings (not: utf8 strings) and decode/encode to a bytestring at the
boundaries. The boundary is usually input/output; for database
applications it's the communication between the database backend (e.g.
MySQLdb) and the database. I'm not in a position to make any decisions
for django, but I'm pretty sure that you cannot convince the core
developers to follow your path.

Down to earth and back to tickets, my current understanding is this:

The problem that started the original thread in django-users was that
the MySQLdb backend thought it was using latin-1 encoding for the
connection and therefore could not encode '€', which is in iso-8859-15
but not in iso-8859-1 aka iso-latin-1. Ticket #2896 seems to explain how
this can happen.
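
The encoding side of this is easy to check with the standard codecs. A minimal sketch in current Python (where `str` is unicode); it only demonstrates the codec behaviour, not the MySQLdb code path described in the ticket.

```python
euro = "\u20ac"  # the Euro sign

latin9 = euro.encode("iso-8859-15")  # ISO-8859-15 has a slot for it
utf8 = euro.encode("utf-8")          # UTF-8 can encode any code point

try:
    euro.encode("iso-8859-1")        # ISO-8859-1 has no Euro sign
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

So a connection that believes it speaks latin-1 cannot carry the character, whatever the database itself is able to store.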

In my opinion, each of the three tickets in the subject should solve
this issue, and none tries to cope with templates written in a different
encoding than settings.DEFAULT_CHARSET.

#952 allows using a different encoding on the connection than
settings.DEFAULT_CHARSET. It does this for all backends.

#1356 sets connection.charset in the mysql backend to utf8. This makes
MySQLdb use utf8 encoding, but it's hackish and has been reported
not to work in all environments.

#3370 opens the mysql backend connection with charset='utf8', which
seems a cleaner way to do the same as #1356. It also fixes the __repr__
of models (not sure if this is the best way, but this can be added to
any of the other patches)
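
A rough sketch of what opening the connection with charset='utf8' amounts to, assuming a MySQLdb version whose connect() accepts the `charset` and `use_unicode` keyword arguments. `mysql_connect_kwargs` and the sample credentials are hypothetical, and the actual patch in #3370 may look different.

```python
def mysql_connect_kwargs(name, user, password, host="localhost"):
    """Build keyword arguments for MySQLdb.connect() with a UTF-8 connection.

    Hypothetical helper for illustration; not taken from the Django backend.
    """
    return {
        "db": name,
        "user": user,
        "passwd": password,
        "host": host,
        "charset": "utf8",     # encoding used on the connection
        "use_unicode": True,   # ask for text columns as unicode objects
    }


kwargs = mysql_connect_kwargs("blog", "django", "secret")

# Opening the connection needs MySQLdb and a running server, so it is
# left commented out here:
# import MySQLdb
# connection = MySQLdb.connect(**kwargs)
```

The connection charset is independent of the charset the tables are stored in; the server converts between the two.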

My bottom line is that #952 has a different scope than the other two
tickets, and that #1356 should be closed as a duplicate of #3370. #3370
and #952 can co-exist.

So, would anybody object to closing #1356 and promoting #952 and
#3370 to "Accepted" (which was their state before we started this
discussion)?

Michael

