Swedish characters in Python strings

Urban Anjar

Oct 12, 2002, 1:44:29 PM
Hi,
I have found something that looks like a bug, or at least a not very
pleasant feature. In Swedish we often use the characters å, ä and ö (a
with a ring, a with two dots and o with two dots), and I can't get them
to work properly in Python.

Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> S = 'abc'
>>> print S
abc
>>> print len(S)
3

That is perfectly OK, but...

>>> S = 'åäö'
>>> print S
åäö
>>> print len(S)
6

Look at this code snippet:

#!/usr/bin/python
def rev(S):
    if S:
        return S[-1] + rev(S[:-1])
    else:
        return ''

str = 'abcåäö'
print rev(str)

Running it gives:
[urban@falcon urban]$ ./rev
?äå?cba

Of course I can analyze in detail how the characters are represented and
make some kind of workaround, but I think this is not the Python way. In
assembler or C I have to think of things like that, but do I have to do
that in Python?

Another example:

>>> L = ['Åke','Ärla','Östen']
>>> print L
['\xc3\x85ke', '\xc3\x84rla', '\xc3\x96sten']

Please let me know if I am doing something wrong, or if you too think
of this as a bug.

Sincerely,
Urban Anjar

Stefan

Oct 12, 2002, 3:17:14 PM
Urban Anjar wrote:
> Hi,
> I have found something that looks like a bug, or at least a not so
> pleasant feature. In Swedish we often use the characters å, ä and ö (a
> with a ring, a with two dots and o with two dots) and I don't get them
> to work perfectly
> well in Python.
>
> Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
> [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> >>> S = 'abc'
> >>> print S
> abc
> >>> print len(S)
> 3
>
> That is perfectly OK, but...
>
> >>> S = 'åäö'
> >>> print S
> åäö
> >>> print len(S)
> 6

å, ä and ö cannot be represented as single bytes in UTF-8; each of them
takes two bytes, which is why len() reports 6 for a three-letter string.

Learn more about Unicode on this page:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
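
A minimal sketch of the byte counts, assuming a modern Python 3 interpreter (the session in this thread is Python 2.2, where a plain string is a byte string). It only illustrates that the same three letters occupy a different number of bytes depending on the encoding; the variable names are made up for the example.

```python
# Python 3: str holds code points, bytes holds encoded octets.
text = "\u00e5\u00e4\u00f6"                 # the three letters å, ä, ö

utf8_bytes = text.encode("utf-8")           # two bytes per letter
latin1_bytes = text.encode("iso-8859-1")    # one byte per letter

print(len(text))          # number of characters
print(len(utf8_bytes))    # number of bytes in UTF-8
print(len(latin1_bytes))  # number of bytes in ISO-8859-1
```

So a len() of 6 is what one would expect if the interpreter was handed the UTF-8 bytes and counted bytes rather than characters.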

Fredrik Lundh

Oct 13, 2002, 7:22:36 AM
Urban Anjar wrote:

> That is perfectly OK, but...
>
> >>> S = 'åäö'
> >>> print S
> åäö
> >>> print len(S)
> 6

on all machines I have access to, I get:

>>> S = "åäö"
>>> print S
åäö
>>> print len(S)
3

check the locale settings; to minimize the pain, make sure you use
an 8-bit encoding (e.g. ISO-8859-1) and not a designed-for-internal-
use-only variable-width encoding like UTF-8.

with UTF-8, your operating system is messing things up before Python
gets a chance to look at the characters (most likely, Python gets 6
characters from the keyboard, and sends 6 characters to the console).
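
A rough sketch of what probably happens to the reversed string, written for Python 3 (an assumption; the original script ran on Python 2.2). The byte string stands in for what a UTF-8 terminal would hand to the interpreter, and the names are invented for illustration.

```python
# The input as a UTF-8 terminal would deliver it: 3 ASCII bytes + 3 two-byte letters.
typed = "abc\u00e5\u00e4\u00f6".encode("utf-8")

# Reversing the byte string reverses bytes, not characters, so the first and
# last two-byte sequences are split apart.
reversed_bytes = typed[::-1]

# Bytes that no longer form valid UTF-8 are replaced by U+FFFD here;
# a terminal typically shows such bytes as "?".
shown = reversed_bytes.decode("utf-8", errors="replace")

print(len(typed))
print(shown)
```

The two inner letters survive only by accident: after the reversal their lead and continuation bytes happen to pair up again, which would match the "?äå?cba" output reported earlier.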

if you cannot get RedHat to behave intelligently, use a decent editor
instead.

(avoiding RedHat 8.0 might also help. based on the kind of bugs I've
experienced this far, 8.0 might qualify as the worst unix-like operating
system ever released...)

</F>


Urban Anjar

Oct 13, 2002, 3:30:46 PM
"Fredrik Lundh" <fre...@pythonware.com> wrote in message news:<0Icq9.2091$MV.8...@newsc.telia.net>...
(...)

> check the locale settings; to minimize the pain, make sure you use
> an 8-bit encoding (e.g ISO-8859-1) and not a designed-for-internal-
> use-only variable-width encoding like UTF-8.
>
> with UTF-8, your operating system is messing things up before Python
> gets a chance to look at the characters (most likely, Python gets 6
> characters from the keyboard, and sends 6 characters to the console).

I'll have to try that too. I found another solution that IMHO
is very kludgy... I thought this could be a simple example
for beginners, but it grew into almost a whole-weekend hack...

8<---------------------------------
#!/usr/bin/python
# Fixed some åäö-problems
import sys

def rev(S):
    if S:
        return S[-1] + rev(S[:-1])
    else:
        return ''

arg_list = sys.argv[1:]

for str in arg_list:
    str = unicode(str, "utf-8")      # ugly fix 1
    str = rev(str)
    print str.encode("utf-8"),       # ugly fix 2
print

8<---------------------------------
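
For comparison, a sketch of the same decode, reverse, encode pattern on Python 3 (an assumption; the script above targets Python 2.2). The list `raw_args` is a hypothetical stand-in for UTF-8 bytes arriving from outside the program; in Python 3 `sys.argv` is normally already decoded to `str`, so the explicit step is shown only to make the boundary visible.

```python
def rev(s):
    """Recursively reverse a text string, one character at a time."""
    if s:
        return s[-1] + rev(s[:-1])
    return ""

# Hypothetical input: UTF-8 encoded bytes received from outside the program.
raw_args = [b"abc\xc3\xa5\xc3\xa4\xc3\xb6"]

results = []
for raw in raw_args:
    text = raw.decode("utf-8")                   # bytes -> characters on the way in
    reversed_text = rev(text)                    # work on characters only
    results.append(reversed_text.encode("utf-8"))  # characters -> bytes on the way out

print(results)
```

The two "ugly fixes" are the same idea: decode once at the input boundary, encode once at the output boundary, and keep everything in between as text.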

I think Python should take care of that kind of conversion
behind the scenes...

By the way, is there a method to test strings for how they are
encoded before messing with them in a program?

Urban


PS


> (avoiding RedHat 8.0 might also help. based on the kind of bugs I've
> experienced this far, 8.0 might qualify as the worst unix-like operating
> system ever released...)

I sort of like it anyway, but waiting for 8.1 or 8.2 may be a good
idea. They have changed a lot, including going from Python 1.5.2 to 2.2.1.
But this is not the right place for an OS war...

/U

Martin v. Loewis

Oct 13, 2002, 3:38:40 PM
urban...@hik.se (Urban Anjar) writes:

> I think Python should take care of that kind of conversions
> behind the scene...

Python cannot really do this, both because it is not possible, and
because it would break tons of existing code.

> By the way, is there a method to test strings for how they are
> coded before messing with them in a program.

That's part of the problem: this is not possible. If it were possible,
Python would be doing it behind the scene.

Regards,
Martin

Brian Quinlan

Oct 13, 2002, 5:13:38 PM
Urban wrote:
> I think Python should take care of that kind of conversions
> behind the scene...

How would Python do this?

> By the way, is there a method to test strings for how they are
> coded before messing with them in a program.

No. Given a sequence of bytes, it is pretty tough to accurately
determine the encoding. There are libraries that try, but they don't
always succeed.
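
To make the difficulty concrete, here is a small Python 3 sketch (an assumption; not code from this thread) of the usual heuristic: try a strict decoder first and fall back to an 8-bit one. The function name and the candidate list are hypothetical, and the result is a guess, not a detection.

```python
def decode_with_fallback(data, candidates=("utf-8", "iso-8859-1")):
    """Return (text, codec_name) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fit the data")

latin1_title = b"K\xf6ttbullar"                   # not valid UTF-8
utf8_title = "K\u00f6ttbullar".encode("utf-8")    # valid UTF-8

print(decode_with_fallback(latin1_title)[1])
print(decode_with_fallback(utf8_title)[1])
```

The fallback never fails, because every byte value is a valid ISO-8859-1 character. That is exactly why the answer cannot be trusted: bytes written in some other 8-bit encoding would also "succeed" and come out as the wrong letters.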

Cheers,
Brian


maxm

Oct 14, 2002, 8:00:55 AM
Fredrik Lundh wrote:

> (avoiding RedHat 8.0 might also help. based on the kind of bugs I've
> experienced this far, 8.0 might qualify as the worst unix-like operating
> system ever released...)

Do you have any specifics or links to anything? My win2K server just
made metal dust out of the c: drive so I was about to install 8.0 on a
new disk instead.

Would you advise against it?

regards Max M

Magnus Heino

Oct 13, 2002, 9:39:50 AM

> check the locale settings; to minimize the pain, make sure you use
> an 8-bit encoding (e.g ISO-8859-1) and not a designed-for-internal-
> use-only variable-width encoding like UTF-8.

Still, all new RH8 installs do use UTF-8, and there must be a good reason
for that, and I guess it's something they will do for a while now...



> with UTF-8, your operating system is messing things up before Python
> gets a chance to look at the characters (most likely, Python gets 6
> characters from the keyboard, and sends 6 characters to the console).

Is that the reason in this case? This is RH8.0, en_US.utf-8

Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import MP3Info
>>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
>>> title
'K\xf6ttbullar i n\xe4san'
>>> print title
K?ttbullar i n?san
>>> type(title)
<type 'str'>
>>> print title
K?ttbullar i n?san
>>> print u'K\xf6ttbullar i n\xe4san'.encode('utf-8')
Köttbullar i näsan
>>>

> (avoiding RedHat 8.0 might also help. based on the kind of bugs I've
> experienced this far, 8.0 might qualify as the worst unix-like operating
> system ever released...)

Besides this stuff, I think it's really nice..

--

/Magnus

Martin v. Löwis

Oct 15, 2002, 11:20:56 AM
Magnus Heino <magnus...@pleon.sigma.se> writes:

> > check the locale settings; to minimize the pain, make sure you use
> > an 8-bit encoding (e.g ISO-8859-1) and not a designed-for-internal-
> > use-only variable-width encoding like UTF-8.
>
> Still, all new RH8 installs do use utf-8, and there must be a good reason
> for that, and I guess its something they will do for a while now...

I disagree with Fredrik that UTF-8 is for internal use only. Using
UTF-8 locales is the only way to solve several aspects of Unix
internationalization, in particular supporting non-ASCII file names,
and supporting non-ASCII configuration files (specifically /etc/passwd).

> >>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
> >>> title
> 'K\xf6ttbullar i n\xe4san'
> >>> print title
> K?ttbullar i n?san

In this case, it appears that the title in the MP3 file is encoded in
Latin-1, not in UTF-8. Your terminal expects UTF-8. The data you print
are invalid UTF-8, so the terminal refuses to display them. To print
the data properly in your terminal, do

print unicode(title, "iso-8859-1").encode("utf-8")

Again, there is nothing that Python can do about that: It is not
possible to know what encoding title has - it could just as well be,
say, KOI8-R (in which case \xf6 would be CYRILLIC CAPITAL LETTER ZHE,
not LATIN SMALL LETTER O WITH DIAERESIS).
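
The ambiguity can be checked directly. The following is a Python 3 sketch (an assumption; the one-liner above is Python 2) that uses only the standard `unicodedata` module and the standard codecs; the variable names are illustrative.

```python
import unicodedata

title = b"K\xf6ttbullar i n\xe4san"   # the byte string from the quoted session

# The same byte 0xF6 names a different character in each 8-bit encoding.
for codec in ("iso-8859-1", "koi8_r"):
    print(codec, unicodedata.name(title[1:2].decode(codec)))

# The bytes are not valid UTF-8, so a UTF-8 terminal cannot render them as-is.
try:
    title.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)

# Python 3 counterpart of the fix: decode as Latin-1, then encode as UTF-8.
as_utf8 = title.decode("iso-8859-1").encode("utf-8")
print(len(title), len(as_utf8))
```

Both decodings succeed without complaint, so nothing in the bytes themselves says which one the author of the file intended.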

> Besides this stuff, I think it's really nice..

I find it a reasonable decision to suggest that users use UTF-8 as their
default encoding, irrespective of the language they
speak. Unfortunately, many applications are not really prepared for
multi-byte encodings, but those applications must be corrected.

Regards,
Martin

Bengt Richter

Oct 15, 2002, 4:09:55 PM
On Sun, 13 Oct 2002 15:39:50 +0200, Magnus Heino <magnus...@pleon.sigma.se> wrote:
>Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
>[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
>Type "help", "copyright", "credits" or "license" for more information.
>>>> import MP3Info
>>>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
>>>> title
>'K\xf6ttbullar i n\xe4san'
>>>> print title
>K?ttbullar i n?san
>>>> type(title)
><type 'str'>
>>>> print title
>K?ttbullar i n?san
>>>> print u'K\xf6ttbullar i n\xe4san'.encode('utf-8')
>Köttbullar i näsan
>>>>
[OT] LOL! Is that a real title?

Regards,
Bengt Richter

Brian Quinlan

Oct 15, 2002, 4:31:26 PM
Magnus wrote:
: >>> import MP3Info

: >>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
: >>> title
: 'K\xf6ttbullar i n\xe4san'

A quick look at the ID3 specification reveals that Unicode was
not introduced until ID3v2, so determining the encoding before that was
not possible.

With ID3v2, the possible encodings are ISO-8859-1, UTF-16 + BOM,
UTF-16BE (no BOM) and UTF-8.

When reading ID3v2, MP3Info should present the title as a Unicode object
(though I have no idea if it actually does this or not).

When reading ID3v[0,1], MP3Info would have no choice but to present the
title as a byte string.
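
As a sketch of what "present the title as a Unicode object" could look like, here is a hypothetical Python 3 helper. It is not MP3Info's API. It assumes the ID3v2.4 layout in which the first byte of a text frame body is an encoding marker (0 to 3, in the order listed above); that numbering is my reading of the spec and should be checked against it before relying on this.

```python
# Encoding marker byte -> Python codec name (assumed ID3v2.4 numbering).
ID3V2_TEXT_CODECS = {
    0: "iso-8859-1",
    1: "utf-16",      # BOM present; the codec consumes it
    2: "utf-16-be",   # no BOM
    3: "utf-8",
}

def decode_text_frame_body(body):
    """Hypothetical helper: decode a text frame body whose first byte names the encoding."""
    codec = ID3V2_TEXT_CODECS[body[0]]
    # Drop any trailing NUL terminator after decoding.
    return body[1:].decode(codec).rstrip("\x00")

sample = b"\x00K\xf6ttbullar i n\xe4san"   # marker 0: ISO-8859-1 payload
print(len(decode_text_frame_body(sample)))
```

With a marker byte present the decoder has something to go on. Older tags carry no such marker, which is why a byte string is the only honest result there.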

Cheers,
Brian
