Swedish characters in Python strings

Urban Anjar

Oct 12, 2002, 1:44:29 PM

Hi,
I have found something that looks like a bug, or at least a not so
pleasant feature. In Swedish we often use the characters å, ä and ö (a
with a ring, a with two dots and o with two dots) and I don't get them
to work perfectly well in Python.

Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> S = 'abc'
>>> print S
abc
>>> print len(S)
3

That is perfectly OK, but...

>>> S = 'åäö'
>>> print S
åäö
>>> print len(S)
6

Look at this code snippet:

#!/usr/bin/python
def rev(S):
    if S:
        return S[-1] + rev(S[:-1])
    else:
        return ''

str = 'abcåäö'
print rev(str)

Running it gives:
[urban@falcon urban]$ ./rev
?äå?cba

Of course I can analyze in detail how the characters are represented and
make some kind of workaround, but I think this is not the Python way. In
assembler or C I have to think of things like that, but do I have to do
that in Python?
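
The garbled output above is what one would expect if the script is saved as UTF-8 and the string is reversed byte by byte. A minimal sketch in present-day Python 3 (an assumption; the sessions in this thread are Python 2.2, where plain strings are byte strings), with hypothetical helper names:

```python
def reverse_bytes(text: str) -> str:
    """Reverse the UTF-8 bytes of `text`, the way a byte string is reversed."""
    raw = text.encode("utf-8")
    # Reversing splits each two-byte sequence; undecodable bytes become U+FFFD.
    return raw[::-1].decode("utf-8", errors="replace")


def reverse_chars(text: str) -> str:
    """Reverse `text` character by character."""
    return text[::-1]


if __name__ == "__main__":
    word = "abc\u00e5\u00e4\u00f6"  # 'abc' followed by a-ring, a-umlaut, o-umlaut
    print(ascii(reverse_bytes(word)))
    print(ascii(reverse_chars(word)))
```

The byte-wise version leaves a stray continuation byte at the front and a stray lead byte before "cba", which a terminal typically shows as "?", while the character-wise version gives the intended result.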

Another example:

>>> L = ['Åke','Ärla','Östen']
>>> print L
['\xc3\x85ke', '\xc3\x84rla', '\xc3\x96sten']

Please let me know if I am doing something wrong, or if you too think
of this as a bug.

Sincerely,
Urban Anjar

Stefan

Oct 12, 2002, 3:17:14 PM

Urban Anjar wrote:
> Hi,
> I have found something that looks like a bug, or at least a not so
> pleasant feature. In Swedish we often use the characters å, ä and ö (a
> with a ring, a with two dots and o with two dots) and I don't get them
> to work perfectly
> well in Python.
>
> Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
> [GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> >>> S = 'abc'
> >>> print S
> abc
> >>> print len(S)
> 3
>
> That is perfectly OK, but...
>
> >>> S = 'åäö'
> >>> print S
> åäö
> >>> print len(S)
> 6

å, ä and ö cannot be represented as single bytes in UTF-8; that's why each
of them is a multibyte character.

Learn more about Unicode on this page:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
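
To make the byte-versus-character distinction concrete, here is a small sketch in present-day Python 3 (an assumption; the thread itself uses Python 2.2). The helper name `lengths` is made up for illustration:

```python
def lengths(text: str, encoding: str = "utf-8") -> tuple:
    """Return (number of characters, number of encoded bytes) for `text`."""
    return len(text), len(text.encode(encoding))


if __name__ == "__main__":
    swedish = "\u00e5\u00e4\u00f6"  # a-ring, a-umlaut, o-umlaut
    print(lengths("abc"))                  # ASCII: one byte per character
    print(lengths(swedish))                # UTF-8: two bytes per character here
    print(lengths(swedish, "iso-8859-1"))  # Latin-1: one byte per character
```

With UTF-8 the three Swedish letters occupy six bytes, which is the 6 reported by len() on a byte string; with ISO-8859-1 they occupy three.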

Fredrik Lundh

Oct 13, 2002, 7:22:36 AM

Urban Anjar wrote:

> That is perfectly OK, but...
>
> >>> S = 'åäö'
> >>> print S
> åäö
> >>> print len(S)
> 6

on all machines I have access to, I get:

>>> S = "åäö"
>>> print S
åäö
>>> print len(S)
3

check the locale settings; to minimize the pain, make sure you use
an 8-bit encoding (e.g. ISO-8859-1) and not a
designed-for-internal-use-only variable-width encoding like UTF-8.

with UTF-8, your operating system is messing things up before Python
gets a chance to look at the characters (most likely, Python gets 6
characters from the keyboard, and sends 6 characters to the console).
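
One way to check which encodings are in play is to ask the interpreter directly. The following is only a sketch for present-day Python 3 (an assumption; it is not the Python 2.2 of this thread), and `describe_encodings` is a hypothetical helper name:

```python
import locale
import sys


def describe_encodings() -> dict:
    """Collect the encodings that affect how typed and printed text is interpreted."""
    return {
        "locale": locale.getpreferredencoding(False),
        # Streams may have been replaced by objects without an encoding attribute.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, value)
```

If the locale entry reports UTF-8 while a program assumes one byte per character, the length mismatch described above is the likely result.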

if you cannot get RedHat to behave intelligently, use a decent editor
instead.

(avoiding RedHat 8.0 might also help. based on the kind of bugs I've
experienced this far, 8.0 might qualify as the worst unix-like operating
system ever released...)

</F>

Urban Anjar

Oct 13, 2002, 3:30:46 PM

"Fredrik Lundh" <fre...@pythonware.com> wrote in message news:<0Icq9.2091$MV.8...@newsc.telia.net>...
(...)

> check the locale settings; to minimize the pain, make sure you use
> an 8-bit encoding (e.g ISO-8859-1) and not a designed-for-internal-
> use-only variable-width encoding like UTF-8.
>
> with UTF-8, your operating system is messing things up before Python
> gets a chance to look at the characters (most likely, Python gets 6
> characters from the keyboard, and sends 6 characters to the console).

I'll have to try that too. I found another solution that IMHO
is very kludgy... I thought this could be a simple example
for beginners, but it grew into almost a whole-weekend hack...

8<---------------------------------
#!/usr/bin/python
# Fixed some åäö-problems
import sys

def rev(S):
    if S:
        return S[-1] + rev(S[:-1])
    else:
        return ''

arg_list = sys.argv[1:]

for str in arg_list:
    str = unicode(str,"utf-8") # ugly fix 1
    str = rev(str)
    print str.encode("utf-8"), # ugly fix 2
print

8<---------------------------------
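
The two "ugly fixes" amount to decoding bytes into text on the way in and encoding text back into bytes on the way out. A hedged sketch of the same idea in present-day Python 3 (an assumption; `rev_utf8` is a made-up name, and slicing replaces the recursive rev for brevity):

```python
def rev_utf8(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Reverse encoded text without splitting multibyte characters."""
    text = raw.decode(encoding)            # bytes -> text at the input boundary
    reversed_text = text[::-1]             # work on characters, not bytes
    return reversed_text.encode(encoding)  # text -> bytes at the output boundary


if __name__ == "__main__":
    data = "abc\u00e5\u00e4\u00f6".encode("utf-8")
    print(rev_utf8(data))
```

The caller still has to know the encoding of the incoming bytes; the function only makes that assumption explicit through its `encoding` parameter.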

I think Python should take care of that kind of conversions
behind the scene...

By the way, is there a method to test strings for how they are
coded before messing with them in a program?

Urban

PS

> (avoiding RedHat 8.0 might also help. based on the kind of bugs I've
> experienced this far, 8.0 might qualify as the worst unix-like operating
> system ever released...)

I sort of like it anyway, but waiting for 8.1 or 8.2 may be a good
idea. They have changed a lot, including going from Python 1.5.2 to 2.2.1.
But this is not the right place for an OS war...

/U

Martin v. Loewis

Oct 13, 2002, 3:38:40 PM

urban...@hik.se (Urban Anjar) writes:

> I think Python should take care of that kind of conversions
> behind the scene...

Python cannot really do this, both because it is not possible, and
because it would break tons of existing code.

> By the way, is there a method to test strings for how they are
> coded before messing with them in a program.

That's part of the problem: this is not possible. If it were possible,
Python would be doing it behind the scene.

Regards,
Martin

Brian Quinlan

Oct 13, 2002, 5:13:38 PM

Urban wrote:
> I think Python should take care of that kind of conversions
> behind the scene...

How would Python do this?

> By the way, is there a method to test strings for how they are
> coded before messing with them in a program.

No. Given a sequence of bytes, it is pretty tough to accurately
determine the encoding. There are libraries that try, but they don't
always succeed.
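
To illustrate why such guessing is unreliable, here is a deliberately naive sketch in present-day Python 3 (an assumption; it is not one of the libraries mentioned above, and `guess_decode` is a hypothetical name). It simply tries candidate encodings in order:

```python
def guess_decode(raw: bytes, candidates=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


if __name__ == "__main__":
    print(ascii(guess_decode(b"K\xc3\xb6ttbullar")))  # valid UTF-8
    print(ascii(guess_decode(b"K\xf6ttbullar")))      # not valid UTF-8
```

A failed UTF-8 decode is a useful hint, but ISO-8859-1 accepts every possible byte, so the fallback never fails and proves nothing: bytes written in some other 8-bit encoding would be reported as ISO-8859-1 just the same.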

Cheers,
Brian

maxm

Oct 14, 2002, 8:00:55 AM

Fredrik Lundh wrote:

> (avoiding RedHat 8.0 might also help. based on the kind of bugs I've
> experienced this far, 8.0 might qualify as the worst unix-like operating
> system ever released...)

Do you have any specifics or links to anything? My win2K server just
made metal dust out of the c: drive so I was about to install 8.0 on a
new disk instead.

Would you advise against it?

regards Max M

Magnus Heino

Oct 13, 2002, 9:39:50 AM

> check the locale settings; to minimize the pain, make sure you use
> an 8-bit encoding (e.g ISO-8859-1) and not a designed-for-internal-
> use-only variable-width encoding like UTF-8.

Still, all new RH8 installs do use utf-8, and there must be a good reason
for that, and I guess it's something they will do for a while now...

> with UTF-8, your operating system is messing things up before Python
> gets a chance to look at the characters (most likely, Python gets 6
> characters from the keyboard, and sends 6 characters to the console).

Is that the reason in this case? This is RH8.0, en_US.utf-8

Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import MP3Info
>>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
>>> title
'K\xf6ttbullar i n\xe4san'
>>> print title
K?ttbullar i n?san
>>> type(title)
<type 'str'>
>>> print title
K?ttbullar i n?san
>>> print u'K\xf6ttbullar i n\xe4san'.encode('utf-8')
Köttbullar i näsan
>>>

> (avoiding RedHat 8.0 might also help. based on the kind of bugs I've
> experienced this far, 8.0 might qualify as the worst unix-like operating
> system ever released...)

Besides this stuff, I think it's really nice..

--

/Magnus

Martin v. Löwis

Oct 15, 2002, 11:20:56 AM

Magnus Heino <magnus...@pleon.sigma.se> writes:

> > check the locale settings; to minimize the pain, make sure you use
> > an 8-bit encoding (e.g ISO-8859-1) and not a designed-for-internal-
> > use-only variable-width encoding like UTF-8.
>
> Still, all new RH8 installs do use utf-8, and there must be a good reason
> for that, and I guess its something they will do for a while now...

I disagree with Fredrik that UTF-8 is for internal use only. Using
UTF-8 locales is the only way to solve several aspects of Unix
internationalization, in particular supporting non-ASCII file names,
and supporting non-ASCII configuration files (specifically /etc/passwd).

> >>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
> >>> title
> 'K\xf6ttbullar i n\xe4san'
> >>> print title
> K?ttbullar i n?san

In this case, it appears that the title in the MP3 file is encoded in
Latin-1, not in UTF-8. Your terminal expects UTF-8. The data you print
are invalid UTF-8, so the terminal refuses to display them. To print
the data properly in your terminal, do

print unicode(title, "iso-8859-1").encode("utf-8")

Again, there is nothing that Python can do about that: It is not
possible to know what encoding title has - it could just as well be,
say, KOI8-R (in which case \xf6 would be CYRILLIC CAPITAL LETTER ZHE,
not LATIN SMALL LETTER O WITH DIAERESIS).
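
The ambiguity can be shown by decoding the same byte under both encodings. This is a sketch in present-day Python 3 (an assumption; the thread uses Python 2.2), and both helper names are hypothetical:

```python
import unicodedata


def interpretations(byte_value: int, encodings=("iso-8859-1", "koi8-r")) -> dict:
    """Map each encoding to the Unicode name of the character the byte decodes to."""
    raw = bytes([byte_value])
    return {enc: unicodedata.name(raw.decode(enc)) for enc in encodings}


def latin1_to_utf8(raw: bytes) -> bytes:
    """Python 3 counterpart of unicode(title, "iso-8859-1").encode("utf-8")."""
    return raw.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    print(interpretations(0xF6))
    print(latin1_to_utf8(b"K\xf6ttbullar i n\xe4san"))
```

Both decodes succeed, so nothing in the bytes themselves says which reading is the intended one; the conversion to UTF-8 is only correct if the Latin-1 assumption is.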

> Besides this stuff, I think it's really nice..

I find it a reasonable decision to suggest that users use UTF-8 as their
default encoding, irrespective of the language they speak.
Unfortunately, many applications are not really prepared for
multi-byte encodings, but those applications must be corrected.

Regards,
Martin

Bengt Richter

Oct 15, 2002, 4:09:55 PM

On Sun, 13 Oct 2002 15:39:50 +0200, Magnus Heino <magnus...@pleon.sigma.se> wrote:
>Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
>[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
>Type "help", "copyright", "credits" or "license" for more information.
>>>> import MP3Info
>>>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
>>>> title
>'K\xf6ttbullar i n\xe4san'
>>>> print title
>K?ttbullar i n?san
>>>> type(title)
><type 'str'>
>>>> print title
>K?ttbullar i n?san
>>>> print u'K\xf6ttbullar i n\xe4san'.encode('utf-8')
>Köttbullar i näsan
>>>>

[OT] LOL! Is that a real title?

Regards,
Bengt Richter

Brian Quinlan

Oct 15, 2002, 4:31:26 PM

Magnus wrote:
: >>> import MP3Info

: >>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
: >>> title
: 'K\xf6ttbullar i n\xe4san'

A quick look at the ID3 specification reveals that Unicode was
not introduced until ID3v2, so determining the encoding before that was
not possible.

With ID3v2, the possible encodings are ISO-8859-1, UTF-16 + BOM,
UTF-16BE (no BOM) and UTF-8.

When reading ID3v2, MP3Info should present the title as a Unicode object
(though I have no idea if it actually does this or not).

When reading ID3v[0,1], MP3Info would have no choice but to present the
title as a byte string.
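
As a rough sketch of what decoding an ID3v2 title could look like: as far as I understand the ID3v2.4 specification, a text frame body starts with one byte that selects among the four encodings listed above. The code below is present-day Python 3 and is not taken from MP3Info; `decode_text_frame` and the table name are hypothetical:

```python
# Encoding byte values as I understand them from the ID3v2.4 text frame layout.
ID3V2_TEXT_ENCODINGS = {
    0: "iso-8859-1",
    1: "utf-16",     # byte order is taken from the BOM
    2: "utf-16-be",  # big-endian, no BOM
    3: "utf-8",
}


def decode_text_frame(payload: bytes) -> str:
    """Decode a text frame body: one encoding byte followed by the encoded text."""
    if not payload:
        raise ValueError("empty frame body")
    try:
        encoding = ID3V2_TEXT_ENCODINGS[payload[0]]
    except KeyError:
        raise ValueError("unknown text encoding byte: %d" % payload[0])
    # Drop any trailing NUL terminator after decoding.
    return payload[1:].decode(encoding).rstrip("\x00")


if __name__ == "__main__":
    print(ascii(decode_text_frame(b"\x00K\xf6ttbullar i n\xe4san")))
```

Because the encoding is declared in the data, no guessing is needed here, which is exactly what the older tags cannot offer.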

Cheers,
Brian