Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> S = 'abc'
>>> print S
abc
>>> print len(S)
3
That is perfectly OK, but...
>>> S = 'åäö'
>>> print S
åäö
>>> print len(S)
6
Look at this code snippet:
#!/usr/bin/python
def rev(S):
    if S:
        return S[-1] + rev(S[:-1])
    else:
        return ''
str = 'abcåäö'
print rev(str)
Running it gives:
[urban@falcon urban]$ ./rev
?äå?cba
Of course I can analyze in detail how characters are represented and
make some kind of workaround, but I think this is not the Python way. In
assembler or C I have to think of things like that, but do I have to do
that in Python?
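The garbled output of ./rev is what byte-wise reversal of UTF-8 data would be expected to produce. A minimal sketch, assuming the terminal handed 'abcåäö' to Python as UTF-8 bytes; the names `text` and `raw` are illustrative only, and escapes are used so the example does not depend on the source file's encoding:

```python
# Sketch: reversing UTF-8 bytes versus reversing characters.
text = u'abc\xe5\xe4\xf6'      # 'abcåäö' as a Unicode string (6 characters)
raw = text.encode('utf-8')     # the same text as 9 UTF-8 bytes

# Reverses individual bytes, so lead and continuation bytes swap places.
reversed_raw = raw[::-1]
# Reverses whole characters.
reversed_text = text[::-1]

try:
    reversed_raw.decode('utf-8')
    bytes_still_valid = True
except UnicodeDecodeError:
    bytes_still_valid = False

print(len(text))               # 6
print(len(raw))                # 9
print(bytes_still_valid)       # False
print(repr(reversed_text.encode('utf-8')))
```

The reversed byte string begins with a stray continuation byte and contains a lead byte with nothing to lead, which would account for the two '?' marks in the output.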
Another example:
>>> L = ['Åke','Ärla','Östen']
>>> print L
['\xc3\x85ke', '\xc3\x84rla', '\xc3\x96sten']
Please let me know if I am doing something wrong, or if you too think
of this as a bug.
Sincerely,
Urban Anjar
å, ä and ö cannot be represented as single bytes in UTF-8; that's why
each of them is a multibyte character.
Learn more about unicode on this page:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
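A small sketch of the difference, assuming the six bytes reported by len(S) are the UTF-8 encoding of the three characters. The variable names are made up for illustration:

```python
# Sketch: one string, counted as characters and as encoded bytes.
text = u'\xe5\xe4\xf6'                    # å, ä, ö as a Unicode string

char_count = len(text)                    # characters
utf8_count = len(text.encode('utf-8'))    # two bytes per character here
latin1_count = len(text.encode('iso-8859-1'))  # one byte per character

print("%d characters, %d UTF-8 bytes, %d ISO-8859-1 bytes"
      % (char_count, utf8_count, latin1_count))
```

A byte string holding UTF-8 data therefore reports 6 for len(), while the same text in an 8-bit encoding reports 3.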
> That is perfectly OK, but...
>
> >>> S = 'åäö'
> >>> print S
> åäö
> >>> print len(S)
> 6
on all machines I have access to, I get:
>>> S = "åäö"
>>> print S
åäö
>>> print len(S)
3
check the locale settings; to minimize the pain, make sure you use
an 8-bit encoding (e.g. ISO-8859-1) and not a
designed-for-internal-use-only variable-width encoding like UTF-8.
with UTF-8, your operating system is messing things up before Python
gets a chance to look at the characters (most likely, Python gets 6
characters from the keyboard, and sends 6 characters to the console).
if you cannot get RedHat to behave intelligently, use a decent editor
instead.
(avoiding RedHat 8.0 might also help. based on the kind of bugs I've
experienced this far, 8.0 might qualify as the worst unix-like operating
system ever released...)
</F>
I'll have to try that too. I found another solution that IMHO
is very kludgy... I thought this could be a simple example
for beginners, but it grew into almost a whole-weekend hack...
8<---------------------------------
#!/usr/bin/python
# Fixed some åäö-problems
import sys
def rev(S):
    if S:
        return S[-1] + rev(S[:-1])
    else:
        return ''
arg_list = sys.argv[1:]
for str in arg_list:
    str = unicode(str,"utf-8") # ugly fix 1
    str = rev(str)
    print str.encode("utf-8"), # ugly fix 2
print
8<---------------------------------
I think Python should take care of that kind of conversions
behind the scene...
By the way, is there a method to test how strings are
encoded before messing with them in a program?
Urban
PS
> (avoiding RedHat 8.0 might also help. based on the kind of bugs I've
> experienced this far, 8.0 might qualify as the worst unix-like operating
> system ever released...)
I sort of like it, anyway, but waiting for 8.1 or 8.2 may be a good
idea. They have changed a lot including going from Python 1.5.2 to 2.2.1
But this is not the right place for an OS-war...
/U
> I think Python should take care of that kind of conversions
> behind the scene...
Python cannot really do this, both because it is not possible, and
because it would break tons of existing code.
> By the way, is there a method to test how strings are
> encoded before messing with them in a program?
That's part of the problem: this is not possible. If it were possible,
Python would be doing it behind the scene.
Regards,
Martin
How would Python do this?
> By the way, is there a method to test how strings are
> encoded before messing with them in a program?
No. Given a sequence of bytes, it is pretty tough to accurately
determine the encoding. There are libraries that try, but they don't
always succeed.
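A minimal sketch of the kind of guess such code can make, and of how it goes wrong. The helper name `guess_decode` is hypothetical, and the "strict UTF-8 first, ISO-8859-1 as fallback" rule is just one common heuristic, not a method taken from any particular library:

```python
def guess_decode(data):
    """Return (text, codec_name); the codec is only a guess."""
    try:
        # Text in other encodings rarely happens to be valid UTF-8,
        # so a successful strict decode is a fairly strong hint.
        return data.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        # Every byte string decodes as ISO-8859-1, so this never fails,
        # but nothing says it is the right answer.
        return data.decode('iso-8859-1'), 'iso-8859-1'

utf8_result = guess_decode(u'\xe5\xe4\xf6'.encode('utf-8'))
latin1_result = guess_decode(b'K\xf6ttbullar')
# KOI8-R input is silently misread as ISO-8859-1:
koi8_result = guess_decode(u'\u0416'.encode('koi8_r'))

print(utf8_result[1])
print(latin1_result[1])
print(koi8_result[1])
```

The last call shows the limitation: the byte really was a Cyrillic letter, yet the guess reports ISO-8859-1 without any error.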
Cheers,
Brian
> (avoiding RedHat 8.0 might also help. based on the kind of bugs I've
> experienced this far, 8.0 might qualify as the worst unix-like operating
> system ever released...)
Do you have any specifics or links to anything? My win2K server just
made metal dust out of the c: drive so I was about to install 8.0 on a
new disk instead.
Would you advise against it?
regards Max M
Still, all new RH8 installs do use utf-8, and there must be a good reason
for that, and I guess it's something they will do for a while now...
> with UTF-8, your operating system is messing things up before Python
> gets a chance to look at the characters (most likely, Python gets 6
> characters from the keyboard, and sends 6 characters to the console).
Is that the reason in this case? This is RH8.0, en_US.utf-8
Python 2.2.1 (#1, Aug 30 2002, 12:15:30)
[GCC 3.2 20020822 (Red Hat Linux Rawhide 3.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import MP3Info
>>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
>>> title
'K\xf6ttbullar i n\xe4san'
>>> print title
K?ttbullar i n?san
>>> type(title)
<type 'str'>
>>> print title
K?ttbullar i n?san
>>> print u'K\xf6ttbullar i n\xe4san'.encode('utf-8')
Köttbullar i näsan
>>>
> (avoiding RedHat 8.0 might also help. based on the kind of bugs I've
> experienced this far, 8.0 might qualify as the worst unix-like operating
> system ever released...)
Besides this stuff, I think it's really nice..
--
/Magnus
> > check the locale settings; to minimize the pain, make sure you use
> > an 8-bit encoding (e.g. ISO-8859-1) and not a
> > designed-for-internal-use-only variable-width encoding like UTF-8.
>
> Still, all new RH8 installs do use utf-8, and there must be a good reason
> for that, and I guess it's something they will do for a while now...
I disagree with Fredrik that UTF-8 is for internal use only. Using
UTF-8 locales is the only way to solve several aspects of Unix
internationalization, in particular supporting non-ASCII file names,
and supporting non-ASCII configuration files (specifically /etc/passwd).
> >>> title = getattr(MP3Info.MP3Info(open('file.mp3', 'rb')), 'title')
> >>> title
> 'K\xf6ttbullar i n\xe4san'
> >>> print title
> K?ttbullar i n?san
In this case, it appears that the title in the MP3 file is encoded in
Latin-1, not in UTF-8. Your terminal expects UTF-8. The data you print
are invalid UTF-8, so the terminal refuses to display them. To print
the data properly in your terminal, do
print unicode(title, "iso-8859-1").encode("utf-8")
Again, there is nothing that Python can do about that: It is not
possible to know what encoding title has - it could just as well be,
say, KOI8-R (in which case \xf6 would be CYRILLIC CAPITAL LETTER ZHE,
not LATIN SMALL LETTER O WITH DIAERESIS).
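The ambiguity can be sketched as follows, using the byte string from the session quoted above. Variable names are illustrative, and the output is limited to character names so that it does not depend on the terminal's encoding:

```python
import unicodedata

title = b'K\xf6ttbullar i n\xe4san'    # the bytes returned for the title

# The bytes are not valid UTF-8, which is why a UTF-8 terminal shows '?'.
try:
    title.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Both of these decodes succeed; only outside knowledge says which is right.
as_latin1 = title.decode('iso-8859-1')
as_koi8r = title.decode('koi8_r')

print(valid_utf8)
print(unicodedata.name(as_latin1[1]))
print(unicodedata.name(as_koi8r[1]))

# Once the source encoding is known, re-encode for a UTF-8 terminal.
utf8_for_terminal = as_latin1.encode('utf-8')
```

The second character comes out as LATIN SMALL LETTER O WITH DIAERESIS under ISO-8859-1 and as CYRILLIC CAPITAL LETTER ZHE under KOI8-R, with no error in either case.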
> Besides this stuff, I think it's really nice..
I find it a reasonable decision to suggest that users use UTF-8 as
their default encoding, irrespective of the language they
speak. Unfortunately, many applications are not really prepared for
multi-byte encodings, but those applications must be corrected.
Regards,
Martin
Regards,
Bengt Richter
A quick look at the ID3 specification reveals that Unicode was
not introduced until ID3v2, so determining the encoding before that was
not possible.
With ID3v2, the possible encodings are ISO-8859-1, UTF-16 + BOM,
UTF-16BE (no BOM) and UTF-8.
When reading ID3v2, MP3Info should present the title as a Unicode object
(though I have no idea if it actually does this or not).
When reading ID3v[0,1], MP3Info would have no choice but to present the
title as a byte string.
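A rough sketch of how a reader could turn an ID3v2 text frame into a Unicode object. It assumes the ID3v2.4 layout in which the first byte of the frame body selects the encoding (0 to 3, in the order listed above); the helper name `decode_text_frame` is hypothetical and this is not MP3Info's actual code:

```python
# Encoding byte -> Python codec, assuming the ID3v2.4 text frame layout.
ID3V2_CODECS = {
    b'\x00': 'iso-8859-1',
    b'\x01': 'utf-16',      # byte order taken from the BOM
    b'\x02': 'utf-16-be',   # big-endian, no BOM
    b'\x03': 'utf-8',
}

def decode_text_frame(payload):
    """Decode a text frame body: one encoding byte, then the text."""
    codec = ID3V2_CODECS[payload[:1]]     # KeyError for unknown values
    text = payload[1:].decode(codec)
    return text.rstrip(u'\x00')           # drop an optional terminator

title = u'K\xf6ttbullar i n\xe4san'
from_latin1 = decode_text_frame(b'\x00' + title.encode('iso-8859-1'))
from_utf16 = decode_text_frame(b'\x01' + title.encode('utf-16'))
from_utf8 = decode_text_frame(b'\x03' + title.encode('utf-8'))
print(from_latin1 == from_utf16 == from_utf8)
```

Because the frame itself names its encoding, no guessing is needed here, unlike with the older tags.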
Cheers,
Brian