trying to understand unicode


F. Petitjean

Apr 20, 2005, 6:58:35 AM

Python has very good support for Unicode, UTF-8, encodings ... But I
have some difficulty with the concepts and the vocabulary. The
documentation is not bad, but for example when reading
http://docs.python.org/lib/module-unicodedata.html
it took me a long time to figure out what unicodedata.digit(unichr) would
mean; a simple example is badly lacking.
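
As an illustration of what digit() returns next to its siblings decimal() and numeric(), here is a minimal sketch. It assumes Python 3, where unichr() is spelled chr(); the sample characters and the helper name numeric_properties are picked for this example only and do not come from the documentation.

```python
import unicodedata

# Sample characters chosen for illustration only:
# DIGIT SEVEN, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF, LATIN CAPITAL LETTER A
samples = ["7", "\u00b2", "\u00bd", "A"]


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


for ch in samples:
    print(repr(ch), unicodedata.name(ch), numeric_properties(ch))
```

DIGIT SEVEN has all three values, SUPERSCRIPT TWO has a digit value but no decimal value, VULGAR FRACTION ONE HALF only has a numeric value, and a letter has none of them.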

So I wrote the following script :

#!/usr/bin/env python

"""Example of use of the unicodedata module
http://docs.python.org/lib/module-unicodedata.html
"""

import unicodedata
import sys

# outcodec = 'latin_1'
outcodec = 'iso8859_15'
# an optional first argument overrides the output codec
if len(sys.argv) > 1:
    outcodec = sys.argv[1]

for c in range(256):
    uc = unichr(c)
    uname = unicodedata.name(uc, None)
    # skip code points that have no name
    if uname:
        unfd = unicodedata.normalize('NFD', uc).encode(outcodec, 'replace')
        unfc = unicodedata.normalize('NFC', uc).encode(outcodec, 'replace')
        print str(c).ljust(3), uname.ljust(42), unfd.ljust(2), unfc.ljust(2), \
            unicodedata.category(uc), unicodedata.numeric(uc, None)
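
The script above is Python 2 (unichr, print statement). For readers on Python 3, a rough port might look like the sketch below. The function name describe_latin1 is made up for this example, and decoding the encoded bytes back to text is a choice made here so that print() shows characters rather than a bytes repr.

```python
import unicodedata


def describe_latin1(outcodec="iso8859_15"):
    """Yield one formatted line per named code point in range(256)."""
    for c in range(256):
        uc = chr(c)  # Python 3 spelling of unichr()
        uname = unicodedata.name(uc, None)
        if uname is None:
            continue  # code points without a name are skipped
        fields = []
        for form in ("NFD", "NFC"):
            raw = unicodedata.normalize(form, uc).encode(outcodec, "replace")
            # decode again so the line holds characters, not a bytes repr
            fields.append(raw.decode(outcodec).ljust(2))
        yield " ".join([
            str(c).ljust(3), uname.ljust(42), fields[0], fields[1],
            unicodedata.category(uc), str(unicodedata.numeric(uc, None)),
        ])


if __name__ == "__main__":
    for line in describe_latin1():
        print(line)
```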

And here are some samples of the output:
44  COMMA                                      ,  ,  Po None
45  HYPHEN-MINUS                               -  -  Pd None
46  FULL STOP                                  .  .  Po None
47  SOLIDUS                                    /  /  Po None
48  DIGIT ZERO                                 0  0  Nd 0.0
49  DIGIT ONE                                  1  1  Nd 1.0
50  DIGIT TWO                                  2  2  Nd 2.0

It seems that the 'Nd' category means numerical digit, doh!

64  COMMERCIAL AT                              @  @  Po None
65  LATIN CAPITAL LETTER A                     A  A  Lu None
66  LATIN CAPITAL LETTER B                     B  B  Lu None

'Lu' should read 'Letter upper' ?

94  CIRCUMFLEX ACCENT                          ^  ^  Sk None
95  LOW LINE                                   _  _  Pc None
96  GRAVE ACCENT                               `  `  Sk None
97  LATIN SMALL LETTER A                       a  a  Ll None
98  LATIN SMALL LETTER B                       b  b  Ll None
'Ll' == Letter lower

124 VERTICAL LINE                              |  |  Sm None
125 RIGHT CURLY BRACKET                        }  }  Pe None
126 TILDE                                      ~  ~  Sm None
160 NO-BREAK SPACE                                   Zs None
161 INVERTED EXCLAMATION MARK                  ¡  ¡  Po None

What a gap !
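
The gap between 126 and 160 comes from the `if uname:` test in the script: the skipped code points are control characters, which have no name in the database. A small sketch to check this, assuming Python 3 (chr() in place of unichr()) and nothing beyond the unicodedata module:

```python
import unicodedata

# Code points in range(256) for which unicodedata.name() finds no name.
unnamed = [c for c in range(256) if unicodedata.name(chr(c), None) is None]

# General categories of those unnamed code points.
categories = {unicodedata.category(chr(c)) for c in unnamed}

print(unnamed[:3], "...", unnamed[-3:])
print(categories)
```

The unnamed code points are 0 to 31 and 127 to 159, all in category 'Cc' (control).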

245 LATIN SMALL LETTER O WITH TILDE            o? õ  Ll None
246 LATIN SMALL LETTER O WITH DIAERESIS        o? ö  Ll None
247 DIVISION SIGN                              ÷  ÷  Sm None
248 LATIN SMALL LETTER O WITH STROKE           ø  ø  Ll None
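
The "o?" in the NFD column is not a display glitch. NFD splits the letter into a base letter plus a combining mark, and the combining mark has no equivalent in ISO 8859-15, so the 'replace' error handler turns it into '?'. A minimal sketch, assuming Python 3:

```python
import unicodedata

ch = "\u00f5"  # LATIN SMALL LETTER O WITH TILDE

nfd = unicodedata.normalize("NFD", ch)  # decomposed: base letter + combining mark
nfc = unicodedata.normalize("NFC", ch)  # composed: a single code point

print([unicodedata.name(c) for c in nfd])
print(nfd.encode("iso8859_15", "replace"))
print(nfc.encode("iso8859_15", "replace"))
```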

'Sm' should read 'sign mathematics' ?
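
For reference, the two-letter codes are Unicode general categories: the first letter is the major class (L letter, N number, P punctuation, S symbol, Z separator) and the second the subclass. The sketch below maps the codes seen in the output above to their long names. The table is hand-copied from the Unicode general category list and deliberately incomplete, and the names CATEGORY_NAMES and describe_category are made up for this example (Python 3 assumed).

```python
import unicodedata

# Long names for the general category codes that appear in the sample
# output above. Hand-copied and not exhaustive.
CATEGORY_NAMES = {
    "Lu": "Letter, uppercase",
    "Ll": "Letter, lowercase",
    "Nd": "Number, decimal digit",
    "Pc": "Punctuation, connector",
    "Pd": "Punctuation, dash",
    "Pe": "Punctuation, close",
    "Po": "Punctuation, other",
    "Sk": "Symbol, modifier",
    "Sm": "Symbol, math",
    "Zs": "Separator, space",
}


def describe_category(ch):
    """Return (code, long name) for the general category of ch."""
    code = unicodedata.category(ch)
    return code, CATEGORY_NAMES.get(code, "(not in this small table)")


for ch in "A", "a", "0", "\u00f7", "_":
    print(repr(ch), describe_category(ch))
```

So 'Sm' reads "Symbol, math", and 'Nd' reads "Number, decimal digit".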

I think that such code snippets should be included in the documentation
or in a Wiki.

Regards

Sorry for my bad English; I'm not a native speaker.

John Machin

Apr 20, 2005, 8:01:56 AM

On 20 Apr 2005 10:58:35 GMT, "F. Petitjean"
<little...@news.free.fr> wrote:

>Python has very good support for Unicode, UTF-8, encodings ... But I
>have some difficulty with the concepts and the vocabulary.

You're not alone there. But I don't expect the docs for the Python
implementation of Unicode to explain the concepts and vocabulary of
Unicode. That's the job of the Unicode consortium, and they do a
not-unreasonable job of it; see www.unicode.org. In particular,

http://www.unicode.org/Public/UNIDATA/UCD.html

explains all the things that the Python unicodedata module is
implementing.

> The
>documentation is not bad, but for example when reading
>http://docs.python.org/lib/module-unicodedata.html
>it took me a long time to figure out what unicodedata.digit(unichr) would
>mean; a simple example is badly lacking.
>
>So I wrote the following script :
>

[snip]
>

>I think that such code snippets should be included in the documentation
>or in a Wiki.
>

Any effort should be directed (IMESHO) towards (a) keeping the URL in
the Python documentation up-to-date [it's not] and (b) using the *LATEST*
version of the UCD file when each version of Python is released [still
stuck on 3.2.0 when the current version available from Unicode.org is
4.1.0].
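
Which version of the Unicode Character Database a given interpreter was built with can be read from the module itself. A minimal sketch, assuming Python 3; the value printed depends on the interpreter, so no particular version is implied here:

```python
import sys
import unicodedata

# Version of the Unicode Character Database compiled into this interpreter.
ucd_version = unicodedata.unidata_version

print("Python", sys.version.split()[0], "uses UCD", ucd_version)
```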

[Exit, pursued by a bear.]
[Noises off.]

OK OK don't hit me, Martin, how about instructions on how to DIY,
then?

Cheers,
John
