So I wrote the following script:
#!/usr/bin/env python
"""Example of use of the unicodedata module
http://docs.python.org/lib/module-unicodedata.html
"""
# Python 2 script (uses unichr and the print statement).
import unicodedata
import sys

# outcodec = 'latin_1'
outcodec = 'iso8859_15'
# An optional first argument overrides the output codec.
if len(sys.argv) > 1:
    outcodec = sys.argv[1]

for c in range(256):
    uc = unichr(c)
    # Control characters have no name; skip them.
    uname = unicodedata.name(uc, None)
    if uname:
        # Characters the codec cannot represent come out as '?'.
        unfd = unicodedata.normalize('NFD', uc).encode(outcodec, 'replace')
        unfc = unicodedata.normalize('NFC', uc).encode(outcodec, 'replace')
        print str(c).ljust(3), uname.ljust(42), unfd.ljust(2), \
            unfc.ljust(2), \
            unicodedata.category(uc), unicodedata.numeric(uc, None)
And here are some samples of the output:
44 COMMA , , Po None
45 HYPHEN-MINUS - - Pd None
46 FULL STOP . . Po None
47 SOLIDUS / / Po None
48 DIGIT ZERO 0 0 Nd 0.0
49 DIGIT ONE 1 1 Nd 1.0
50 DIGIT TWO 2 2 Nd 2.0
It seems that the 'Nd' category means 'numeric digit'. Doh!
64 COMMERCIAL AT @ @ Po None
65 LATIN CAPITAL LETTER A A A Lu None
66 LATIN CAPITAL LETTER B B B Lu None
'Lu' should read 'Letter, upper'?
94 CIRCUMFLEX ACCENT ^ ^ Sk None
95 LOW LINE _ _ Pc None
96 GRAVE ACCENT ` ` Sk None
97 LATIN SMALL LETTER A a a Ll None
98 LATIN SMALL LETTER B b b Ll None
'Ll' == 'Letter, lower'
124 VERTICAL LINE | | Sm None
125 RIGHT CURLY BRACKET } } Pe None
126 TILDE ~ ~ Sm None
160 NO-BREAK SPACE Zs None
161 INVERTED EXCLAMATION MARK ¡ ¡ Po None
What a gap! (Code points 127 to 159 are unnamed control characters, so
the script skips them.)
245 LATIN SMALL LETTER O WITH TILDE o? õ Ll None
246 LATIN SMALL LETTER O WITH DIAERESIS o? ö Ll None
247 DIVISION SIGN ÷ ÷ Sm None
248 LATIN SMALL LETTER O WITH STROKE ø ø Ll None
'Sm' should read 'sign, mathematics'?
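For what it's worth, those guesses look close. As far as I can tell from
the Unicode general category list, the first letter is the major class
and the second letter the subclass, so 'Nd' is 'Number, decimal digit',
'Lu' is 'Letter, uppercase', 'Ll' is 'Letter, lowercase' and 'Sm' is
'Symbol, math'. Here is a minimal sketch in Python 3 syntax (where
unichr is spelled chr). The names describe_category, MAJOR and MINOR are
made up for illustration, and MINOR only covers the codes that appear in
the output above:

```python
import unicodedata

# First letter of the two-letter code: the major class.
MAJOR = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

# Deliberately partial: only the codes seen in the sample output.
MINOR = {
    "Lu": "uppercase", "Ll": "lowercase", "Nd": "decimal digit",
    "Po": "other", "Pd": "dash", "Pc": "connector", "Pe": "close",
    "Sm": "math", "Sk": "modifier", "Zs": "space",
}


def describe_category(ch):
    """Return a readable label for the general category of one character."""
    code = unicodedata.category(ch)
    major = MAJOR.get(code[0], "?")
    minor = MINOR.get(code, "?")  # "?" for codes not listed above
    return "%s = %s, %s" % (code, major, minor)


for ch in ["0", "A", "a", "\u00f7", "-", " "]:
    print(repr(ch), describe_category(ch))
```

Codes missing from MINOR print a "?" instead of a guess, so the table
can be extended as new ones show up.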
I think that such code snippets should be included in the documentation
or in a Wiki.
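
One more snippet in that spirit, about the 'o?' in the NFD column for
245 and 246. My understanding is that NFD splits the letter into a plain
'o' followed by a combining mark, and the combining mark has no
equivalent in iso8859_15, so 'replace' turns it into '?'. A minimal
sketch, again in Python 3 syntax rather than the Python 2 of the script:

```python
import unicodedata

ch = "\u00f5"  # LATIN SMALL LETTER O WITH TILDE (code point 245)

nfd = unicodedata.normalize("NFD", ch)  # decomposed form
nfc = unicodedata.normalize("NFC", ch)  # composed form

# NFD yields two code points: the base letter and a combining mark.
print([unicodedata.name(c) for c in nfd])

# The combining mark cannot be encoded, so 'replace' substitutes '?'.
print(nfd.encode("iso8859_15", "replace"))

# The composed form fits in a single byte of the codec.
print(nfc.encode("iso8859_15", "replace"))
```

So the '?' is not part of the normalized string; it only appears when
the decomposed form is squeezed into an 8-bit codec.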
Regards
Sorry for my bad English; I'm not a native speaker.
>Python has very good support for Unicode, UTF-8, encodings... But I
>have some difficulties with the concepts and the vocabulary.
You're not alone there. But I don't expect the docs for the Python
implementation of Unicode to explain the concepts and vocabulary of
Unicode. That's the job of the Unicode Consortium, and they do a
not-unreasonable job of it; see www.unicode.org. In particular,
http://www.unicode.org/Public/UNIDATA/UCD.html
explains all the things that the Python unicodedata module
implements.
> The
>documentation is not bad, but for example, when reading
>http://docs.python.org/lib/module-unicodedata.html
>it took me a long time to figure out what unicodedata.digit(unichr)
>means; a simple example is badly lacking.
>
>So I wrote the following script:
>
[snip]
>
>I think that such code snippets should be included in the documentation
>or in a Wiki.
>
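On the digit() question specifically, the module has three related
lookups (decimal, digit and numeric), and as far as I can tell they
differ only in how wide a net they cast. A rough sketch, in Python 3
syntax where unichr is spelled chr; numeric_properties is a made-up
helper name, not part of the module:

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for one character, None where undefined."""
    return (
        unicodedata.decimal(ch, None),  # ordinary decimal digits only
        unicodedata.digit(ch, None),    # also digit-like forms such as superscripts
        unicodedata.numeric(ch, None),  # any numeric value, fractions included
    )


# 7, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF, and a plain letter
for ch in ["7", "\u00b2", "\u00bd", "A"]:
    print(unicodedata.name(ch).ljust(28), numeric_properties(ch))
```

The second argument to each call is the default returned when the
property is undefined; without it, the functions raise ValueError.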
Any effort should be directed (IMESHO) towards (a) keeping the URL in
the Python documentation up to date [it's not] and (b) using the
*LATEST* version of the UCD file when each version of Python is
released [still stuck on 3.2.0 when the current version available from
Unicode.org is 4.1.0].
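
For anyone who wants to check their own installation, the module
exposes the version of the database it was built from. A minimal sketch
in Python 3 syntax; ucd_version is just a local name chosen here:

```python
import sys
import unicodedata

# Version string of the Unicode Character Database compiled into this
# interpreter's unicodedata module.
ucd_version = unicodedata.unidata_version

print("Python", sys.version.split()[0])
print("Unicode Character Database", ucd_version)
```

Comparing that string with the latest version listed on unicode.org
shows how far behind a given Python release is.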
[Exit, pursued by a bear.]
[Noises off.]
OK OK don't hit me, Martin, how about instructions on how to DIY,
then?
Cheers,
John