regenerating unicodedata for py2.7 using py3 makeunicodedata.py?

Vlastimil Brom

Nov 13, 2010, 12:55:35 PM

to python

Hi all,
I'd like to ask about a surprising possibility I found while
investigating the new Unicode 6.0 standard for use in Python.
As the Python 2 series won't be updated in this regard
( http://bugs.python.org/issue10400 ),
I tried my "poor man's approach" of compiling the needed pyd file with
the recent Unicode data (cf. the older post
http://mail.python.org/pipermail/python-list/2010-March/1240002.html ).
While checking the changed format, I found to my big surprise that it
is possible to generate the header files using the py3
makeunicodedata.py,
which has already been updated for Unicode 6.0; this is even much more
convenient than the previous versions, as the needed data are
downloaded automatically.
http://svn.python.org/view/python/branches/py3k/Tools/unicode/makeunicodedata.py?view=markup&pathrev=85371
It turned out that the resulting headers are accepted by MS Visual
C++ Express along with the py2.7 source files
and that the generated unicodedata.pyd seems to work, at
least in the cases I have tested so far.

Is it intended, or even guaranteed, that these generated files are
compatible across py2.7 and py3, or am I going to be bitten by some
less obvious issues later?

The newly added ranges and characters are available; only in the CJK
Unified Ideographs Extension D are the character names missing
(while the categories are present), but this appears to be the same in
the original unicodedata with 5.2 for CJK Unified Ideographs Extension C.

>>> unicodedata.unidata_version
'6.0.0'
>>> unicodedata.name(u"\U0002B740") # 0x2B740-0x2B81F; CJK Unified Ideographs Extension D # unicode 6.0 addition
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(u"\U0002B740")
'Lo'
>>>

###########################

>>> unicodedata.unidata_version
'5.2.0'
>>> unicodedata.name(u"\U0002A700") # 0x2A700-0x2B73F; CJK Unified Ideographs Extension C
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(u"\U0002A700")
'Lo'
>>>

Could anybody please confirm whether this way of updating
unicodedata for 2.7 is generally viable, or point out possible problems
it may lead to?
Many thanks in advance,
Vlastimil Brom

Martin v. Loewis

Nov 13, 2010, 5:40:34 PM

> Is it intended, or even guaranteed, that these generated files are
> compatible across py2.7 and py3, or am I going to be bitten by some
> less obvious issues later?

It works because the generated files are just arrays of structures,
and these structures are the same in 2.7 and 3.2. However, there is
no guarantee about this property: you will need to check for changes
to unicodedata.c to see whether they may affect compatibility.
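
Beyond reading the changes to unicodedata.c, a small smoke test run after each rebuild can catch gross incompatibilities early. The following is only a minimal sketch, written for Python 3; the helper name `smoke_test` and the choice of sample code points are assumptions made for illustration and are not part of CPython.

```python
import unicodedata

# Hypothetical sample: (code point, expected general category).
# U+2B740 is the first code point of CJK Unified Ideographs Extension D,
# which was added in Unicode 6.0; the others are long-standing characters.
SAMPLES = [
    (0x0041, "Lu"),   # LATIN CAPITAL LETTER A
    (0x4E00, "Lo"),   # first CJK Unified Ideograph
    (0x2B740, "Lo"),  # first code point of Extension D
]


def smoke_test(module=unicodedata, min_version=(6, 0, 0)):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    version = tuple(int(part) for part in module.unidata_version.split("."))
    if version < min_version:
        problems.append(
            "unidata_version %s is older than expected" % module.unidata_version
        )
    for code_point, expected in SAMPLES:
        actual = module.category(chr(code_point))
        if actual != expected:
            problems.append(
                "U+%04X: category %r, expected %r" % (code_point, actual, expected)
            )
    return problems
```

An empty result only says that these few lookups behave as expected; it does not prove that the structures still match.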

Regards,
Martin

Vlastimil Brom

Nov 13, 2010, 7:10:47 PM

to pytho...@python.org

2010/11/13 Martin v. Loewis <mar...@v.loewis.de>:

Thanks for the confirmation Martin!

Do you think the mentioned omission of the character names of some
CJK ranges in unicodedata is intended, or should it be reported to the
tracker?

Regards,
Vlastimil Brom

Martin v. Loewis

Nov 18, 2010, 12:45:55 PM

to Vlastimil Brom, pytho...@python.org

> Thanks for the confirmation Martin!
>
> Do you think the mentioned omission of the character names of some
> CJK ranges in unicodedata is intended, or should it be reported to the
> tracker?

It's certainly a bug. So a bug report would be appreciated, but much
more so a patch. Ideally, the patch would either be completely
forward-compatible (should the CJK ranges change in future Unicode
versions), or at least have a safeguard to detect that the data file
is getting out of sync with the C implementation.
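
One possible shape for such a safeguard, sketched in Python rather than C: read the `First>`/`Last>` range markers from the data file and compare them with the ranges the code assumes. The function names `cjk_ranges` and `check_in_sync` are hypothetical, the sample is limited to the three ranges discussed in this thread, and none of this is the actual CPython implementation.

```python
# Excerpt in the UnicodeData.txt record format, limited to the ranges
# named in this thread (Unicode 6.0 values).
SAMPLE_UNICODE_DATA = """\
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
2B734;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
"""


def cjk_ranges(lines):
    """Collect (first, last) pairs from '<CJK Ideograph..., First/Last>' records."""
    ranges = []
    first = None
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2 or not fields[1].startswith("<CJK Ideograph"):
            continue
        code_point = int(fields[0], 16)
        if fields[1].endswith(", First>"):
            first = code_point
        elif fields[1].endswith(", Last>") and first is not None:
            ranges.append((first, code_point))
            first = None
    return ranges


def check_in_sync(data_ranges, hardcoded_ranges):
    """Raise ValueError if the data file and the hard-coded list disagree."""
    if sorted(data_ranges) != sorted(hardcoded_ranges):
        raise ValueError(
            "CJK ranges out of sync: data file has %r, code expects %r"
            % (sorted(data_ranges), sorted(hardcoded_ranges))
        )
```

Run at generation time, a check like this would turn a silent mismatch into a build failure whenever a new Unicode version moves or adds a range.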

Regards,
Martin

Vlastimil Brom

Nov 19, 2010, 9:49:21 AM

to pytho...@python.org

2010/11/18 Martin v. Loewis <mar...@v.loewis.de>:

Thanks,
I just created a bug ticket:
http://bugs.python.org/issue10459

The omissions of character names seem to be:

龼 (0x9fbc) - 鿋 (0x9fcb)
(CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])

𪜀 (0x2a700) - 𫜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])

𫝀 (0x2b740) - 𫠝 (0x2b81d)
(CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])

(Also the unprintable ASCII controls, Surrogates and Private use area,
where the missing names are probably ok.)
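
A list like the one above can be reproduced with a short scan over a range of code points. This is a minimal sketch for Python 3 (on a narrow py2.7 build, characters above U+FFFF would have to be constructed differently); the names `unnamed_code_points` and `as_runs` are made up for this example.

```python
import unicodedata

# Categories whose members are expected to lack names:
# Cc = control, Cs = surrogate, Co = private use, Cn = unassigned.
DEFAULT_SKIP = ("Cc", "Cs", "Co", "Cn")


def unnamed_code_points(start, end, skip_categories=DEFAULT_SKIP):
    """Return code points in [start, end] that have no name and whose
    category is not listed in skip_categories."""
    missing = []
    for code_point in range(start, end + 1):
        char = chr(code_point)
        if unicodedata.category(char) in skip_categories:
            continue
        if unicodedata.name(char, None) is None:
            missing.append(code_point)
    return missing


def as_runs(code_points):
    """Collapse a sorted list of code points into (first, last) runs."""
    runs = []
    for code_point in code_points:
        if runs and code_point == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], code_point)
        else:
            runs.append((code_point, code_point))
    return runs
```

On a module affected by the omission, `as_runs(unnamed_code_points(0x2A700, 0x2B73F))` should report the Extension C run listed above; on a build where the names are present it should return an empty list.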

Unfortunately, I am not able to provide a patch, mainly because
unicodedata is C code.
A while ago I considered writing some unicodedata enhancements in
Python, which would support the ranges and script names, full category
names etc., but so far direct programmatic lookups in the online
Unicode docs, with some simple processing, have worked
sufficiently...

Regards,
Vlastimil Brom