Is it intended, or even guaranteed, that these generated files are
compatible across py2.7 and py3, or am I going to be bitten by some
less obvious issues later?
The newly added ranges and characters are available; only in CJK
Unified Ideographs Extension D are the character names missing
(while the categories are present), but the same appears to be true of
the original unicodedata with 5.2 for CJK Unified Ideographs Extension C.
>>> unicodedata.unidata_version
'6.0.0'
>>> unicodedata.name(u"\U0002B740") # 0x2B740-0x2B81F; CJK Unified Ideographs Extension D # unicode 6.0 addition
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(u"\U0002B740")
'Lo'
>>>
###########################
>>> unicodedata.unidata_version
'5.2.0'
>>> unicodedata.name(u"\U0002A700") # 0x2A700-0x2B73F; CJK Unified Ideographs Extension C
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: no such name
>>> unicodedata.category(u"\U0002A700")
'Lo'
>>>
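The two sessions above can be folded into one small helper that reports both the name and the category without raising. This is only a minimal sketch, assuming Python 3 (on 2.7, `unichr` would replace `chr`); the function name `probe` is made up for illustration. What it prints depends on the Unicode data bundled with the interpreter that runs it.

```python
import unicodedata


def probe(code_point):
    """Return (name, category) for a code point; name is None if missing."""
    ch = chr(code_point)
    # The two-argument form returns the default instead of raising ValueError.
    name = unicodedata.name(ch, None)
    return name, unicodedata.category(ch)


# First code points of Extension C and Extension D, as in the sessions above.
for cp in (0x2A700, 0x2B740):
    name, category = probe(cp)
    print("U+%04X (unidata %s): name=%r category=%s"
          % (cp, unicodedata.unidata_version, name, category))
```

A `None` name together with category `'Lo'` corresponds to the behaviour shown in the tracebacks above.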
Could anybody please confirm whether this way of updating
unicodedata for 2.7 is generally viable, or point out possible problems
it may lead to?
Many thanks in advance,
Vlastimil Brom
It works because the generated files are just arrays of structures,
and these structures are the same in 2.7 and 3.2. However, there is
no guarantee about this property: you will need to check for changes
to unicodedata.c to see whether they may affect compatibility.
Regards,
Martin
Thanks for the confirmation Martin!
Do you think the mentioned omission of the character names for some
CJK ranges in unicodedata is intended, or should it be reported to the
tracker?
Regards,
Vlastimil Brom
It's certainly a bug. So a bug report would be appreciated, but much
more so a patch. Ideally, the patch would either be completely
forward-compatible (should the CJK ranges change in future Unicode
versions),
or at least have a safe-guard to detect that the data file is getting
out of sync with the C implementation.
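One way such a safeguard might look, sketched in Python rather than in the C module or its generator script: read the `<CJK Ideograph..., First>` / `<..., Last>` marker entries from UnicodeData.txt-style lines and compare them with a hard-coded list. The function names, the `SAMPLE` lines and the `HARDCODED` list are assumptions for illustration only; they are not taken from unicodedata.c.

```python
def ranges_from_unicodedata(lines):
    """Collect (first, last) CJK ideograph ranges from First/Last entries."""
    ranges = []
    first = None
    for line in lines:
        fields = line.split(";")
        if len(fields) < 2:
            continue  # blank or malformed line
        code_point, name = int(fields[0], 16), fields[1]
        if not name.startswith("<CJK Ideograph"):
            continue
        if name.endswith(", First>"):
            first = code_point
        elif name.endswith(", Last>") and first is not None:
            ranges.append((first, code_point))
            first = None
    return ranges


def check_in_sync(hardcoded, lines):
    """Raise ValueError if the data file and the hard-coded ranges differ."""
    found = ranges_from_unicodedata(lines)
    if sorted(found) != sorted(hardcoded):
        raise ValueError("CJK ranges out of sync: data file has %r, "
                         "code expects %r" % (found, hardcoded))


# Illustrative lines in UnicodeData.txt format (not a full data file).
SAMPLE = [
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
    "2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;",
    "2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;",
]

# Hypothetical hard-coded list standing in for whatever the C code assumes.
HARDCODED = [(0x4E00, 0x9FCB), (0x2B740, 0x2B81D)]

check_in_sync(HARDCODED, SAMPLE)  # passes silently when both agree
```

Run against the real data file at generation time, a check of this kind would fail loudly when a new Unicode version moves or adds a CJK range.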
Regards,
Martin
The omissions of character names seem to be:
龼 (0x9fbc) - 鿋 (0x9fcb)
(CJK Unified Ideographs [19968-40959] [0x4e00-0x9fff])
𪜀 (0x2a700) - 𫜴 (0x2b734)
(CJK Unified Ideographs Extension C [173824-177983] [0x2a700-0x2b73f])
𫝀 (0x2b740) - 𫠝 (0x2b81d)
(CJK Unified Ideographs Extension D [177984-178207] [0x2b740-0x2b81f])
(Also the unprintable ASCII controls, Surrogates and Private use area,
where the missing names are probably ok.)
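A list like the one above could be produced by scanning for code points that have a category but no name. The following is a rough sketch under the assumption of Python 3; `find_unnamed`, `as_runs` and `EXPECTED_UNNAMED` are names invented here, and the scanned range is arbitrary. The result depends on the Unicode data of the interpreter in use, so no output is shown.

```python
import unicodedata

# Categories where a missing name is expected: controls, surrogates,
# private use and unassigned code points.
EXPECTED_UNNAMED = frozenset(["Cc", "Cs", "Co", "Cn"])


def find_unnamed(start, end, skip_categories=EXPECTED_UNNAMED):
    """Return code points in [start, end] without a name, skipping categories."""
    missing = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in skip_categories:
            continue
        if unicodedata.name(ch, None) is None:
            missing.append(cp)
    return missing


def as_runs(code_points):
    """Collapse a sorted list of code points into (first, last) runs."""
    runs = []
    for cp in code_points:
        if runs and cp == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], cp)
        else:
            runs.append((cp, cp))
    return runs


# Scan planes 0-2 and print contiguous runs of unnamed code points.
for first, last in as_runs(find_unnamed(0x0000, 0x2FFFF)):
    print("U+%04X - U+%04X" % (first, last))
```

Passing `skip_categories=()` would also report the controls, surrogates and private use areas.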
Unfortunately, I am not able to provide a patch, mainly because
unicodedata is C code.
A while ago I considered writing some unicodedata enhancements in
Python, which would support the ranges and script names, full category
names etc., but so far direct programmatic lookups in the online
Unicode docs, with some simple processing, have also worked
sufficiently...
Regards,
Vlastimil Brom