CNS 11643 data on Chinese Mac

TenThousandThings

Feb 28, 2018, 3:22:23 PM
to Chinese Mac
Hi, I'm back. I went a little crazy with the CNS data, laying the foundation for what will become a discussion of Taiwan's educational character sets, and how they relate to fonts and publishing. There are things I don't understand about variation selectors, compatibility ideographs, and how fonts work, so that will have to wait, but the pages for CNS 11643 Planes 1-7 are finished, at least for the 99.95% of it that is in Unicode:


These are not yet linked from the main site, as I haven't yet decided how to approach it. For the new site, really what I want to do is provide the four basic lists from Taiwan in their entirety:

常用國字標準字體表 (4,408 hanzi) common
次常用國字標準字體表 (6,341 hanzi) less common
罕用字體表 (18,480 hanzi) rare
異體國字字表 (18,609 hanzi) variants

These date to 1982-1984 (all of them are in CNS 11643 Planes 1-7 and in Unicode), and I'm not sure whether or how they have been updated. Handling the actual data is trivial once you learn a little bit about Ruby (in my case) or any other language with regular expressions. Ken Lunde provides the first two lists, but I haven't yet tried to find the data to generate the other two.
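To give a feel for how little is involved, here is a minimal sketch in Python rather than Ruby. It assumes a mapping file with one "plane-code<TAB>Unicode hex" pair per line; that layout, the sample lines, and the name parse_cns_map are illustrative assumptions, not the actual files or scripts used for the pages.

```python
import re

# Illustrative sample only: an assumed "plane-code<TAB>Unicode hex" layout.
# Real CNS mapping files may be laid out differently.
SAMPLE = """\
1-4421\t4E00
3-272A\t2F98F
this line does not match and is skipped
"""

# plane (hex digits), a dash, a 4-digit hex code, a tab, a 4-6 digit code point
LINE_RE = re.compile(r"^([0-9A-F]+)-([0-9A-F]{4})\t([0-9A-F]{4,6})$")


def parse_cns_map(text):
    """Return a dict mapping 'plane-code' strings to single characters."""
    mapping = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore comments, blanks, and anything unexpected
        plane, code, cp_hex = match.groups()
        mapping[f"{plane}-{code}"] = chr(int(cp_hex, 16))
    return mapping


if __name__ == "__main__":
    for cns, char in parse_cns_map(SAMPLE).items():
        print(cns, f"U+{ord(char):04X}", char)
```

Keying the dict by the CNS position makes it easy to walk a plane in order and emit one table row per character.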

NOTE: If you are in Sierra or High Sierra, you'll see a blank glyph at CNS T3-272A (U+2F98F) on Plane 3 -- this is due to the Hiragino Sans CNS font, which is the default for that code point but doesn't actually have a glyph for it. There are a lot of blank glyphs in that font from the CJK Compatibility Ideographs Supplement, but I think only one of them is on the first seven planes of CNS:


Curiously, Baoli TC does have a glyph for it in Sierra/High Sierra, but it gets bumped by the Hiragino Sans glyph, which macOS doesn't know is blank.
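For anyone who wants to poke at that code point, here is a small sketch using Python's standard unicodedata module. It only inspects Unicode character data (name, block range, canonical decomposition); it says nothing about which font macOS picks. The helper name describe is made up for this example.

```python
import unicodedata

# Range of the CJK Compatibility Ideographs Supplement block
SUPPLEMENT_BLOCK = range(0x2F800, 0x2FA20)


def describe(cp):
    """Collect basic Unicode character data for one code point."""
    ch = chr(cp)
    return {
        "name": unicodedata.name(ch),
        "in_supplement": cp in SUPPLEMENT_BLOCK,
        # Compatibility ideographs decompose canonically to a unified
        # ideograph, so NFC normalization replaces them.
        "decomposition": unicodedata.decomposition(ch),
        "nfc": unicodedata.normalize("NFC", ch),
    }


if __name__ == "__main__":
    info = describe(0x2F98F)
    print(info["name"], info["in_supplement"])
    print("decomposes to U+" + info["decomposition"])
```

The point to notice is that normalizing the text swaps the compatibility ideograph for its unified counterpart, so a page that wants to show U+2F98F itself has to avoid normalizing it.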

CNS Planes 10-14 are more problematic. They are probably best approached from the perspective of Unicode's source data, rather than CNS. Plane 15 has CNS characters not yet in Unicode. This includes a steady flow of new submissions used in names and places in Taiwan. A lot of these have to do with Hakka, Southern Min, and other languages. Maybe fun if you know them or need them for your research, but outside of my wheelhouse...

ER

Eric Rasmussen

Feb 28, 2018, 3:39:45 PM
to Chinese Mac
On Wed, Feb 28, 2018 at 3:22 PM, TenThousandThings wrote:
CNS Planes 10-14 are more problematic. They are probably best approached from the perspective of Unicode's source data, rather than CNS. Plane 15 has CNS characters not yet in Unicode. This includes a steady flow of new submissions used in names and places in Taiwan. A lot of these have to do with Hakka, Southern Min, and other languages. Maybe fun if you know them or need them for your research, but outside of my wheelhouse...

Ugh, allow me to correct that:

[1] "CNS Planes 10-14 are more problematic" should read "CNS Planes 10-15 are more problematic."
[2] "Plane 15 has CNS characters not yet in Unicode" is a reference to Plane 15 of UNICODE, not Plane 15 of CNS. Fire that editor! Unicode Plane 15 is one of the Unicode Private Use Areas, and CNS maps its characters there when they are not yet in Unicode proper...
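To make the Unicode-plane point concrete, here is a short sketch of how one might test whether a code point sits on Unicode Plane 15 and is private use. It relies only on Python's standard unicodedata module; the function names are made up for this example.

```python
import unicodedata


def unicode_plane(cp):
    """Unicode plane number: each plane holds 0x10000 code points."""
    return cp >> 16


def is_private_use(cp):
    """True when the general category is Co (private use)."""
    return unicodedata.category(chr(cp)) == "Co"


if __name__ == "__main__":
    for cp in (0x4E00, 0x2F98F, 0xF0000):
        print(f"U+{cp:04X}", "plane", unicode_plane(cp),
              "private use" if is_private_use(cp) else "assigned or other")
```

A filter like this is one way to separate mapping entries that point at real Unicode characters from those parked in the private use area.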

There's a page on this, here:


But it's currently just a skeleton.

ER

Eric Rasmussen

Mar 2, 2018, 5:22:48 PM
to Chinese Mac
On Wed, Feb 28, 2018 at 3:22 PM, TenThousandThings wrote:
[...] [EDITED] CNS Planes 10-15 are more problematic. They are probably best approached from the perspective of Unicode's source data, rather than CNS. Unicode's Supplementary Private Use Area has CNS characters not yet in Unicode. This includes a steady flow of new submissions used in names and places in Taiwan. A lot of these have to do with Hakka, Southern Min, and other languages. [...]

Okay, so I've taken that as far as I can. If you read carefully at the end of my CNS entry, you'll hear the result of a long, wasted day trying to make sense of the higher planes of CNS. Words like "not useful" and "tangled web" get used. What needs to be done is to isolate the "TSource" hanzi in Unicode and then sort them from there. Unicode supplies the data to do that, but it's slightly harder than what I've been doing, which, as I said, was trivial for someone with a decent text editor and a very basic knowledge of regular expressions.
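As a rough idea of what isolating the T-source hanzi could look like, here is a sketch that filters lines in the tab-separated layout the Unihan database uses ("U+XXXX<TAB>field<TAB>value") for the kIRG_TSource field and groups them by CNS plane. The sample lines are illustrative stand-ins for the real Unihan file, and group_t_source is a hypothetical name; this is an assumption about how one might start, not a finished tool.

```python
import re
from collections import defaultdict

# Illustrative lines in the Unihan tab-separated layout, not the real file.
SAMPLE = """\
# comment lines start with a hash
U+4E00\tkIRG_GSource\tG0-523B
U+4E00\tkIRG_TSource\tT1-4421
U+2F98F\tkIRG_TSource\tT3-272A
"""

# code point, the kIRG_TSource field, then "T<plane>-<4 hex digits>"
T_SOURCE_RE = re.compile(
    r"^U\+([0-9A-F]{4,6})\tkIRG_TSource\tT([0-9A-F]+)-([0-9A-F]{4})$"
)


def group_t_source(text):
    """Map each CNS plane label to a list of (code point, CNS code) pairs."""
    planes = defaultdict(list)
    for line in text.splitlines():
        match = T_SOURCE_RE.match(line)
        if match is None:
            continue  # other fields, comments, blank lines
        cp_hex, plane, code = match.groups()
        planes[plane].append((int(cp_hex, 16), code))
    return dict(planes)


if __name__ == "__main__":
    for plane, entries in sorted(group_t_source(SAMPLE).items()):
        print("plane", plane, [f"U+{cp:04X}={code}" for cp, code in entries])
```

The plane label is kept as a string because the source identifiers write it as it appears after the "T", and sorting or regrouping can be done afterwards.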


I could not find datasets for the two more advanced hanzi lists 《罕用國字標準字體表》 and 《異體國字字表》 from the Ministry of Education on their site. You can search in them and use them, which is great and their site will be a featured part of the "Language Tools" page when I get around to updating that, but the raw data is not available as far as I can tell. Maybe I'm just missing it? So I think it has to be reverse engineered from Unicode and CNS, which is beyond what I can do. I'm like a dog with a bone, though, so I'll probably keep trying to find an easier way to get at it.

I think it would be great to be able to say, "This is what a serious font for Traditional-Chinese scholarly purposes should contain." The Unihan database provides most of the source data to do that, but I thought the MoE lists would be useful...

ER