Uncovered characters

Jan Eichhorn

Jul 5, 2012, 1:30:49 PM

to kan...@googlegroups.com

I wrote a program to check which characters are currently covered by kanjivg.

Jouyou (taught-in-school-kanji): Only two characters are missing:

5861 塡 ("fill up")

9830 頰 ("cheek")

Jinmeiyou (used-for-names-kanji):

103 items are missing, the list is here:

https://gist.github.com/3054938
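
For anyone who wants to repeat this kind of check, here is a minimal sketch. It is not the program that produced the lists above. It assumes each kanjivg file is named after its code point as five lowercase hex digits (for example `05861.svg`), optionally followed by a `-Variant` tag, and the function names are made up for illustration.

```python
import re

# Assumed file naming: five lowercase hex digits, an optional "-Variant" tag,
# then an extension, e.g. "04fda.svg" or "04fda-Kaisho.svg".
FILENAME_RE = re.compile(r"^([0-9a-f]{5})(?:-[^.]+)?\.(?:svg|xml)$")


def covered_codepoints(filenames):
    """Return the set of code points that have at least one file."""
    covered = set()
    for name in filenames:
        match = FILENAME_RE.match(name)
        if match:
            covered.add(int(match.group(1), 16))
    return covered


def missing_characters(wanted, filenames):
    """Return the characters of `wanted` that have no file, in code point order."""
    covered = covered_codepoints(filenames)
    return sorted(ch for ch in set(wanted) if ord(ch) not in covered)


if __name__ == "__main__":
    # Hypothetical directory listing; a real run might use os.listdir() instead.
    files = ["0586b.svg", "0982c.svg", "04fda.svg", "04fda-Kaisho.svg"]
    wanted = "\u586b\u5861\u982c\u9830"
    for ch in missing_characters(wanted, files):
        print(f"{ord(ch):04X} {ch}")
```

The `wanted` string would be the Jouyou or Jinmeiyou list, taken from whatever source is at hand.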

Characters/codepoints that are referenced by kanjivg element, original, phon, or type attributes, but do not have a kanjivg file of their own:

https://gist.github.com/3054901
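
A rough sketch of how such dangling references could be collected. It assumes the attributes appear in the files literally as `kvg:element="..."`, `kvg:original="..."`, `kvg:phon="..."` and `kvg:type="..."`, and it scans the raw text with a regular expression rather than parsing the XML. The function names are hypothetical.

```python
import re

# Assumed attribute spelling inside the files; only these four are scanned.
ATTR_RE = re.compile(r'kvg:(?:element|original|phon|type)="([^"]*)"')


def referenced_characters(file_text):
    """Collect every non-ASCII character used in the four attributes."""
    referenced = set()
    for value in ATTR_RE.findall(file_text):
        referenced.update(ch for ch in value if ord(ch) > 0x7F)
    return referenced


def unresolved_references(file_texts, covered):
    """Return referenced characters whose code point is not in `covered`."""
    referenced = set()
    for text in file_texts:
        referenced |= referenced_characters(text)
    return sorted(ch for ch in referenced if ord(ch) not in covered)


if __name__ == "__main__":
    # Made-up fragment in the assumed attribute style.
    sample = '<g id="kvg:04eba" kvg:element="\u4eba"><path kvg:type="\u31d2"/></g>'
    for ch in unresolved_references([sample], covered={0x4EBA}):
        print(f"{ord(ch):04X} {ch}")
```

Here `covered` is the set of code points that already have a file of their own.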

Also not covered are the Unicode Kangxi radicals codeblock, the Kangxi strokes and nearly all radical supplements.

All JLPT kanji are covered, as well as all kanji that have a frequency ranking in kanjidic.

(http://www.csse.monash.edu.au/~jwb/kanjidic.html)

Did you know that...

The item with the most Bezier points is 03091 ゑ ("hiragana we", 46 points)

The item with most strokes is 09a6b 驫 ("pony farm", 30 strokes)

The item with most element groups is 097c8 韈 ("socks", 20 groups)

Greetings, Jan

Karl Rosvold

Jul 5, 2012, 10:30:53 PM

to kan...@googlegroups.com

Hi Jan,

This is great. Thank you for finding this list and also pointing it out to the mailing list.

I think the reason why there are some characters missing is that the new list of Joyo Kanji was officially adopted in 2010, but Ulrich had already finished the basic data entry before then. Just from glancing at the characters you listed, I think they are all, or almost all, variants of other characters which I'm quite certain already exist in KanjiVG. Unicode doesn't always have separate code points for kanji that have the same meaning, reading, etc. but are written slightly differently. For example, 冷: the right side is written differently in mincho and kyokasho fonts, but there is only one code point. Japanese people can sometimes be very stubborn about the exact way to write kanji for place names and personal names, so with the 2010 Joyo and slightly earlier Jinmei kanji revisions, things that had only one code point before were identified as officially recognized variants. Based on the fact that you were able to print these characters out, Unicode now has code points for them.

You already did a great job finding these missing characters, but would it also be possible to write some kind of code to find which characters they are variants of?

For example, by manually searching, I can find that:

5861 塡 ("fill up") is a variant of: 填 (586B)

9830 頰 ("cheek") is a variant of 頬 (982C)

Actually, looking at these two, it seems the encodings of the kanji which they are variants of are very close, so maybe a script is not necessary.
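
The closeness of these two pairs is simple arithmetic on the code points quoted above. A minimal sketch (the helper name is made up for illustration):

```python
def codepoint_gap(a, b):
    """Absolute distance between the code points of two single characters."""
    return abs(ord(a) - ord(b))


# (missing character, existing variant) pairs from the examples above.
pairs = [("\u5861", "\u586b"), ("\u9830", "\u982c")]
for missing, existing in pairs:
    gap = codepoint_gap(missing, existing)
    print(f"U+{ord(missing):04X} vs U+{ord(existing):04X}: gap {gap}")
```

The gaps come out as 10 (0x586B - 0x5861) and 4 (0x9830 - 0x982C). A small gap on its own does not show that two characters are related, so it can only serve as a hint for where to look.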

One more reason why those kanji were probably not included is that, if I remember correctly, KanjiVG started as a project to cover JIS X0208 (the 'first' 6879 characters, 6355 of them kanji), and at least for 'cheek' and 'fill up' the variants are in JIS X0213 but not JIS X0208.

Now that the characters are somehow more mainstream than they were before 2010, in my opinion it is appropriate to add them.

Karl

--
You received this message because you are subscribed to the "KanjiVG" group.
For options and unsubscribing, visit this group at
http://groups.google.com/group/kanjivg

msk...@ansuz.sooke.bc.ca

Jul 6, 2012, 2:26:21 AM

to kan...@googlegroups.com

On Fri, 6 Jul 2012, Karl Rosvold wrote:
> For example, by manually searching, I can find that:
>
> 5861 塡 ("fill up") is a variant of: 填 (586B)
> 9830 頰 ("cheek") is a variant of 頬 (982C)
>
> Actually, looking at these two, it seems the encodings of the kanji which
> they are variants of are very close, so maybe a script is not necessary.

I think that in general we're unlikely to find that "missing" characters
are variants of characters with separate code points. See, for
instance, this photo I took in Maibara last Summer:
http://ansuz.sooke.bc.ca/gallery/index.php?display=2011-jtrip-17%2FP1000971.jpg

Out of four characters, three fail to match Unicode's Japanese example
glyphs:

湯 - nonstandard simplification. On the sign the upper right component
looks like 口, but Unicode's examples for all languages have it
looking like 日.
谷 - standard form is on the sign
神 - Unicode's Japanese example glyph has the small stroke at upper left
vertical, but the sign has it diagonal, typical of Chinese style.
社 - Unicode's Japanese example uses the newer form of the left-side
radical, which looks like 礻, but the sign uses the form that
looks like 示, typical of Korean style.

I don't think any of these have separate code points for the other forms.
In general, if a glyph really is a "variant" form of a character, then
under Unicode's unification policy it won't have a separate code point;
Unicode would only assign one if they made a mistake (which sometimes
happens) or if some other Unicode policy (in particular, round-trip
compatibility) took precedence over that one.

For the separate code point to be numerically close is even less likely.
The first block of Unified Han characters was assigned in Kang Xi radical
and stroke order, so semantically related characters may end up close if
they both happen to be in that block. But even that isn't guaranteed, and
if the characters are not both in the first-assigned block of the Unified
Han range, all bets are off. Rare "variant" characters, if they are
assigned separate code points at all, may tend to be added in later blocks
which would be very far away in code point space.

I don't think this is really a big problem. There's no particular reason
that KanjiVG needs a separate code point for every database entry.
Already we have multiple XML files for some code points; for instance,
there are three for U+4FDA 俚. In principle, there's no reason we
couldn't have a database entry for a kanji that had no code point at all.
We just need to have some other way of knowing what to name the file.

--
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/

Karl Rosvold

Jul 8, 2012, 5:04:15 AM

to kan...@googlegroups.com

Matthew and Jan,

I went ahead and checked the 103 jinmeiyo kanji from Jan's list. It turns out that all but perhaps one are variants of kanji which are in the JIS X0208 standard. 99 out of 103 are in "JIS LEVEL 3, which is part of the JIS X0213 standard", 2 are part of a list called 補助漢字 ("supplementary kanji") according to JIS, 1 has another Unicode variant close by in the list of Unicode encodings, and I wasn't able to confirm anything for sure about the last one.

Matthew said:

> I think that in general we're unlikely to find that "missing" characters
> are variants of characters with separate code points. See, for
> instance, this photo I took in Maibara last Summer:
> http://ansuz.sooke.bc.ca/gallery/index.php?display=2011-jtrip-17%2FP1000971.jpg
>
> Out of four characters, three fail to match Unicode's Japanese example
> glyphs:
>
> 湯 - nonstandard simplification. On the sign the upper right component
> looks like 口, but Unicode's examples for all languages have it
> looking like 日.

==> Right. I don't think there is any character set that has 口 at the top rather than 日. It doesn't look familiar, so this is probably an abbreviation akin to the kinds of stroke omissions made on highway signs. You can see examples of this phenomenon on this website, for example: http://portal.nifty.com/2009/04/10/a/

> 谷 - standard form is on the sign
> 神 - Unicode's Japanese example glyph has the small stroke at upper left
> vertical, but the sign has it diagonal, typical of Chinese style.

==> This is the kind of difference that Japanese language teachers would mark wrong but most people wouldn't notice. But you're right. There aren't two JIS or Unicode points for this level of variation. That is a font thing as far as I know.

> 社 - Unicode's Japanese example uses the newer form of the left-side
> radical, which looks like 礻, but the sign uses the form that
> looks like 示, typical of Korean style.

==> Well, I don't know when the "F###" encodings were added to Unicode, but as you can see from this e-mail, 社(U+FA4C) and 社(U+793E) have different code points now. I have software from 2002 which had ALL but two of these variations listed, but only gave JIS encodings for the characters in Jan's list which have F### unicode code points. Therefore it is my guess that those were added sometime after 2002. It is even possible that they were added after those variants were officially added to the Jinmeiyo kanji list.

> I don't think any of these have separate code points for the other forms.
> In general, if a glyph really is a "variant" form of a character, then
> under Unicode's unification policy it won't have a separate code point;

==> This is the Unicode policy as far as I know, but if Unicode refused to give separate code points to these variants officially recognized by Japanese law and Japanese Industrial Standards, Unicode would be incapable of handling all the distinctions Japanese people want to make. I don't have any references, but I suspect that Unicode violated their 'no separate code points for variants' policy to accommodate the new Japanese policy.

> Unicode would only assign one if they made a mistake (which sometimes
> happens) or if some other Unicode policy (in particular, round-trip
> compatibility) took precedence over that one.

==> Again, I think if Unicode didn't accommodate these distinctions, it would have no chance to be adopted for use in computer encoding in Japan.

> For the separate code point to be numerically close is even less likely.

==> For #1-46 out of 103 on Jan's list, it IS close. Please refer to the attached excel sheet.
==> For #47-#103, as mentioned above, I think Unicode added these later to accommodate the new Japanese standard (c. 2010)

> I don't think this is really a big problem. There's no particular reason
> that KanjiVG needs a separate code point for every database entry.

==> Personally, it would be nice to have a complete set of characters that are recognized as being 'standard' in Japanese law. That said, other than on signs like the one you saw in Maibara, I don't think they are used very much, and they all have variants. For the two in Joyo kanji however, I think they should be added for sure.

> Already we have multiple XML files for some code points; for instance,
> there are three for U+4FDA 俚. In principle, there's no reason we
> couldn't have a database entry for a kanji that had no code point at all.
> We just need to have some other way of knowing what to name the file.

==> I will defer to your expertise in that area.

Here is a little more background on these various character sets. It's not the complete picture by any means, but hopefully it will shed a little more light onto why this situation has occurred.

Karl

Jan and Matthew,

Here's a summary of the 103 'jinmeiyo' (人名用) kanji which Jan mentioned as missing from KanjiVG.
For an overview of these kanji, see:
http://en.wikipedia.org/wiki/Jinmeiy%C5%8D_kanji

I suspect the reason these characters were not included in KanjiVG is as follows:
KanjiVG is primarily a Japanese project, and its characters were drawn from the JIS X 0208 standard rather than from Unicode, which includes thousands or tens of thousands of characters which are not commonly used in Japanese. JIS X 0208 includes "Level 1 & Level 2" JIS characters (a total of 6879 characters), whereas JIS X 0213 includes "Levels 1-4" for a total of 11,223 characters. For a description of the JIS kanji code standard, see:
http://ja.wikipedia.org/wiki/JIS%E6%BC%A2%E5%AD%97%E3%82%B3%E3%83%BC%E3%83%89

KanjiVG data, as far as I know, was being entered around 2007 +/- a couple of years. At that time there were only 285 Jinmeiyo kanji. I haven't explicitly checked, but I think all or almost all of them are included in JIS X 0208.

In September 2004, 484 new characters and 209 variant forms of joyo kanji (the 1981 standard) were added, bringing the total to 983 Jinmeiyou kanji.

On November 30, 2010, the Japanese government added 196 characters to the Joyo kanji list, as well as shifting some from the Jinmeiyo kanji list to the Joyo kanji list.

Unfortunately for computer users, the government's recognition of kanji variants went against Unicode's policy of allowing font makers to take care of variants and providing only one code point to cover all variants. The government's policy is understandable, however, since it tries to accommodate Japanese people's insistence that kanji for their names or for place names be written one particular way.

Of the 103 kanji you reported as being included in the (new) Jinmeiyo kanji but missing from KanjiVG, I verified that 99 were included in JIS X0213 but not in JIS X0208. This explains why they were not included in KanjiVG.
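
A check like this can be approximated with Python's built-in Japanese codecs. This is a minimal sketch, not the method used above: it relies on the `iso2022_jp` codec covering ASCII, JIS X 0201 Roman and JIS X 0208, and on `euc_jisx0213` covering the JIS X 0213 repertoire. The helper names are made up, and the test is only meant for kanji, not for symbols or kana.

```python
def encodable(ch, codec):
    """True if the single character `ch` can be encoded with `codec`."""
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def in_jis_x_0208(ch):
    # Assumption: for kanji, "encodable in iso2022_jp" means "in JIS X 0208".
    return ord(ch) > 0x7F and encodable(ch, "iso2022_jp")


def in_jis_x_0213(ch):
    # JIS X 0213 contains JIS X 0208, so this is also True for X 0208 kanji.
    return ord(ch) > 0x7F and encodable(ch, "euc_jisx0213")


if __name__ == "__main__":
    for ch in "\u586b\u5861\u982c\u9830":
        print(f"U+{ord(ch):04X} {ch} "
              f"X0208={in_jis_x_0208(ch)} X0213={in_jis_x_0213(ch)}")
```

A character for which only `in_jis_x_0213` is true falls into the "in JIS X0213 but not in JIS X0208" group.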

For the remaining 4, U+7626 and U+7E6B were included in a list referred to as "補助漢字" ("supplementary kanji") recognized by JIS, but not included in either the JIS X0208 or JIS X0213 standards.
Unicode U+4FF1 does not appear to be recognized in any JIS standard, although its JIS X0208 variant (U+5036) does exist in both Unicode and JIS.
Finally, U+541E is very similar in appearance to U+5451, which is a JIS X0208 character, but I couldn't find anything that indicated whether they are variants of each other or completely different characters.

By the way, the glyphs for all kanji in your list with Unicode code points starting with F### were not correct, or at least they were somehow transformed to their JIS X0208 equivalents when I copied them from the link on GitHub. I was able to successfully copy and paste the characters from http://www.cojak.org/index.php?term=F9D0&function=code_lookup by entering the Unicode code point that you listed. The glyph you listed made it easy for me to find the JIS X0208 versions of the characters.
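
One possible explanation, which nobody in this thread has confirmed, is Unicode normalization somewhere between the gist and the clipboard. Most code points in the U+F900 to U+FAFF range have a canonical decomposition to a unified ideograph, so normalizing the text replaces them. A minimal sketch with Python's standard `unicodedata` module (the helper name is made up for illustration):

```python
import unicodedata


def changed_by_nfc(text):
    """Return (original, normalized) pairs for characters that NFC rewrites."""
    changes = []
    for ch in text:
        normalized = unicodedata.normalize("NFC", ch)
        if normalized != ch:
            changes.append((ch, normalized))
    return changes


if __name__ == "__main__":
    # U+FA4C is rewritten; U+793E is already in normalized form.
    for before, after in changed_by_nfc("\ufa4c\u793e"):
        print(f"U+{ord(before):04X} -> U+{ord(after):04X}")
```

Running a list through this before publishing it shows which entries would not survive a normalizing tool.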

I still think it's good that you found these, and I think they should be added to KanjiVG if possible. I hope this sheds a bit of light as to what those characters are, and why they were not included in KanjiVG from the beginning. Also, in my experience, many software packages seem to have difficulty handling characters that are outside the JIS X0208 character set, although I seem to be able to display everything in Microsoft Excel now.

Karl

Karl Rosvold

Jul 8, 2012, 5:05:15 AM

to kan...@googlegroups.com

Here's the Excel file.

KanjiVG missing Jinmeiyo kanji.xlsx

msk...@ansuz.sooke.bc.ca

Jul 8, 2012, 8:49:05 AM

to kan...@googlegroups.com

On Sun, 8 Jul 2012, Karl Rosvold wrote:
> ==> This is the kind of difference that Japanese language teachers would
> mark wrong but most people wouldn't notice. But you're right. There aren't
> two JIS or Unicode points for this level of variation. That is a font thing
> as far as I know.

The question is how deeply we want to capture this kind of variation in
KanjiVG. I think it's likely that in many cases, we *will* want to
capture variations that Unicode doesn't.

> ==> Well, I don't know when the "F###" encodings were added to Unicode, but
> as you can see from this e-mail, 社 (U+FA4C) and 社(U+793E) have different
> code points now. I have software from 2002 which had ALL but two of these

Interesting. I was aware of the existence of the "Compatibility
Ideographs" block starting at U+F900, but hadn't realized it was relevant
to this case. It looks like U+FA4C was added specifically to capture this
distinction for compatibility with the Japanese standards that give the
two character forms (which Unicode would otherwise unify) separate code
points. For the reasons you describe, Unicode must permit separating them
now that the Japanese government does; that is Unicode's own policy,
round-trip compatibility trumps unification; but it becomes complicated
when a change in an external standard forces Unicode to un-unify two
characters it had formerly unified.

Saying they "have different code points now" may be overly simplistic. We
have the code point U+FA4C specifically for use in systems that require
drawing this distinction for compatibility with Japanese government
standards, but U+793E remains the standard code point for this character
elsewhere regardless of its form! That's what having a "compatibility"
code point means. For instance, the Koreans are not going to change their
documents and fonts to use U+FA4C instead of U+793E; Korean fonts will
still have U+793E looking like the Japanese version of U+FA4C rather than
like the Japanese version of U+793E.
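
The link between a compatibility code point and the unified code point it stands for is recorded in the Unicode Character Database as a canonical decomposition. A minimal sketch that reads it through Python's `unicodedata` module (the function name is made up for illustration; it returns None for any character without a single-code-point canonical decomposition, which includes ordinary unified ideographs):

```python
import unicodedata


def unified_equivalent(ch):
    """Return the character `ch` canonically decomposes to, or None."""
    decomposition = unicodedata.decomposition(ch)
    # Canonical decompositions carry no "<tag>" prefix; compatibility
    # ideographs decompose to exactly one code point.
    if not decomposition or decomposition.startswith("<"):
        return None
    parts = decomposition.split()
    if len(parts) != 1:
        return None
    return chr(int(parts[0], 16))


if __name__ == "__main__":
    for ch in "\ufa4c\u793e\u5861":
        target = unified_equivalent(ch)
        label = f"U+{ord(target):04X}" if target else "none"
        print(f"U+{ord(ch):04X} -> {label}")
```

Pairs such as U+5861 and U+586B are both unified ideographs, so this lookup returns None for them; it only helps with the compatibility code points.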

> any references, but I suspect that Unicode violated their 'no separate code
> points for variants" policy to accommodate the new Japanese policy.

I think they not so much violated it as they have another policy
(round-trip compatibility) that takes precedence over unification.

Karl Rosvold

Jul 8, 2012, 9:05:06 AM

to kan...@googlegroups.com, kan...@googlegroups.com

Hi Matthew.

I see. I didn't choose the best words in my original message, but I agree with everything you say. That's an interesting and accurate point you make about the Korean version of 社。

Karl

Jan Eichhorn

Jul 8, 2012, 2:15:49 PM

to kan...@googlegroups.com

Hi,

I checked the coverage of JIS X0208. KanjiVG does indeed cover all 6355 kanji plus the kana. Not covered are the Latin, Greek, and Cyrillic alphabets and the symbols. The characters belonging exclusively to JIS X0212 or JIS X0213 are almost entirely uncovered. The exceptions are:

(only found in 212)

5F50 彐

6220 戠

(only found in 213)

342C 㐬

34C1 㓁

4EBB 亻

590D 复

5C03 尃

8002 耂

8279 艹

98E0 飠

9EC3 黃

9ED1 黑

200A4 𠂤

26951 𦥑

Since I had a list of the most frequent Chinese characters, I checked that as well. Of the 3000 most frequent simplified characters, 1006 are missing from kanjivg. If they are converted to traditional characters, only 242 are missing.

> By the way, the glyphs for all kanji in your list with unicode codepoints startin with F### were not correct

They look okay in my output but wrong in the gist.

I think the addition of radicals and strokes is most important. That also covers most of the missing references in the kanjivg attributes and would make kanjivg self-contained.

Jan

msk...@ansuz.sooke.bc.ca

Jul 8, 2012, 2:51:52 PM

to kan...@googlegroups.com

On Sun, 8 Jul 2012, Jan Eichhorn wrote:
> > By the way, the glyphs for all kanji in your list with unicode codepoints
> startin with F### were not correct
>
> They look okay in my output but wrong in the gist.

The font installed in my XFCE Terminal program on a current installation
of Arch Linux displays glyphs that don't agree with the current Unicode
charts. For instance, it displays U+FA4C 社 as something close to
[lr]王豕, but with an extra short crossing stroke overlaid on 豕. That
seems to be a glyph for Unicode's U+FA4A instead of U+FA4C. My guess is
that this font may have been designed for some kind of proposal of
compatibility code points before they actually became part of Unicode. I
don't know what font it is, but if it's a popular free one, it may be the
same one you're seeing in the "gist" on Github.
