Matthew and Jan,
I went ahead and checked the 103 jinmeiyo kanji from Jan's list. It turns
out that all but perhaps one are variants of kanji which are in the JIS
X0208 standard. 99 out of 103 are in "JIS LEVEL 3, which is part of the JIS
X0213 standard", 2 are part of a list called 補助漢字 according to JIS, 1 has
another Unicode variant close by in the list of Unicode encodings, and I
wasn't able to confirm anything for sure about the last one.
> I think that in general we're unlikely to find that "missing" characters
> are variants of characters with separate code points. See, for
> instance, this photo I took in Maibara last Summer:
> Out of four characters, three fail to match Unicode's Japanese example
> 湯 - nonstandard simplification. On the sign the upper right component
> looks like 口, but Unicode's examples for all languages have it
> looking like 日.
==> Right. I don't think there is any character set that has 口 at the top
rather than 日. It doesn't look familiar, so this is probably an
abbreviation akin to the kinds of stroke omissions made on highway signs.
You can see examples of this phenomenon on this website, for example:
> 谷 - standard form is on the sign
> 神 - Unicode's Japanese example glyph has the small stroke at upper left
> vertical, but the sign has it diagonal, typical of Chinese style.
==> This is the kind of difference that Japanese language teachers would
mark wrong but most people wouldn't notice. But you're right. There aren't
two JIS or Unicode points for this level of variation. That is a font thing
as far as I know.
> 社 - Unicode's Japanese example uses the newer form of the left-side
> radical, which looks like 礻, but the sign uses the form that
> looks like 示, typical of Korean style.
==> Well, I don't know when the "F###" encodings were added to Unicode, but
as you can see from this e-mail, 社(U+FA4C) and 社(U+793E) have different
code points now. I have software from 2002 which had ALL but two of these
variations listed, but only gave JIS encodings for the characters in Jan's
list which have F### Unicode code points. Therefore it is my guess that
those were added sometime after 2002. It is even possible that they were
added after those variants were officially added to the Jinmeiyo kanji
> I don't think any of these have separate code points for the other forms.
> In general, if a glyph really is a "variant" form of a character, then
> under Unicode's unification policy it won't have a separate code point;
==> This is the Unicode policy as far as I know, but if Unicode refused to
give separate code points to these variants officially recognized by
Japanese law and Japanese Industrial Standards, Unicode would be incapable
of handling all the distinctions Japanese people want to make. I don't have
any references, but I suspect that Unicode violated their "no separate code
points for variants" policy to accommodate the new Japanese policy.
> Unicode would only assign one if they made a mistake (which sometimes
> happens) or if some other Unicode policy (in particular, round-trip
> compatibility) took precedence over that one.
==> Again, I think if Unicode didn't accommodate these distinctions, it
would have no chance to be adopted for use in computer encoding in Japan.
> For the separate code point to be numerically close is even less likely.
==> For #1-46 out of 103 on Jan's list, it IS close. Please refer to the
attached Excel sheet.
==> For #47-#103, as mentioned above, I think Unicode added these later to
accommodate the new Japanese standard (c. 2010).
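
For anyone who wants to check the "numerically close" question without
opening the spreadsheet, here is a minimal Python sketch. It uses only the
three pairs named in this thread. The cutoff for "close" (CLOSE_THRESHOLD)
is my own hypothetical choice, not something defined by Unicode or JIS.

```python
# Pairs taken from this thread: (code point from the list, related code point).
PAIRS = [
    (0x4FF1, 0x5036),  # U+4FF1 and its JIS X0208 variant
    (0x541E, 0x5451),  # similar-looking pair; variant status unconfirmed
    (0xFA4C, 0x793E),  # the F### form of 社 and the ordinary 社
]

# Hypothetical cutoff for what counts as "numerically close".
CLOSE_THRESHOLD = 0x100


def distance(a, b):
    """Absolute difference between two code points."""
    return abs(a - b)


def report(pairs, threshold=CLOSE_THRESHOLD):
    """Return (listed, related, distance, is_close) for each pair."""
    rows = []
    for listed, related in pairs:
        d = distance(listed, related)
        rows.append((f"U+{listed:04X}", f"U+{related:04X}", d, d <= threshold))
    return rows


if __name__ == "__main__":
    for row in report(PAIRS):
        print(row)
```

The first two pairs fall within the cutoff, while the F### pair does not,
which fits the split between the two groups described above.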
> I don't think this is really a big problem. There's no particular reason
> that KanjiVG needs a separate code point for every database entry.
==> Personally, it would be nice to have a complete set of characters that
are recognized as being 'standard' in Japanese law. That said, other than
on signs like the one you saw in Maibara, I don't think they are used very
much, and they all have variants. For the two in the Joyo kanji, however, I
think they should be added for sure.
> Already we have multiple XML files for some code points; for instance,
> there are three for U+4FDA 俚. In principle, there's no reason we
> couldn't have a database entry for a kanji that had no code point at all.
> We just need to have some other way of knowing what to name the file.
==> I will defer to your expertise in that area.
Here is a little more background on these various character sets. It's not
the complete picture by any means, but hopefully it will shed a little more
light onto why this situation has occurred.
Jan and Matthew,
Here's a summary of the 103 'jinmeiyo' (人名用) kanji which Jan mentioned as
missing from KanjiVG.
For an overview of these kanji, see:
I suspect the reason these characters were not included in KanjiVG is as
follows. KanjiVG is primarily a Japanese project, and its characters were
drawn from the JIS X 0208 standard, not from Unicode, which includes
thousands or tens of thousands of characters which are not commonly used in
Japanese. JIS X 0208 includes "Level 1 & Level 2" JIS characters (a total
of 6879 characters), whereas JIS X 0213 includes "Levels 1-4" for a total
of 11,223 characters. For a description of the JIS kanji code standard, see:
KanjiVG data, as far as I know, was being entered around 2007 +/- a couple
of years. At that time there were only 285 Jinmeiyo kanji. I haven't
explicitly checked, but I think all or almost all of them are included in
JIS X 0208.
In September 2004, 484 new characters and 209 variant forms of Joyo kanji
(the 1981 standard) were added, leaving 983 Jinmeiyo kanji.
On November 30, 2010, the Japanese government added 196 characters to the
Joyo kanji list, as well as shifting some from Jinmeiyo kanji to the Joyo
list.
Unfortunately for computer users, the government's recognition of kanji
variants went against Unicode's policy of allowing font makers to take care
of variants and providing only one code point to cover all variants. The
government's policy is understandable, however, since it tries to
accommodate Japanese people's insistence that kanji for their names or for
place names be written one particular way.
Of the 103 kanji you reported as being included in the (new) Jinmeiyo kanji
but missing from KanjiVG, I verified that 99 were included in JIS X0213 but
not in JIS X0208. This explains why they were not included in KanjiVG.
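
If someone wants to repeat this check, a rough sketch in Python follows. It
assumes that Python's built-in "iso2022_jp" codec can stand in for JIS X0208
and "euc_jis_2004" for JIS X0213. Those codec tables may not match every
edition of the standards, so I would treat the result as a first pass rather
than a definitive answer. The function names are just for illustration.

```python
def encodable(ch, codec):
    """Return True if the character can be encoded with the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def classify(ch):
    """Rough JIS membership test for a single kanji.

    Assumption: "iso2022_jp" approximates JIS X0208 and "euc_jis_2004"
    approximates JIS X0213. Note that "iso2022_jp" also accepts ASCII,
    so this is only meaningful for kanji and other Japanese characters.
    """
    if encodable(ch, "iso2022_jp"):
        return "JIS X 0208"
    if encodable(ch, "euc_jis_2004"):
        return "JIS X 0213 only"
    return "neither"


if __name__ == "__main__":
    # Code points mentioned in this thread, written as escapes so that
    # copy and paste cannot silently change them.
    for cp in (0x793E, 0x5036, 0x5451, 0x4FF1, 0x541E, 0x7626, 0x7E6B):
        print(f"U+{cp:04X}", classify(chr(cp)))
```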
For the remaining 4, U+7626 and U+7E6B were included in a list referred to
as "補助漢字" recognized by JIS, but not included in either the JIS X0208 or
JIS X0213 standards.
Unicode U+4FF1 does not appear to be recognized in any JIS standard,
although its JIS X0208 variant (U+5036) does exist in both Unicode and JIS.
Finally, U+541E is very similar in appearance to U+5451, which is a JIS X0208
character, but I couldn't find anything that indicated whether they are
variants of each other or completely different characters.
By the way, the glyphs for all kanji in your list with Unicode code points
starting with F### were not correct -- or at least they were somehow
transformed to their JIS X0208 equivalents when I copied them from the link
on github. I was able to successfully copy and paste the characters from
http://www.cojak.org/index.php?term=F9D0&function=code_lookup by entering
the Unicode code point that you listed. The glyph you listed made it easy for me
to find the JIS X0208 versions of the characters.
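
One possible explanation for the transformation, which I have not confirmed
for the github page itself: code points such as U+FA4C are CJK compatibility
ideographs, and Unicode normalization replaces them with the ordinary
character. If any step between the page and my clipboard normalizes text,
the F### form would be lost. The Python sketch below shows the effect using
the standard unicodedata module; the helper name is just for illustration.

```python
import unicodedata

# Written as escapes so the characters cannot be altered by copy and paste.
COMPAT = "\uFA4C"   # the F### form of 社
UNIFIED = "\u793E"  # the ordinary 社


def survives_normalization(ch, form="NFC"):
    """Return True if the character is unchanged by the given normalization."""
    return unicodedata.normalize(form, ch) == ch


if __name__ == "__main__":
    print(unicodedata.name(COMPAT))
    print("decomposes to:", unicodedata.decomposition(COMPAT))
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        result = unicodedata.normalize(form, COMPAT)
        print(form, f"U+{ord(result):04X}", survives_normalization(COMPAT, form))
```

If this is what happened, then any tool that stores these characters would
need to avoid normalizing the text, or else refer to them by code point.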
I still think it's good that you found these, and I think they should be
added to KanjiVG if possible. I hope this sheds a bit of light on what
those characters are, and why they were not included in KanjiVG from the
beginning. Also, in my experience, many software packages seem to have
difficulty handling characters that are outside the JIS X0208 character
set, although I seem to be able to display everything in Microsoft Excel