Uncovered characters
From: Jan Eichhorn <jan.philipp.eichh...@googlemail.com>
Date: Thu, 5 Jul 2012 10:30:49 -0700 (PDT)
Local: Thurs, Jul 5 2012 1:30 pm
Subject: Uncovered characters

I wrote a program to check which characters are currently covered by
kanjivg.

Jouyou (taught-in-school-kanji): Only two characters are missing, they are
5861 塡 ("fill up")
9830 頰 ("cheek")

Jinmeiyou (used-for-names-kanji):
103 items are missing, the list is here:
https://gist.github.com/3054938

Characters/codepoints that are referenced by kanjivg element, original,
phon, or type attributes, but do not have a kanjivg file of their own:
https://gist.github.com/3054901

Also not covered is the Unicode Kangxi radicals codeblock, the Kangxi
strokes and nearly all radical supplements.
All JLPT kanji are covered, as well as all kanji that have a frequency
ranking in kanjidic.
(http://www.csse.monash.edu.au/~jwb/kanjidic.html)
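A coverage check of this kind can be sketched in a few lines of Python. This is only an illustration, not the program described above. It assumes that KanjiVG files are named with the five-digit lowercase hex code point plus ".svg" (optionally with a "-Variant" suffix); the function names and the sample file list are made up for the example.

```python
import re

# Assumed naming scheme: five hex digits, optional "-Variant" suffix, ".svg".
_NAME = re.compile(r"^([0-9a-f]{5})(?:-[^.]+)?\.svg$")

def covered_codepoints(filenames):
    """Collect the code points that have at least one matching file name."""
    covered = set()
    for name in filenames:
        match = _NAME.match(name)
        if match:
            covered.add(int(match.group(1), 16))
    return covered

def missing_characters(wanted, filenames):
    """Return the characters of `wanted` that have no file, in input order."""
    have = covered_codepoints(filenames)
    return [ch for ch in wanted if ord(ch) not in have]

# Hypothetical directory listing; a real run would use os.listdir().
files = ["0586b.svg", "0982c.svg", "04fda.svg", "04fda-Kaisho.svg", "README"]

# U+586B, U+5861, U+982C, U+9830 (the two "fill up" and two "cheek" forms)
for ch in missing_characters("\u586b\u5861\u982c\u9830", files):
    print(f"{ord(ch):04X} {ch}")
```

With the sample listing, only U+5861 and U+9830 are reported as missing, which matches the two Jouyou gaps listed above.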

Did you know that...
The item with the most Bezier points is 03091 ゑ ("hiragana we", 46 points)
The item with most strokes is 09a6b 驫 ("pony farm", 30 strokes)
The item with most element groups is 097c8 韈 ("socks", 20 groups)
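Statistics like these can be gathered by walking the SVG tree. The sketch below is an assumption about how such counts might be made, not the method used above: it treats every `path` element as one stroke and every `g` element carrying a `kvg:element` attribute as one element group. The sample document is a hand-written, simplified stand-in for a real KanjiVG file.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
KVG_NS = "http://kanjivg.tagaini.net"

# Simplified, hand-written sample in the style of a KanjiVG file (U+4E8C).
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:kvg="http://kanjivg.tagaini.net">
  <g id="kvg:StrokePaths_04e8c">
    <g id="kvg:04e8c" kvg:element="\u4e8c">
      <path id="kvg:04e8c-s1" d="M30,30c5,1,10,1,40,0"/>
      <path id="kvg:04e8c-s2" d="M15,80c5,1,10,1,80,0"/>
    </g>
  </g>
</svg>"""

def count_strokes_and_groups(svg_text):
    """Return (number of path elements, number of g elements with kvg:element)."""
    root = ET.fromstring(svg_text)
    strokes = sum(1 for el in root.iter(f"{{{SVG_NS}}}path"))
    groups = sum(
        1
        for el in root.iter(f"{{{SVG_NS}}}g")
        if el.get(f"{{{KVG_NS}}}element") is not None
    )
    return strokes, groups

print(count_strokes_and_groups(SAMPLE))
```

The outer `g` in the sample has no `kvg:element` attribute, so it is not counted as an element group.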

Greetings, Jan


 
From: Karl Rosvold <karl.rosv...@gmail.com>
Date: Fri, 6 Jul 2012 11:30:53 +0900
Local: Thurs, Jul 5 2012 10:30 pm
Subject: Re: [kanjivg] Uncovered characters

Hi Jan,

This is great. Thank you for finding this list and also pointing it out to
the mailing list.

I think the reason why there are some characters missing is that the new
list of Joyo Kanji was officially adopted in 2010, but Ulrich had already
finished the basic data entry before then. Just from glancing at the
characters you listed, I think they are all, or almost all, variants of
other characters which I'm quite certain already exist in KanjiVG. Unicode
doesn't always have separate code points for kanji that have the same
meaning, reading, etc. but are written slightly differently. For example 冷
-- the right side is written differently in mincho and kyokasho fonts, but
there is only one code point. Japanese can sometimes be very stubborn about
the exact way to write kanji for place- and personal names, so with the
2010 Joyo and slightly earlier Jinmei kanji revisions, things that had only
one code point before were identified as officially recognized variants.
Based on the fact that you were able to print these characters out, Unicode
now has code points for them.

You already did a great job finding these missing characters, but would it
also be possible to write some kind of code to find which characters they
are variants of?

For example, by manually searching, I can find that:

5861 塡 ("fill up")      is a variant of: 填 (586B)
9830 頰 ("cheek")    is a variant of 頬 (982C)

Actually, looking at these two, it seems the encodings of the kanji which
they are variants of are very close, so maybe a script is not necessary.
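For part of the list, a script might lean on Unicode's own data. The sketch below is only one possible approach and rests on an assumption: that the character in question is a compatibility ideograph with a single-code-point canonical decomposition, which Python's standard `unicodedata` module exposes. Characters such as U+5861 have no such decomposition, so they would still need a separate variant table (not shown here). The function name is made up for the example.

```python
import unicodedata

def canonical_base(ch):
    """Return the single character `ch` canonically decomposes to, or None."""
    decomp = unicodedata.decomposition(ch)
    # Empty string: no decomposition. Leading "<tag>": compatibility mapping.
    if not decomp or decomp.startswith("<"):
        return None
    parts = decomp.split()
    if len(parts) != 1:
        return None
    return chr(int(parts[0], 16))

for cp in (0xFA4C, 0x5861, 0x9830):
    base = canonical_base(chr(cp))
    target = f"U+{ord(base):04X}" if base else "no canonical decomposition"
    print(f"U+{cp:04X} -> {target}")
```

This reports U+793E for U+FA4C and nothing for U+5861 and U+9830, so the two Joyo cases above cannot be resolved this way.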

One more reason why those kanji were probably not included is that, if I
remember correctly, KanjiVG started as a project to cover JIS X0208
(6879 characters, of which 6355 are kanji) and at least for 'cheek' and
'fill up' the variants are in JIS X0213 but not JIS X0208.

Now that the characters are somehow more mainstream than they were before
2010, in my opinion it is appropriate to add them.

Karl

From: msk...@ansuz.sooke.bc.ca
Date: Fri, 6 Jul 2012 01:26:21 -0500 (CDT)
Local: Fri, Jul 6 2012 2:26 am
Subject: Re: [kanjivg] Uncovered characters

On Fri, 6 Jul 2012, Karl Rosvold wrote:
> For example, by manually searching, I can find that:

> 5861 塡 ("fill up")      is a variant of: 填 (586B)
> 9830 頰 ("cheek")    is a variant of 頬 (982C)

> Actually, looking at these two, it seems the encodings of the kanji which
> they are variants of are very close, so maybe a script is not necessary.

I think that in general we're unlikely to find that "missing" characters
are variants of characters with separate code points.  See, for
instance, this photo I took in Maibara last Summer:
  http://ansuz.sooke.bc.ca/gallery/index.php?display=2011-jtrip-17%2FP1...

Out of four characters, three fail to match Unicode's Japanese example
glyphs:

  湯 - nonstandard simplification.  On the sign the upper right component
       looks like 口, but Unicode's examples for all languages have it
       looking like 日.
  谷 - standard form is on the sign
  神 - Unicode's Japanese example glyph has the small stroke at upper left
       vertical, but the sign has it diagonal, typical of Chinese style.
  社 - Unicode's Japanese example uses the newer form of the left-side
       radical, which looks like 礻, but the sign uses the form that
       looks like 示, typical of Korean style.

I don't think any of these have separate code points for the other forms.
In general, if a glyph really is a "variant" form of a character, then
under Unicode's unification policy it won't have a separate code point;
Unicode would only assign one if they made a mistake (which sometimes
happens) or if some other Unicode policy (in particular, round-trip
compatibility) took precedence over that one.

For the separate code point to be numerically close is even less likely.
The first block of Unified Han characters was assigned in Kang Xi radical
and stroke order, so semantically related characters may end up close if
they both happen to be in that block.  But even that isn't guaranteed, and
if the characters are not both in the first-assigned block of the Unified
Han range, all bets are off.  Rare "variant" characters, if they are
assigned separate code points at all, may tend to be added in later blocks
which would be very far away in code point space.
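To put numbers on "close" and "far away", here is the plain arithmetic for three pairs mentioned in this thread: the two Joyo pairs, and U+FA4C against U+793E as an example from the compatibility block. Nothing here is specific to KanjiVG; the helper name is made up.

```python
# (variant, counterpart) code point pairs taken from this thread
PAIRS = [(0x5861, 0x586B), (0x9830, 0x982C), (0xFA4C, 0x793E)]

def gap(a, b):
    """Distance between two code points, in code point positions."""
    return abs(a - b)

for a, b in PAIRS:
    print(f"U+{a:04X} vs U+{b:04X}: {gap(a, b)} code points apart")
# prints gaps of 10, 4 and 33038 respectively
```

The first two pairs sit in the same block and happen to be neighbours; the third spans two different blocks.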

I don't think this is really a big problem.  There's no particular reason
that KanjiVG needs a separate code point for every database entry.
Already we have multiple XML files for some code points; for instance,
there are three for U+4FDA 俚.  In principle, there's no reason we
couldn't have a database entry for a kanji that had no code point at all.
We just need to have some other way of knowing what to name the file.

--
Matthew Skala
msk...@ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/


 
From: Karl Rosvold <karl.rosv...@gmail.com>
Date: Sun, 8 Jul 2012 18:04:15 +0900
Local: Sun, Jul 8 2012 5:04 am
Subject: Re: [kanjivg] Uncovered characters

Matthew and Jan,

I went ahead and checked the 103 jinmeiyo kanji from Jan's list. It turns
out that all but perhaps one are variants of kanji which are in the JIS
X0208 standard. 99 out of 103 are in JIS Level 3, which is part of the JIS
X0213 standard; 2 are part of a list that JIS calls 補助漢字 ("supplementary
kanji"); 1 has another Unicode variant close by in the list of Unicode
encodings; and I wasn't able to confirm anything for sure about the last one.

Matthew said:

> I think that in general we're unlikely to find that "missing" characters
> are variants of characters with separate code points.  See, for
> instance, this photo I took in Maibara last Summer:
>   http://ansuz.sooke.bc.ca/gallery/index.php?display=2011-jtrip-17%2FP1...

> Out of four characters, three fail to match Unicode's Japanese example
> glyphs:

>   湯 - nonstandard simplification.  On the sign the upper right component
>        looks like 口, but Unicode's examples for all languages have it
>        looking like 日.

==> Right. I don't think there is any character set that has 口 at the top
rather than 日. It doesn't look familiar, so this is probably an
abbreviation akin to the kinds of stroke omissions made on highway signs.
You can see examples of this phenomenon on this website, for example:
http://portal.nifty.com/2009/04/10/a/

>   谷 - standard form is on the sign
>   神 - Unicode's Japanese example glyph has the small stroke at upper left
>        vertical, but the sign has it diagonal, typical of Chinese style.

==> This is the kind of difference that Japanese language teachers would
mark wrong but most people wouldn't notice. But you're right. There aren't
two JIS or Unicode points for this level of variation. That is a font thing
as far as I know.

>   社 - Unicode's Japanese example uses the newer form of the left-side
>        radical, which looks like 礻, but the sign uses the form that
>        looks like 示, typical of Korean style.

==> Well, I don't know when the "F###" encodings were added to Unicode, but
as you can see from this e-mail,  社(U+FA4C) and 社(U+793E) have different
code points now. I have software from 2002 which had ALL but two of these
variations listed, but only gave JIS encodings for the characters in Jan's
list which have F### unicode code points. Therefore it is my guess that
those were added sometime after 2002. It is even possible that they were
added after those variants were officially added to the Jinmeiyo kanji
list.

> I don't think any of these have separate code points for the other forms.
> In general, if a glyph really is a "variant" form of a character, then
> under Unicode's unification policy it won't have a separate code point;

==> This is the Unicode policy as far as I know, but if Unicode refused to
give separate code points to these variants officially recognized by
Japanese law and Japanese Industrial Standards, Unicode would be incapable
of handling all the distinctions Japanese people want to make. I don't have
any references, but I suspect that Unicode violated their "no separate code
points for variants" policy to accommodate the new Japanese policy.

> Unicode would only assign one if they made a mistake (which sometimes
> happens) or if some other Unicode policy (in particular, round-trip
> compatibility) took precedence over that one.

==> Again, I think if Unicode didn't accommodate these distinctions, it
would have no chance to be adopted for use in computer encoding in Japan.

> For the separate code point to be numerically close is even less likely.

==>  For #1-46 out of 103 on Jan's list, it IS close. Please refer to the
attached Excel sheet.
==> For #47-#103, as mentioned above, I think Unicode added these later to
accommodate the new Japanese standard (c. 2010)

> I don't think this is really a big problem.  There's no particular reason
> that KanjiVG needs a separate code point for every database entry.

==> Personally, it would be nice to have a complete set of characters that
are recognized as being 'standard' in Japanese law. That said, other than
on signs like the one you saw in Maibara, I don't think they are used very
much, and they all have variants. For the two in Joyo kanji however, I
think they should be added for sure.

> Already we have multiple XML files for some code points; for instance,
> there are three for U+4FDA 俚.  In principle, there's no reason we
> couldn't have a database entry for a kanji that had no code point at all.
> We just need to have some other way of knowing what to name the file.

==> I will defer to your expertise in that area.

Here is a little more background on these various character sets. It's not
the complete picture by any means, but hopefully it will shed a little more
light onto why this situation has occurred.

Karl

Jan and Matthew,

Here's a summary of the 103 'jinmeiyo' (人名用) kanji which Jan mentioned as
missing from KanjiVG.
For an overview of these kanji, see:
http://en.wikipedia.org/wiki/Jinmeiy%C5%8D_kanji

I suspect the reason these characters were not included in KanjiVG is as
follows:
KanjiVG is primarily a Japanese project, and its characters were drawn from
the JIS X 0208 standard rather than from Unicode, which includes tens of
thousands of characters that are not commonly used in Japanese. JIS X 0208
includes "Level 1 & Level 2" JIS characters (a total of 6879 characters),
whereas JIS X 0213 includes "Levels 1-4" for a total of 11,223 characters.
For a description of the JIS kanji code standard, see:
http://ja.wikipedia.org/wiki/JIS%E6%BC%A2%E5%AD%97%E3%82%B3%E3%83%BC%...

KanjiVG data, as far as I know, was being entered around 2007 +/- a couple
of years. At that time there were only 285 Jinmeiyo kanji. I haven't
explicitly checked, but I think all or almost all of them are included in
JIS X 0208.

In September 2004, 484 new characters and 209 variant forms of joyo kanji
(the 1981 standard) were added, bringing the total to 983 Jinmeiyo kanji.

On November 30, 2010, the Japanese government added 196 characters to the
Joyo kanji list, as well as shifting some from the Jinmeiyo kanji list to
the Joyo kanji list.

Unfortunately for computer users, the government's recognition of kanji
variants went against Unicode's policy of providing only one code point to
cover all variants and leaving the variants to font makers. The
government's policy is understandable, however, since it tries to
accommodate Japanese people's insistence that kanji for their names or for
place names be written one particular way.

Of the 103 kanji you reported as being included in the (new) Jinmeiyo kanji
but missing from KanjiVG, I verified that 99 were included in JIS X0213 but
not in JIS X0208. This explains why they were not included in KanjiVG.
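A check like this can be approximated with the codecs that ship with Python. This is a rough sketch, not the method used for the count above, and it rests on two assumptions: that a kanji is in JIS X 0208 exactly when the `euc_jp` codec encodes it as two bytes in the 0xA1-0xFE range, and that it is in JIS X 0213 when the `euc_jis_2004` codec can encode it at all. It is only meant for kanji input; the function names are made up.

```python
def in_jis_x0208(ch):
    """True if euc_jp encodes `ch` as a two-byte JIS X 0208 sequence."""
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return False
    # JIS X 0212 characters come out as three bytes starting with 0x8F.
    return len(raw) == 2 and raw[0] >= 0xA1

def in_jis_x0213(ch):
    """True if the JIS X 0213-based euc_jis_2004 codec can encode `ch`."""
    try:
        ch.encode("euc_jis_2004")
    except UnicodeEncodeError:
        return False
    return True

def classify(ch):
    if in_jis_x0208(ch):
        return "in JIS X 0208"
    if in_jis_x0213(ch):
        return "in JIS X 0213 but not JIS X 0208"
    return "in neither"

for cp in (0x586B, 0x5861, 0x982C, 0x9830):
    print(f"U+{cp:04X} {chr(cp)}: {classify(chr(cp))}")
```

For the two Joyo pairs discussed earlier, this puts U+586B and U+982C in JIS X 0208 and U+5861 and U+9830 in JIS X 0213 only.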

For the remaining 4, U+7626 and U+7E6B were included in a list referred to
as "補助漢字" ("supplementary kanji") recognized by JIS, but not included in
either the JIS X0208 or JIS X0213 standards.
Unicode U+4FF1 does not appear to be recognized in any JIS standard,
although its JIS X0208 variant (U+5036) does exist in both Unicode and JIS.
Finally, U+541E is very similar in appearance to U+5451, which is a JIS X0208
character, but I couldn't find anything that indicated whether they are
variants of each other or completely different characters.

By the way, the glyphs for all kanji in your list with Unicode code points
starting with F### were not correct -- or at least they were somehow
transformed to their JIS X0208 equivalents when I copied them from the link
on github. I was able to successfully copy and paste the characters from
http://www.cojak.org/index.php?term=F9D0&function=code_lookup by entering
the Unicode code that you listed. The glyph you listed made it easy for me
to find the JIS X0208 versions of the characters.
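One plausible explanation for this kind of transformation, offered here as an assumption and not something confirmed for the gist or the browser involved, is Unicode normalization: compatibility ideographs are replaced by their unified counterparts under every normalization form, including NFC. A small sketch with Python's `unicodedata` module (the function name is made up):

```python
import unicodedata

def survives_nfc(ch):
    """True if NFC normalization leaves the character unchanged."""
    return unicodedata.normalize("NFC", ch) == ch

original = "\ufa4c"  # CJK COMPATIBILITY IDEOGRAPH-FA4C
normalized = unicodedata.normalize("NFC", original)

print(f"U+{ord(original):04X} -> U+{ord(normalized):04X}")
print("U+FA4C survives NFC:", survives_nfc(original))
print("U+793E survives NFC:", survives_nfc("\u793e"))
```

Any tool in the copy-and-paste chain that normalizes text would therefore silently turn an F### character into its ordinary counterpart.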

I still think it's good that you found these, and I think they should be
added to KanjiVG if possible. I hope this sheds a bit of light as to what
those characters are, and why they were not included in KanjiVG from the
beginning. Also, in my experience, many software packages seem to have
difficulty handling characters that are outside the JIS X0208 character
set, although I seem to be able to display everything in Microsoft Excel
now.

Karl


 
From: Karl Rosvold <karl.rosv...@gmail.com>
Date: Sun, 8 Jul 2012 18:05:15 +0900
Local: Sun, Jul 8 2012 5:05 am
Subject: Re: [kanjivg] Uncovered characters

Here's the Excel file.

  KanjiVG missing Jinmeiyo kanji.xlsx (24K attachment)

 
From: msk...@ansuz.sooke.bc.ca
Date: Sun, 8 Jul 2012 07:49:05 -0500 (CDT)
Local: Sun, Jul 8 2012 8:49 am
Subject: Re: [kanjivg] Uncovered characters

On Sun, 8 Jul 2012, Karl Rosvold wrote:
> ==> This is the kind of difference that Japanese language teachers would
> mark wrong but most people wouldn't notice. But you're right. There aren't
> two JIS or Unicode points for this level of variation. That is a font thing
> as far as I know.

The question is how deeply we want to capture this kind of variation in
KanjiVG.  I think it's likely that in many cases, we *will* want to
capture variations that Unicode doesn't.

> ==> Well, I don't know when the "F###" encodings were added to Unicode, but
> as you can see from this e-mail, 社 (U+FA4C) and 社(U+793E) have different
> code points now. I have software from 2002 which had ALL but two of these

Interesting. I was aware of the existence of the "Compatibility
Ideographs" block starting at U+F900, but hadn't realized it was relevant
to this case.  It looks like U+FA4C was added specifically to capture this
distinction for compatibility with the Japanese standards that give the
two character forms (which Unicode would otherwise unify) separate code
points.  For the reasons you describe, Unicode must permit separating them
now that the Japanese government does.  That is Unicode's own policy:
round-trip compatibility trumps unification.  But it becomes complicated
when a change in an external standard forces Unicode to un-unify two
characters it had formerly unified.

Saying they "have different code points now" may be overly simplistic. We
have the code point U+FA4C specifically for use in systems that require
drawing this distinction for compatibility with Japanese government
standards, but U+793E remains the standard code point for this character
elsewhere regardless of its form!  That's what having a "compatibility"
code point means.  For instance, the Koreans are not going to change their
documents and fonts to use U+FA4C instead of U+793E; Korean fonts will
still have U+793E looking like the Japanese version of U+FA4C rather than
like the Japanese version of U+793E.

> any references, but I suspect that Unicode violated their "no separate code
> points for variants" policy to accommodate the new Japanese policy.

I think they not so much violated it as they have another policy
(round-trip compatibility) that takes precedence over unification.
--
Matthew Skala
msk...@ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/

 
From: Karl Rosvold <karl.rosv...@gmail.com>
Date: Sun, 8 Jul 2012 22:05:06 +0900
Local: Sun, Jul 8 2012 9:05 am
Subject: Re: [kanjivg] Uncovered characters
Hi Matthew.

I see. I didn't choose the best words in my original message, but I agree with everything you say. That's an interesting and accurate point you make about the Korean version of 社。

Karl

From: Jan Eichhorn <jan.philipp.eichh...@googlemail.com>
Date: Sun, 8 Jul 2012 11:15:49 -0700 (PDT)
Local: Sun, Jul 8 2012 2:15 pm
Subject: Re: [kanjivg] Uncovered characters

Hi,
I checked the coverage of JIS X0208. KanjiVG indeed covers all 6355 kanji +
the kana. Not covered are the Latin/Greek/Cyrillic alphabets and the
symbols. The characters belonging exclusively to JIS X0212/X0213 are almost
entirely uncovered. The exceptions are:
(only found in 212)
5F50 彐
6220 戠

(only found in 213)
342C 㐬
34C1 㓁
4EBB 亻
590D 复
5C03 尃
8002 耂
8279 艹
98E0 飠
9EC3 黃
9ED1 黑
200A4 𠂤
26951 𦥑
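For anyone who wants to repeat the JIS X0208 part of this check, the kanji of the standard can be enumerated from Python's bundled `euc_jp` codec. This is a sketch under assumptions, not the program used above: it assumes the kanji occupy rows 16 to 84 of the 94x94 table and that the codec rejects unassigned cells. The `covered` set is a hypothetical stand-in for the code points found in KanjiVG.

```python
def jis_x0208_kanji():
    """Enumerate the kanji rows (16-84) of JIS X 0208 via the euc_jp codec."""
    chars = set()
    for row in range(16, 85):
        for cell in range(1, 95):
            raw = bytes([0xA0 + row, 0xA0 + cell])
            try:
                chars.add(raw.decode("euc_jp"))
            except UnicodeDecodeError:
                pass  # unassigned cell
    return chars

kanji = jis_x0208_kanji()
print(len(kanji), "kanji found in rows 16-84")

# Hypothetical coverage set; a real check would derive it from the file names.
covered = {"\u586b", "\u982c"}
uncovered = kanji - covered
print(len(uncovered), "of them are not in the sample coverage set")
```

The set can then be compared against the covered code points with ordinary set difference, as in the last lines.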

Since I had a list of the most frequent Chinese characters, I checked that
as well. Of the 3000 most frequent simplified characters, 1006 are missing
in kanjivg. If they are converted to traditional characters, only 242 are
missing.

> By the way, the glyphs for all kanji in your list with unicode codepoints
> starting with F### were not correct

They look okay in my output but wrong in the gist.

I think the addition of radicals and strokes is most important. That also
covers most of the missing references in the kanjivg attributes and would
make kanjivg self-contained.

Jan


 
From: msk...@ansuz.sooke.bc.ca
Date: Sun, 8 Jul 2012 13:51:52 -0500 (CDT)
Local: Sun, Jul 8 2012 2:51 pm
Subject: Re: [kanjivg] Uncovered characters

On Sun, 8 Jul 2012, Jan Eichhorn wrote:
> > By the way, the glyphs for all kanji in your list with unicode codepoints
> starting with F### were not correct

> They look okay in my output but wrong in the gist.

The font installed in my XFCE Terminal program on a current installation
of Arch Linux displays glyphs that don't agree with the current Unicode
charts.  For instance, it displays U+FA4C 社 as something close to
[lr]王豕, but with an extra short crossing stroke overlaid on 豕.  That
seems to be a glyph for Unicode's U+FA4A instead of U+FA4C.  My guess is
that this font may have been designed for some kind of proposal of
compatibility code points before they actually became part of Unicode.  I
don't know what font it is, but if it's a popular free one, it may be the
same one you're seeing in the "gist" on Github.
--
Matthew Skala
msk...@ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/

 