Kanjidic skip code comments and corrections


Ben Bullock

Feb 1, 2021, 2:55:30 AM
to edict-...@googlegroups.com
I've made a page here with some SKIP code comments and corrections:

https://kanji.sljfaq.org/news/skip.html

I started out the project with an old kanjidic2.xml from 2015 and found that one of the corrections I made had already been applied (the one in pink). The kanjidic2.xml has since been updated to the latest one.

The comments part of the page is also available in machine-readable format in case that makes things easier:


The keys of each segment are k for kanji, skip for the current skip code of kanjidic2.xml, disc for discussion of what seems to be the problem, and co for my suggested correction.
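
As a rough illustration of how such a file could be consumed, here is a minimal Python sketch. It assumes the file is a JSON array of objects carrying the four keys described above; that top-level layout, the sample entry's `disc` text, and the function name `proposed_changes` are all assumptions made for illustration, not taken from the actual file.

```python
import json

# Hypothetical sample in the shape described above: one object per kanji,
# with keys k (kanji), skip (current code), disc (discussion), co (correction).
# The top-level list layout and the "disc" text are assumptions.
SAMPLE = """
[
  {"k": "太", "skip": "2-3-1", "disc": "example discussion text", "co": "4-4-4"}
]
"""


def proposed_changes(text):
    """Return (kanji, current, suggested) for entries whose correction differs."""
    changes = []
    for entry in json.loads(text):
        suggested = entry.get("co")
        if suggested and suggested != entry["skip"]:
            changes.append((entry["k"], entry["skip"], suggested))
    return changes


for kanji, current, suggested in proposed_changes(SAMPLE):
    print(f"{kanji}: {current} -> {suggested}")
```

Keeping the current code next to the suggestion makes it easy to skip entries that a newer kanjidic2.xml has already fixed.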

The current content represents two checks out of several possible ones, so there are more of these to come, but rather than send a lot of email here I'll just add them to the above page as they get done.

Jim Breen

Feb 2, 2021, 1:42:05 AM
to edict-...@googlegroups.com
Thanks for that list of possibly incorrect SKIP codes.

Most of the kanji are from the JIS X 0212 set. The SKIPs for them were
compiled in the mid-late 90s by volunteers and may well have a lot of
errors. I don't think any of the JIS212 kanji are in Jack Halpern's
dictionaries. Looking at the suggestions, it seems most of them are
quite sound. I'll go through the data files for those kanji and change
the codes over.

One I won't change is 太 (currently 2-3-1 but proposed to be 4-4-4).
This is a common JIS 208 kanji and is in Halpern's books as 2-3-1. In
fact Jack has it indexed by 4-4-4 as well as a mis-classification. The
kanjidic2.xml has:
<q_code qc_type="skip">2-3-1</q_code>
[...]
<q_code qc_type="skip" skip_misclass="posn">4-4-4</q_code>
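
For anyone reading these codes programmatically, the following is a minimal Python sketch of separating the primary SKIP code from the misclassification codes. The `q_code` elements and their `qc_type` and `skip_misclass` attributes are as quoted above; the `<character>`, `<literal>` and `<query_code>` wrapper in the sample is an assumption about the file layout, and `skip_codes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample entry: the q_code lines are quoted from the message above; the
# wrapper elements around them are an assumed layout for illustration.
SAMPLE = """
<character>
  <literal>太</literal>
  <query_code>
    <q_code qc_type="skip">2-3-1</q_code>
    <q_code qc_type="skip" skip_misclass="posn">4-4-4</q_code>
  </query_code>
</character>
"""


def skip_codes(character):
    """Return (primary SKIP code, [(misclassification kind, code), ...])."""
    primary = None
    misclass = []
    for q in character.iter("q_code"):
        if q.get("qc_type") != "skip":
            continue  # other query-code types are ignored here
        kind = q.get("skip_misclass")
        if kind is None:
            primary = q.text
        else:
            misclass.append((kind, q.text))
    return primary, misclass


character = ET.fromstring(SAMPLE.strip())
print(character.findtext("literal"), skip_codes(character))
```

A check that compares only the primary code against a proposed correction would flag 太, so any such comparison should look at the misclassification codes as well.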

I'll reply back here when I have gone through the list.

Jim
> --
> You received this message because you are subscribed to the Google Groups "EDICT-JMdict" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to edict-jmdict...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/edict-jmdict/CAN5Y6m-m7DxmWJSpVA%2B_nkqufNXfc2jNfUzh72p9mjeA0d7aiQ%40mail.gmail.com.



--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/

Jim Breen

Feb 2, 2021, 6:00:34 AM
to edict-...@googlegroups.com
Cases where I don't agree with Ben's suggestions.

禸 - currently 2-1-4. Ben suggests 4-4-4. I think it's 4-5-1. It's like
襾 which is 4-6-1.
惞 - I think this is 1-3-8. 慚 is 1-3-11.
隳 - 2-12-6 more likely

I've made these (and the other changes Ben suggested). They'll be in
the next Kanjidic release.

Jim

Ben Bullock

Feb 2, 2021, 6:35:46 AM
to edict-...@googlegroups.com
On Tue, 2 Feb 2021 at 15:42, Jim Breen <jimb...@gmail.com> wrote:
> Thanks for that list of possibly incorrect SKIP codes.

Thanks for looking into that. 
 
> Most of the kanji are from the JIS X 0212 set. The SKIPs for them were
> compiled in the mid-late 90s by volunteers and may well have a lot of
> errors. I don't think any of the JIS212 kanji are in Jack Halpern's
> dictionaries. Looking at the suggestions, it seems most of them are
> quite sound. I'll go through the data files for those kanji and change
> the codes over.

I'm gradually adding things to the file, so what I'll do is pink-out the done ones as new versions of kanjidic2.xml are released. It's rather a fiddly job and can't be done instantly, since each one needs to be checked.

One I would like to draw your attention to is 隺 for which kanjidic appears to have an incorrect stroke count of 11, it should be 10. Amusingly it's possible to work out which kanji sites are using kanjidic for their information source by looking at what stroke count this has at each web site.

Also I would guess that Halpern has 竸 in the dictionary, but kanjidic has two different things for it, 1-11-11 and 2-2-8, both mathematically unlikely given the symmetry in the character (how would two identical parts result in an odd number of strokes or a different number of strokes left and right?)
 

> One I won't change is 太 (currently 2-3-1 but proposed to be 4-4-4).
> This is a common JIS 208 kanji and is in Halpern's books as 2-3-1. In
> fact Jack has it indexed by 4-4-4 as well as a mis-classification. The
> kanjidic2.xml has:
> <q_code qc_type="skip">2-3-1</q_code>
> [...]
> <q_code qc_type="skip" skip_misclass="posn">4-4-4</q_code>

Sorry I neglected the extra SKIP codes in my analysis.
 
> I'll reply back here when I have gone through the list.

According to the version of kanjidic2.xml mentioned on the page above, there are 13108 characters in total but only 12156 have skip codes.

I've put skip codes for the remaining ones here:

https://kanji.sljfaq.org/news/missing.json


The ids-no-decomp ones are all set to 4-*-4 but they might be 4-*-1 or other. The "by-hand" ones are ones I did myself. The others were all computed in an automated fashion from the IDS data, using the J (Japanese) decomposition where it was available.
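
To make the idea of deriving a SKIP code from IDS data concrete, here is a minimal Python sketch. It is not the procedure actually used for the file above: the mapping from IDS operators to SKIP patterns, the function names, and the small stroke table are all assumptions for illustration. Characters without a usable decomposition fall back to 4-*-4, as described above.

```python
# Ideographic Description Characters (U+2FF0..U+2FFB) and their operand counts.
ARITY = {chr(c): 2 for c in range(0x2FF0, 0x2FFC)}
ARITY["\u2ff2"] = 3  # ⿲ left-middle-right
ARITY["\u2ff3"] = 3  # ⿳ top-middle-bottom

# Assumed mapping from the top-level IDS operator to a SKIP pattern number.
PATTERN = {"\u2ff0": 1, "\u2ff2": 1,  # ⿰ ⿲ -> left-right
           "\u2ff1": 2, "\u2ff3": 2}  # ⿱ ⿳ -> up-down
PATTERN.update({chr(c): 3 for c in range(0x2FF4, 0x2FFB)})  # ⿴..⿺ -> enclosure


def parse(ids, pos=0):
    """Parse one IDS node at pos; return (node, next position)."""
    ch = ids[pos]
    if ch not in ARITY:
        return ch, pos + 1  # a plain component character
    children = []
    pos += 1
    for _ in range(ARITY[ch]):
        child, pos = parse(ids, pos)
        children.append(child)
    return (ch, children), pos


def strokes(node, table):
    """Sum the stroke counts of every component under a node."""
    if isinstance(node, str):
        return table[node]
    return sum(strokes(child, table) for child in node[1])


def skip_from_ids(ids, total, table):
    """Guess a SKIP code from an IDS string and the total stroke count."""
    node, _ = parse(ids)
    if isinstance(node, str) or node[0] not in PATTERN:
        return f"4-{total}-4"  # no usable decomposition: placeholder sub-pattern
    first = strokes(node[1][0], table)
    return f"{PATTERN[node[0]]}-{first}-{total - first}"


TABLE = {"日": 4, "月": 4, "艹": 3, "化": 4, "囗": 3, "玉": 5}
print(skip_from_ids("⿰日月", 8, TABLE))  # 明 -> 1-4-4
print(skip_from_ids("⿱艹化", 7, TABLE))  # 花 -> 2-3-4
print(skip_from_ids("⿴囗玉", 8, TABLE))  # 国 -> 3-3-5
```

Only the first component's strokes are looked up; the second number is the total minus the first, so a component missing from the stroke table raises `KeyError` rather than producing a silently wrong code.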

I'll put that page up as HTML also in due course.


Ben Bullock

Feb 2, 2021, 7:49:37 AM
to edict-...@googlegroups.com
I've put the SKIP codes in HTML format here:


The fuchsia-coloured ones are the ids-no-decomp ones which may be wrong.

The pale pink ones are the ones I did "by hand". The others were all generated automatically.

I forgot to mention in the last message, but I used the stroke counts from the Unihan database for this work.

Mike Morrison

Feb 2, 2021, 12:41:16 PM
to edict-...@googlegroups.com
Should 禸 be 4-5-1, or 4-5-4? In the rendered glyphs I'm seeing, I'm not seeing a horizontal stroke at the very top.

Ben Bullock

Feb 2, 2021, 7:07:12 PM
to edict-...@googlegroups.com
On Wed, 3 Feb 2021 at 02:41, Mike Morrison <mi...@mikemorr.com> wrote:
> Should 禸 be 4-5-1, or 4-5-4? In the rendered glyphs I'm seeing, I'm not seeing a horizontal stroke at the very top.

I still think it's 4-4-4, the ム element is usually two strokes as is the 冂, and there is no line at the top. If you look at the 4-5-1 matches here:


none of them have a line going through the top, whereas the 4-5-4 matches do:


(Incidentally I have just found a bug in kanji.sljfaq.org's skip code initialisation, so the above results may show up wrongly, use SHIFT-reload if so.)

Mike Morrison

Feb 2, 2021, 7:13:47 PM
to edict-...@googlegroups.com
I don't have an educated opinion on the stroke count of , but item 4 at http://www.edrdg.org/wiki/index.php/KANJIDIC_Project#Other_Stroke_Patterns may be relevant.


Jim Breen

Feb 2, 2021, 7:34:38 PM
to edict-...@googlegroups.com
On Wed, 3 Feb 2021 at 11:07, Ben Bullock <benkasmi...@gmail.com> wrote:
> On Wed, 3 Feb 2021 at 02:41, Mike Morrison <mi...@mikemorr.com> wrote:
>> Should 禸 be 4-5-1, or 4-5-4? In the rendered glyphs I'm seeing, I'm not seeing a horizontal stroke at the very top.
> I still think it's 4-4-4, the ム element is usually two strokes as is the 冂, and there is no line at the top. If you look at the 4-5-1 matches here:
[...]
As Mike Morrison points out, I've been using 5 as the standard count
for this pattern. I'd better stick to that.
On reflection I think this should be 4-5-4, not 4-5-1. I'm not that
familiar with the 4-* patterns.
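
For readers less familiar with how the three numbers are read, here is a small Python sketch that spells out a SKIP code. It follows the usual description of the system: patterns 1 to 3 split the character into two parts by stroke count, while pattern 4 gives the total stroke count followed by a sub-pattern. The function name and the wording of the output are made up for illustration.

```python
PATTERNS = {1: "left-right", 2: "up-down", 3: "enclosure", 4: "solid"}
SOLID_SUBPATTERNS = {1: "top line", 2: "bottom line", 3: "through line", 4: "other"}


def describe_skip(code):
    """Turn a SKIP code such as '4-5-4' into a short description."""
    pattern, second, third = (int(part) for part in code.split("-"))
    if pattern not in PATTERNS:
        raise ValueError(f"unknown SKIP pattern in {code!r}")
    if pattern == 4:
        # Solid pattern: the second number is the total stroke count and the
        # third selects a sub-pattern rather than counting strokes.
        if third not in SOLID_SUBPATTERNS:
            raise ValueError(f"unknown solid sub-pattern in {code!r}")
        return f"solid, {second} strokes, sub-pattern: {SOLID_SUBPATTERNS[third]}"
    # Patterns 1-3: stroke counts of the first part and of the remainder.
    return f"{PATTERNS[pattern]}, {second} + {third} strokes"


for code in ("4-5-1", "4-5-4", "4-4-4"):
    print(code, "=", describe_skip(code))
```

Read this way, the three candidates for 禸 differ on two separate questions: whether the character has 4 or 5 strokes, and whether there is a line at the top.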

Jim

Jim Breen

Feb 2, 2021, 8:21:08 PM
to edict-...@googlegroups.com
On Tue, 2 Feb 2021 at 22:35, Ben Bullock <benkasmi...@gmail.com> wrote:

> One I would like to draw your attention to is 隺 for which kanjidic appears to have an incorrect stroke count of 11, it should be 10. Amusingly it's possible to work out which kanji sites are using kanjidic for their information source by looking at what stroke count this has at each web site.

Yes, it should be 10 and the SKIP should be 2-3-7. Fixed.

Having been spreading Japanese lexical data around the network world
for about 30 years, it's hard not to come across it all the time. It's
also often possible to detect the sites/systems which don't update
their files. I get a bit cross when people contact me about errors
that were in fact fixed ages ago. I get even crosser when I cop abuse
on various forums for those long-fixed errors. (And don't get me
started on people who prefer to spray criticisms over proposing
corrections.)

> Also I would guess that Halpern has 竸 in the dictionary, but kanjidic has two different things for it, 1-11-11 and 2-2-8, both mathematically unlikely given the symmetry in the character (how would two identical parts result in an odd number of strokes or a different number of strokes left and right?)

Halpern only has this in one of his later dictionaries, of which I
don't have a copy. I think the 1-11-11 is correct; it's consistent
with the 1-10-10 he has for 競. The misclassification code for that one
is 2-10-10, so for 竸 I'm making it 2-10-12. The stroke count of 22 is
supported by several sources, including Unihan.
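
The arithmetic behind this kind of check can be sketched in a few lines of Python. For patterns 1 to 3 the two stroke numbers should add up to the character's stroke count, and for pattern 4 the second number is the stroke count itself. The function names are hypothetical; the codes and counts in the examples are the ones mentioned in this thread.

```python
def skip_total_strokes(code):
    """Total stroke count implied by a SKIP code."""
    pattern, second, third = (int(part) for part in code.split("-"))
    # Pattern 4 stores the total directly; patterns 1-3 store the two parts.
    return second if pattern == 4 else second + third


def consistent(code, stroke_count):
    """True if the SKIP code agrees with the given stroke count."""
    return skip_total_strokes(code) == stroke_count


# 竸 with a stroke count of 22
print(consistent("1-11-11", 22))  # True
print(consistent("2-10-12", 22))  # True
print(consistent("2-2-8", 22))    # False
# 隺 with a stroke count of 10
print(consistent("2-3-7", 10))    # True
```

A mismatch is only a hint that something needs a manual look: misclassification codes that record a stroke-count error will not add up by design.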

> According to the version of kanjidic2.xml mentioned on the page above, there are 13108 characters in total but only 12156 have skip codes.

Yes, the ~900 are the kanji that are in JIS X 0213 but not in JIS X 0212.

> I've put skip codes for the remaining ones here:
>
> https://kanji.sljfaq.org/news/missing.json

Thanks. That looks very useful. I'll add them to the kanjidic file for
JIS213. Great to have them.

Jim

Jim Breen

Feb 7, 2021, 10:54:08 PM
to edict-...@googlegroups.com
Just a bit more follow-up on this.

First, thanks for the SKIPs for the ~900 "new" JIS213 kanji. I ran
them through my updater script and they are now in kanjidic2.xml.

Second, I've looked at the additional kanji at the bottom of the page
at https://news.sljfaq.org/skip.html, in particular the last 4.
For 冄, 韱 and 韯 I think the proposed 4-* SKIPs are probably correct.
I've added them to the database driving kanjidic2.
For 竸, as indicated earlier, I'm pretty sure 1-11-11 is correct.
I'm really not sure about 隺. I'm inclined towards staying with 2-3-8.

Anyway, I'll run all those ones past Jack Halpern and get his opinion/ruling.

Cheers

Jim

Ben Bullock

Feb 7, 2021, 11:21:21 PM
to edict-...@googlegroups.com
On Mon, 8 Feb 2021 at 12:54, Jim Breen <jimb...@gmail.com> wrote:
> Just a bit more follow-up on this.
>
> First, thanks for the SKIPs for the ~900 "new" JIS213 kanji. I ran
> them through my updater script and they are now in kanjidic2.xml.

I'm glad the formatting was not too difficult to work with.

 
> Second, I've looked at the additional kanji at the bottom of the page
> at https://news.sljfaq.org/skip.html, in particular the last 4.
> For 冄, 韱 and 韯 I think the proposed 4-* SKIPs are probably correct.
> I've added them to the database driving kanjidic2.

OK, those seem unlikely.
 
> For 竸, as indicated earlier, I'm pretty sure 1-11-11 is correct.

Definitely correct, sorry, I should have admitted my error.
 
> I'm really not sure about 隺. I'm inclined towards staying with 2-3-8.
>
> Anyway, I'll run all those ones past Jack Halpern and get his opinion/ruling.

 Thanks. 