Table of kanji equivalents


Darren Cook

Jan 14, 2021, 5:39:23 AM
to edict-...@googlegroups.com
Does anyone have, or know of, a ready-made list that converts "exotic"
kanji into their simpler forms? It would look something like:

學,学
國,国
體,体
燈,灯
辨,弁
瓣,弁
辯,弁
麵,麺
鷽,鴬
鶯,鴬

(Those came from https://en.wikipedia.org/wiki/Extended_shinjitai )

I'm expecting such a list would have 3000+ entries to be useful.

My motivation is to normalize any text to use as few different Unicode
code points as possible, without losing any meaning or nuance (except for
the nuance of being formal or archaic). That might even mean a few
joyo kanji end up on the left side of the list.

(I realize this will get subjective, e.g. putting uso and uguisu in the
list, when they are different species of bird, might be crossing the line.)
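As a sketch of what I have in mind, a two-column table like the one above could be loaded and applied character by character. The file name, the one-pair-per-line CSV format, and the function names here are my assumptions, not an existing tool:

```python
# Sketch of applying a "variant,simpler" table like the one above.
# File name and one-pair-per-line CSV format are assumptions.

def load_table(path):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                variant, simpler = line.split(",")
                table[variant] = simpler
    return table

def normalize(text, table):
    # Replace each character that has a simpler equivalent; leave the rest.
    return "".join(table.get(ch, ch) for ch in text)

table = {"學": "学", "國": "国", "體": "体"}
print(normalize("大學と身體", table))  # → 大学と身体
```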

I'm hoping someone already maintains such a list. If not, do you think
all the information needed is already in JMdict? (I'm thinking that if a
character only appears in entries where another way of representing it
is listed first, then I can extract that as a conversion. A quick check
tells me 國 to 国 would be extracted by that algorithm, but 鶯 and 鴬
would stay distinct.)
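That extraction heuristic could be sketched roughly as below. The entries are toy data with invented ordering to illustrate the filtering, not real JMdict content, and parsing the actual XML is left out:

```python
# Sketch of the extraction heuristic: for each entry, take its kanji
# forms in listed order and propose variant -> first-form character
# mappings; then keep only characters that never head an entry
# themselves and that map to a single target. Toy data, not JMdict.

from collections import defaultdict

def propose_mappings(entries):
    """entries: list of kanji-form lists, most common form listed first."""
    appears_first = set()          # characters seen in a first-listed form
    candidates = defaultdict(set)
    for forms in entries:
        if not forms:
            continue
        first = forms[0]
        appears_first.update(first)
        for variant in forms[1:]:
            if len(variant) != len(first):
                continue           # only align forms of equal length
            for v_ch, f_ch in zip(variant, first):
                if v_ch != f_ch:
                    candidates[v_ch].add(f_ch)
    return {v: targets.pop() for v, targets in candidates.items()
            if v not in appears_first and len(targets) == 1}

entries = [
    ["国", "國"],   # 國 appears only as a variant, so 國→国 is proposed
    ["鴬", "鶯"],   # invented ordering: 鶯 would map to 鴬 here...
    ["鶯谷"],       # ...but 鶯 also heads an entry, so it is dropped
]
print(propose_mappings(entries))  # → {'國': '国'}
```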

Darren

P.S. I'm also looking to do the same with Chinese, both traditional and
simplified; any leads on that would also be gratefully received.

Ben Bullock

Jan 14, 2021, 6:19:07 AM
to edict-...@googlegroups.com
On Thu, 14 Jan 2021 at 19:39, Darren Cook <dar...@dcook.org> wrote:
> Does anyone have, or know of, a ready-made list that converts "exotic"
> kanji into their simpler form. It would look something like:
>
> I'm expecting such a list would have 3000+ entries to be useful.

Why is that?

> My motivation is to normalize any text to use as few different unicode
> code points as possible without losing any meaning or nuance (except for
> the nuance of being formal or archaic). That might even mean a few
> joyo-kanji end up in the left side of the list.
>
> A quick check tells me 國 to 国 would be extracted by that
> algorithm, but 鶯 and 鴬 would stay distinct.
>
> Darren
>
> P.S. I'm also looking to do the same with Chinese, both traditional and
> simplified; any leads on that would also be gratefully received.

There are at least four Perl modules which contain the information:


For example there is a table here:


That seems easy to use since it is in UTF-8.

 
 

Darren Cook

Jan 14, 2021, 6:36:55 AM
to edict-...@googlegroups.com
> I have a list here:
> ...

Thanks, and for the other links.

>> I'm expecting such a list would have 3000+ entries to be useful.
>
> Why is that?

Apparently the Asahi Shinbun corpus had 4,476 unique kanji; the most
frequent 2,000 cover only 99.72% of uses, and the top 3,000 still only
99.97% [1]. So I'm guessing most of that remaining 0.28%, representing
2,476 kanji, could be replaced with a simpler character and not lose
any meaning.

And Kanji Kentei tests on "all 6355 kanji in levels 1 and 2 of JIS X
0208" [2]. Same reasoning: I'm assuming the large majority of those
will have a simpler equivalent, even the ones used only for people's
names and place names.

Darren

[1]: From my notes; sorry, I don't have the original reference recorded.
[2]: https://en.wikipedia.org/wiki/Kanji_Kentei#Level_1

Ben Bullock

Jan 14, 2021, 6:56:58 AM
to edict-...@googlegroups.com
On Thu, 14 Jan 2021 at 20:36, Darren Cook <dar...@dcook.org> wrote:

>> I'm expecting such a list would have 3000+ entries to be useful.
>
> Why is that?

> Apparently the Asahi Shinbun corpus had 4,476 unique kanji; 2000 only
> covers 99.72% of the uses, 3000 still only covers 99.97% frequency [1].
> So I'm guessing most of that 0.28%, representing 2,476 kanji, could be
> replaced with a simpler character and not lose any meaning.

I don't think that all of these rare kanji can be replaced with another one. Usually they would be replaced with a phonetic character.
 

Jim Breen

Jan 15, 2021, 4:02:02 PM
to edict-...@googlegroups.com
I don't know of any ready-made comprehensive list. The ones mentioned so far look interesting.

At one stage I toyed with trying to include a normalisation process within wwwjdic, but it seemed better to concentrate on identifying and adding variant forms to JMdict. In fact it would be an interesting little project to generate potential variant forms of compounds and then use the ngrams to detect whether they are common enough to record.
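That generate-and-filter idea could be sketched roughly like this. The variant table, the counts, and the threshold are all placeholders; a real run would query an actual n-gram corpus rather than a toy dictionary:

```python
# Rough sketch of the variant-generation idea: substitute known variant
# characters into a compound in every combination, then keep only forms
# common enough in an n-gram corpus to be worth recording. The variant
# table, counts, and threshold here are placeholders.

from itertools import product

VARIANTS = {"学": ["學"], "国": ["國"], "体": ["體"]}

def candidate_forms(word):
    # For each character, the character itself plus any known variants.
    choices = [[ch] + VARIANTS.get(ch, []) for ch in word]
    return ["".join(c) for c in product(*choices) if "".join(c) != word]

def common_variants(word, ngram_count, threshold=20):
    return [f for f in candidate_forms(word) if ngram_count(f) >= threshold]

counts = {"大學": 150, "中國": 3}          # stand-in for real n-gram counts
lookup = lambda form: counts.get(form, 0)
print(common_variants("大学", lookup))     # → ['大學']
print(common_variants("中国", lookup))     # → [] (too rare to record)
```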

Jim



--
You received this message because you are subscribed to the Google Groups "EDICT-JMdict" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/edict-jmdict/ef09f8ee-d06b-bd6b-0103-1fa43a3eadf0%40dcook.org.