Does anyone have, or know of, a ready-made list that converts "exotic"
kanji into their simpler forms? It would look something like:
學,学
國,国
體,体
燈,灯
辨,弁
瓣,弁
辯,弁
麵,麺
鷽,鴬
鶯,鴬
(Those came from
https://en.wikipedia.org/wiki/Extended_shinjitai )
I expect such a list would need 3000+ entries to be useful.
My motivation is to normalize any text to use as few different Unicode
code points as possible without losing any meaning or nuance (except for
the nuance of being formal or archaic). That might even mean a few
joyo kanji end up on the left side of the list.
(I realize this will get subjective; e.g. putting uso (鷽) and uguisu (鶯)
in the list, when they are different species of bird, might be crossing
the line.)
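To make the intended use concrete, here is a minimal Python sketch of how I
would apply such a list once it exists. It assumes the list is plain
"old,new" rows with exactly one character on the left, as in the sample
above; the names SAMPLE_CSV, load_table and normalize are just made up for
illustration.

```python
import csv
import io

# A few rows copied from the sample list above; a real list would be
# much longer and would be read from a file instead.
SAMPLE_CSV = "學,学\n國,国\n體,体\n燈,灯\n辨,弁\n瓣,弁\n辯,弁\n麵,麺\n"


def load_table(csv_text):
    """Build a str.translate() table from 'old,new' rows."""
    table = {}
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or malformed rows, and anything whose left side
        # is not exactly one code point.
        if len(row) != 2 or len(row[0]) != 1:
            continue
        old, new = row
        table[ord(old)] = new
    return table


def normalize(text, table):
    """Replace every character found in the table; leave the rest alone."""
    return text.translate(table)


TABLE = load_table(SAMPLE_CSV)
print(normalize("學校と國體", TABLE))
```

Characters that are not in the list (including kana and punctuation) pass
through unchanged, so the whole question really is about the contents of
the list, not the code.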
I'm hoping someone already maintains such a list. If not, do you think
all the information needed is already in JMdict? (My thinking is: if a
character only ever appears in written forms that are not the first one
listed for their entry, then I can extract a conversion from it to the
corresponding character in the first-listed form. A quick check tells me
國 to 国 would be extracted by that algorithm, but 鶯 and 鴬 would stay
distinct.)
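Roughly, the extraction I have in mind could be sketched as below. This is
only a sketch under assumptions: it takes each entry as a plain list of its
written forms with the preferred form first (parsing the JMdict XML into
that shape is not shown), it only aligns forms of equal length character by
character, and the sample entries are toy data I typed in for illustration,
not copied from JMdict. The function name extract_conversions is
hypothetical.

```python
from collections import defaultdict


def extract_conversions(entries):
    """entries: list of lists of written forms, preferred form first.

    Returns {variant_char: standard_char} for characters that never occur
    in a first-listed form and always align with the same replacement.
    """
    # Any character used in a first-listed form counts as "standard".
    primary_chars = set()
    for forms in entries:
        if forms:
            primary_chars.update(forms[0])

    candidates = defaultdict(set)
    for forms in entries:
        if not forms:
            continue
        first = forms[0]
        for variant in forms[1:]:
            if len(variant) != len(first):
                # Cannot align position by position, so mark every
                # non-standard character here as ambiguous (None).
                for ch in variant:
                    if ch not in primary_chars:
                        candidates[ch].add(None)
                continue
            for old, new in zip(variant, first):
                if old != new and old not in primary_chars:
                    candidates[old].add(new)

    # Keep only characters with exactly one, unambiguous replacement.
    return {
        old: next(iter(news))
        for old, news in candidates.items()
        if len(news) == 1 and None not in news
    }


# Toy data for illustration only (assumed, not taken from JMdict).
TOY_ENTRIES = [
    ["国", "國"],
    ["外国", "外國"],
    ["学校", "學校"],
    ["鶯", "鴬"],
    ["鴬色", "鶯色"],
]
print(extract_conversions(TOY_ENTRIES))
```

With this toy data 國 and 學 get extracted, while 鶯 and 鴬 both stay
distinct because each of them appears in some first-listed form. Whether
the real data behaves that way across 3000+ characters is exactly what I
don't know.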
Darren
P.S. I'm also looking to do the same with Chinese, both traditional and
simplified; any leads on that would also be gratefully received.