[qhanzi] Rebuild of server and additional "oddbod" characters

Ben Bullock

Mar 16, 2023, 9:30:20 PM
to sljfaq.org
I've done a rebuild of the qhanzi.com handwriting recogniser.  There have been some internal changes related to shutting down and starting up the server which I won't detail. I've also added a few "oddbod" characters, including 囙, 䍏, 龷, 吏, and 囧, and derived characters like 㴄 and 㤙.

I've also removed all of the Unicode CJK compatibility ideographs from U+F900 to U+FAFF. These had been removed before, but at some point the code which removed them was commented out, and I had not written down any reason for doing that. As a result, lots of these "funny" compatibility characters, which look just like other characters, were popping up in the list of unknown characters. That was confusing, since they appear to be common characters which should be recognised, but they are actually lookalikes which don't need to be dealt with.

As far as I can understand, "compatibility" means they are there for the purpose of compatibility with other encoding standards rather than for actual use in writing, so I've just removed them again. Perhaps I will eventually remember why I had included them.
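For illustration, here is a minimal Python sketch of that kind of filter. It is not the actual qhanzi.com code; the function names are hypothetical, and the only thing taken from the post is the U+F900 to U+FAFF range. It also uses the standard `unicodedata` module to show why these characters look like duplicates: a compatibility ideograph such as U+F900 normalises to an ordinary character (U+8C48, 豈).

```python
import unicodedata

# Block boundaries as quoted in the post (U+F900 to U+FAFF).
COMPAT_START = 0xF900
COMPAT_END = 0xFAFF


def is_compat_ideograph(ch: str) -> bool:
    """Return True if ch lies in the range U+F900..U+FAFF."""
    return COMPAT_START <= ord(ch) <= COMPAT_END


def drop_compat_ideographs(chars):
    """Return a list of chars with everything in that range removed."""
    return [ch for ch in chars if not is_compat_ideograph(ch)]


def canonical_equivalent(ch: str) -> str:
    """Return the NFC-normalised form of ch.

    For a compatibility ideograph with a canonical decomposition,
    this is the ordinary character it duplicates.
    """
    return unicodedata.normalize("NFC", ch)


if __name__ == "__main__":
    # Hypothetical sample: one compatibility ideograph, its lookalike,
    # and one of the "oddbod" characters mentioned above.
    sample = ["\uF900", "\u8C48", "囧"]
    print(drop_compat_ideographs(sample))
    print(canonical_equivalent("\uF900") == "\u8C48")
```

A range check like this is the bluntest option, since it throws away every code point in the block regardless of whether it has an equivalent elsewhere. Checking `unicodedata.decomposition(ch)` for each character would be a more selective alternative.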

Ben Bullock

Mar 18, 2023, 8:08:57 PM
to sljfaq.org
I got more and more interested in filling in the gaps, so yesterday I spent most of the day adding more of the remaining obscure characters. The number of obscure characters which aren't recognised went from about 520 to 364, so I got about 30% of the remaining ones done.

Some things I added include 㽔, which seems to be an extremely obscure character which only has a Korean reading in Wiktionary, and 奊, as well as oddities such as 乆, 㐄, 㒫, 㘴, and 叏. I also made a simple JavaScript tool for creating the recognition data graphically.

I've also increased the number of rejected characters, such as 㫇, 亪 and 兯, which seem to consist of Korean elements and kanji elements blended together. I'm not sure whether 㽔 should actually be counted in that list, although it seems to have a Cantonese reading according to Wiktionary.

I'm not sure I will ever get all of the characters done though. Things like 飗 and 䧪 are still not recognised.


Ben Bullock

Mar 23, 2023, 7:51:21 PM
to sljf...@googlegroups.com
I've just done another release of the software for qhanzi.com.

This adds another 100 or so characters to the database, such as 㢤, 㢳, 畞, and 舃.

The recognition data for most of these characters has to be constructed one by one, since the characters have very little in common with anything already in the database.

I'm trying to do a little more each day and hopefully get the bulk of these done eventually.
