[Request] Adding component characters

Jaycee Carter

unread,

May 25, 2022, 5:47:39 PM5/25/22

to sljfaq.org

Hi - thanks so much for this tool, as I find it extremely useful when quickly writing out IDS sequences.

I understand that you're trying to work out how to handle characters in extension B and beyond. I wondered if a good place to start might be adding characters from those ranges that are generally used as components of other characters. These tend to fall into two groups: those that are not easily decomposable (and therefore necessary when writing the IDS for a more complex character that contains them), and those which are decomposable but appear often enough in other characters that it would be helpful to avoid having to decompose them.

Some examples drawn from the whole range: 龷𠀁𠂆𠂇𠂉𠂌𠂎𥫗𠂼𠁤𫇦𫝀𫠠𫡏𭅰𰀁𰀉

Ben Bullock

unread,

May 25, 2022, 6:51:17 PM5/25/22

to sljfaq.org

On Thursday, 26 May 2022 at 06:47:39 UTC+9 jecart...@gmail.com wrote:

Hi - thanks so much for this tool, as I find it extremely useful when quickly writing out IDS sequences.

I understand that you're trying to work out how to handle characters in extension B and beyond. I wondered if a good place to start might be adding characters from those ranges that are generally used as components of other characters. These tend to fall into two groups: those that are not easily decomposable (and therefore necessary when writing the IDS for a more complex character that contains them), and those which are decomposable but appear often enough in other characters that it would be helpful to avoid having to decompose them.

I can add the bulk of these characters quite easily to the qhanzi.com search just by a simple change in one line of code. The problem is not just obtaining recognition data for these characters, but the following issues:

1. If I add these characters, the search space becomes enormous, resulting in much longer delays of results.

2. User results get swamped with characters which people may not want or expect. This has already happened as I expanded to cover more and more of Unicode plane 0, and odd things started to appear in search results. If I add in 40,000 more characters it will happen even more.

I am trying to guess whether people want these results or not from looking at the search results. I would say that once or twice in every few hundred inputs someone wants a plane 2 character. Sometimes they draw it again and again, to the extent that I start to feel a bit sorry for them. Sometimes it seems as if they are drawing a plane 2 character, but they might just be confused about a plane 0 character. It's difficult to decide.

Some examples drawn from the whole range: 龷𠀁𠂆𠂇𠂉𠂌𠂎𥫗𠂼𠁤𫇦𫝀𫠠𫡏𭅰𰀁𰀉

Yeah, these will hit the false positive problem (2 above) for the low-stroke-count things. Suppose someone wants katakana ナ and they get 𠂇. Similar for 手 and 𠂌.

As I mentioned above, I actually have much of the data, and I'm deliberately not using it on qhanzi.com at the moment. I'm thinking of making a new site with a new search over all of the Unicode characters. I might opt to have a "search" button like shapecatcher.com rather than automatically search on mouse up.

My main work on the website at the moment is to try to integrate the shape match with the other match, and also get the kanji dictionary improved, so this is not the top priority, but I am thinking about it.

Thank you for an interesting discussion.

Ben Bullock

unread,

May 29, 2022, 8:11:15 PM5/29/22

to sljfaq.org

On Thu, 26 May 2022 at 07:51, Ben Bullock <benkasmi...@gmail.com> wrote:

I can add the bulk of these characters quite easily to the qhanzi.com search just by a simple change in one line of code.

Today I got another one:

This is U+2b55f.

Having made the claim that the data could easily be generated, I thought I ought to test it out, so I tried adjusting the code today to see if I could actually construct them as simply as claimed. Anyway the recognition data was able to be generated, although about 10% of the total could not be constructed, but then the program which processes the data into the format I need actually ran out of memory, so I was not able to run the recogniser to recognise the above.

Reply all

Reply to author

Forward