[Qhanzi] Accuracy increased; data reloading without restart

Ben Bullock

Nov 23, 2019, 6:51:54 AM
to sljfaq.org
I've found a significant improvement to the accuracy of the handwriting recognition at qhanzi.com. The improved algorithm recognises about 2.3% more characters, with about a 4% improvement in position (how close the correct character is to the left-hand end of the list). Unfortunately it does make some cases worse.

The data rebuilds have been improved because they were not being done correctly for some cases due to random ordering of the associative arrays. Currently there are about 19,000 hanzi out of a total of the 21,000 in Unicode.
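To illustrate the ordering problem: in many languages, Go included, iterating over an associative array visits the keys in no fixed order, so a build step which depends on that order can give different results from run to run. The sketch below shows the usual fix, which is to sort the keys before walking the map. It assumes nothing about the actual rebuild scripts; the data and the names `strokes` and `sortedKeys` are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order, so that callers
// can walk the map in the same order on every run.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Hypothetical data: hanzi mapped to stroke counts.
	strokes := map[string]int{"句": 5, "女": 3, "聘": 13}
	for _, k := range sortedKeys(strokes) {
		fmt.Println(k, strokes[k])
	}
}
```

Ranging over `strokes` directly would print the same three lines, but possibly in a different order each time the program runs.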

The application now also has the ability to reload its data file without restarting the process.
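One common way to reload data in a running Go process is to build the new data set first and then swap it in under a lock, so that requests already in progress never see half-loaded data. The following is a minimal sketch of that pattern, not the qhanzi server's code: the `Store` type, the whitespace-separated file format and the `demo` helper are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"sync"
)

// Store holds the current data set. Readers take a read lock; Reload
// builds the replacement first and swaps it in under a write lock.
type Store struct {
	mu      sync.RWMutex
	entries []string
}

// Reload reads path and replaces the data only if reading succeeds, so
// an unreadable file leaves the old data in service.
func (s *Store) Reload(path string) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	fresh := strings.Fields(string(raw))
	s.mu.Lock()
	s.entries = fresh
	s.mu.Unlock()
	return nil
}

// Len reports how many entries are currently loaded.
func (s *Store) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.entries)
}

// demo writes two versions of a temporary data file, reloading after
// each write, and returns the entry counts seen after each reload.
func demo() (before, after int, err error) {
	dir, err := os.MkdirTemp("", "reload")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "data.txt")
	var s Store
	if err = os.WriteFile(path, []byte("a b"), 0o600); err != nil {
		return 0, 0, err
	}
	if err = s.Reload(path); err != nil {
		return 0, 0, err
	}
	before = s.Len()
	if err = os.WriteFile(path, []byte("a b c"), 0o600); err != nil {
		return 0, 0, err
	}
	if err = s.Reload(path); err != nil {
		return 0, 0, err
	}
	return before, s.Len(), nil
}

func main() {
	fmt.Println(demo())
}
```

How the real server is told to reload is not stated here; a signal handler or an administrative request calling something like `Reload` would be typical triggers.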


Ben Bullock

Nov 25, 2019, 4:06:16 AM
to sljfaq.org


On Saturday, 23 November 2019 20:51:54 UTC+9, Ben Bullock wrote:
I've found a significant improvement to the accuracy of the handwriting recognition at qhanzi.com. The improved algorithm recognises about 2.3% more characters, with about a 4% improvement in position (how close the correct character is to the left-hand end of the list). Unfortunately it does make some cases worse.

Another improvement today: I was able to get about 1% more characters recognised, with about a 6% improvement in position.
 
The data rebuilds have been improved because they were not being done correctly for some cases due to random ordering of the associative arrays. Currently there are about 19,000 hanzi out of a total of the 21,000 in Unicode.

No new data today though. 

Ben Bullock

Dec 3, 2019, 1:43:16 AM
to sljfaq.org
I've done another rebuild of the qhanzi server today with some improvements in accuracy for some use cases.

Ben Bullock

Dec 4, 2019, 9:22:54 AM
to sljfaq.org
I've just done another rebuild. This improves the accuracy in a small number of cases.


Ben Bullock

Dec 5, 2019, 1:59:44 AM
to sljfaq.org
I've done another rebuild. I do not predict any change for users. This tidies up things from when the QHanzi server was translated from C into Go. A lot of the lengthy function names and the extensive error checks in the C code have turned out to be unnecessary in the Go version, so I've removed them. A full test on the user data confirmed that this does not affect the accuracy of the server. As usual these rebuilds can cause unexpected problems, which is why I always make a post about them.

Ben Bullock

Dec 11, 2019, 4:15:04 AM
to sljfaq.org
I've done another rebuild with another small improvement in recognition accuracy. For people interested in the exact numbers, here they are:

Before: Total 14563 tests, found: 13620 (93.525%), average position: 2.08142437591777.

After: Total 14553 tests, found: 13675 (93.967%), average position: 2.03948811700183.

"Found" refers to anything which is found within the top twenty, and the average position is the position of the hanzi from the left to the right of the list, so lower is better. Both numbers improved. The eagle-eyed will notice that there are more tests before than after. This is because I actually ran the "Before" tests after the "After" tests, and I'd added ten more tests in the meantime.
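For anyone who wants to see how two numbers like these can be computed, here is a rough sketch. It assumes each test yields the 1-based position of the correct character in the candidate list, with 0 meaning it did not appear at all, and that the average is taken over the found tests only. Both are assumptions, and `summarise` and `topN` are names invented for the example rather than anything from the real test program.

```go
package main

import "fmt"

// topN is the cutoff from the post: a test counts as found if the right
// character appears within the first twenty candidates.
const topN = 20

// summarise takes, for each test, the 1-based position of the correct
// character in the candidate list (0 meaning it did not appear at all).
// It returns the number found and the mean position over found tests.
func summarise(positions []int) (found int, avg float64) {
	sum := 0
	for _, p := range positions {
		if p >= 1 && p <= topN {
			found++
			sum += p
		}
	}
	if found > 0 {
		avg = float64(sum) / float64(found)
	}
	return found, avg
}

func main() {
	// Made-up positions for six tests; 0 and 25 both count as misses.
	positions := []int{1, 1, 3, 0, 2, 25}
	found, avg := summarise(positions)
	pct := 100 * float64(found) / float64(len(positions))
	fmt.Printf("Total %d tests, found: %d (%.3f%%), average position: %g.\n",
		len(positions), found, pct, avg)
}
```

As a check on the percentages quoted, 13675 found out of 14553 tests is 93.967%.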

This rebuild also fixes a very rare bug which would happen under a certain set of circumstances.

This is unrelated to yesterday's problems, which were due to a fileserver issue at the web host plus some sort of communication problem within the web hosts. That affected a lot of other sites as well, apparently.


Ben Bullock

Dec 16, 2019, 9:34:28 PM
to sljfaq.org
I've done another rebuild. This gives small improvements in matching.

Ben Bullock

Dec 17, 2019, 9:48:38 PM
to sljfaq.org
Another rebuild fixes various bugs in the recognition. It's a fairly significant improvement:

Before: Total 15651 tests, found: 14713 (94.007%), average position: 2.05070345952559.
After: Total 15684 tests, found: 14797 (94.345%), average position: 2.02527539366088.

Looking at it inversely, about 6% of inputs were unrecognised before this change, and some of those are now recognised, so only about 5.65% of inputs remain unrecognised.

I'm currently working on a fairly big change in the algorithm.

Ben Bullock

Dec 17, 2019, 11:56:02 PM
to sljfaq.org
I found an arithmetic bug, so another rebuild. Numbers as follows:

Total 15722 tests, found: 14857 (94.498%), average position: 2.01507706804873.

Ben Bullock

Dec 24, 2019, 2:33:37 AM
to sljfaq.org
Another rebuild. Fixing another arithmetic bug, together with some resulting parameter adjustments, has given some more accuracy improvements:

Before: Total 15722 tests, found: 14857 (94.498%), average position: 2.01507706804873.  
After: Total 16333 tests, found: 15459 (94.649%), average position: 2.01125557927421.

This means that around 94.65% of user inputs which a human can recognise unambiguously are being recognised, with the average position of the recognised character 2.011, where 1 would represent perfect recognition, with the character on the left. As usual, the number of tests is growing.

Ben Bullock

Dec 24, 2019, 8:08:24 PM
to sljfaq.org
An error in the build script meant that the version on the website was not being updated correctly. There were some errors in the log about failed user inputs, but I believe the bug which affected those users was just an old bug which is fixed in the new version. The new qhanzi server dated 25 December is now running on the website. I'll be more careful to check the update has succeeded in future.



Ben Bullock

Dec 30, 2019, 7:48:33 AM
to sljfaq.org
Another rebuild and data update. The rebuild slightly improves the accuracy for some kinds of inputs. The main change is the increase in the amount of hanzi/kanji data. Of the 28,032 hanzi in the Unicode zero plane and in IDS, Qhanzi.com now recognises 25,337. These include such things as 䨺 and 尛 and the kanji which contain them as components, various other things I've spotted in the user logs such as 卐 and 堃, and characters like 甹 which are also components of other kanji like 聘 and which had been missing up to now. There is still a lot of very detailed work to do though, unfortunately.

Some of the "compatibility ideographs" like 女 (U+F981), a version of 女 (U+5973) have also been rejected from the data since they clearly are not useful outcomes for a handwriting recogniser.



Ben Bullock

Dec 30, 2019, 10:21:06 AM
to sljfaq.org
On Mon, 30 Dec 2019 at 21:48, Ben Bullock <benkasmi...@gmail.com> wrote:
Another rebuild and data update. The rebuild slightly improves the accuracy for some kinds of inputs. The main change is the increase in the amount of hanzi/kanji data. Of the 28,032 hanzi in the Unicode zero plane and in IDS, Qhanzi.com now recognises 25,337. These include such things as 䨺 and 尛 and the kanji which contain them as components, various other things I've spotted in the user logs such as 卐 and 堃, and characters like 甹 which are also components of other kanji like 聘 and which had been missing up to now. There is still a lot of very detailed work to do though, unfortunately.

I thought of a trick to speed things up and the numbers now look like this:

Total of 25612 hanzi, 1955 of Unicode Han characters not supported, with 27567 possible chars. This is after rejecting about 500 compatibility ideographs, so it is more than it looks compared to the above numbers.

It's now 12:20 am so I think I should call it a day.
 
Some of the "compatibility ideographs" like 女 (U+F981), a version of 女 (U+5973) have also been rejected from the data since they clearly are not useful outcomes for a handwriting recogniser.

I've decided to reject all the compatibility ideographs now. I haven't carefully checked them but a spot check suggests they are all just versions of other things. I'll have to look up what these things are at some point.
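For reference, the compatibility ideographs in plane 0 live in the CJK Compatibility Ideographs block, U+F900 to U+FAFF, so a filter can be a simple range check. One caveat from the Unicode Standard: twelve code points inside that block are ordinary unified ideographs, not versions of other characters, so rejecting the whole block would also drop those. The sketch below is only an illustration of such a filter, and whether qhanzi.com's filter works this way is not stated; the function and variable names are made up.

```go
package main

import "fmt"

// unifiedInCompatBlock lists the twelve code points that sit inside the
// CJK Compatibility Ideographs block but are unified ideographs in
// their own right, not duplicates of other characters.
var unifiedInCompatBlock = map[rune]bool{
	0xFA0E: true, 0xFA0F: true, 0xFA11: true, 0xFA13: true,
	0xFA14: true, 0xFA1F: true, 0xFA21: true, 0xFA23: true,
	0xFA24: true, 0xFA27: true, 0xFA28: true, 0xFA29: true,
}

// isCompatibilityIdeograph reports whether r is a plane 0 compatibility
// ideograph: inside U+F900..U+FAFF and not one of the twelve exceptions.
// It does not check whether the code point is actually assigned.
func isCompatibilityIdeograph(r rune) bool {
	if r < 0xF900 || r > 0xFAFF {
		return false
	}
	return !unifiedInCompatBlock[r]
}

func main() {
	// U+F981 and U+5973 are the pair from the post; U+FA0E is one of
	// the unified ideographs inside the compatibility block.
	for _, r := range []rune{0xF981, 0x5973, 0xFA0E} {
		fmt.Printf("U+%04X compatibility=%v\n", r, isCompatibilityIdeograph(r))
	}
}
```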

 

Ben Bullock

Dec 30, 2019, 10:24:02 AM
to sljfaq.org
On Tue, 31 Dec 2019 at 00:20, Ben Bullock <benkasmi...@gmail.com> wrote:
Total of 25612 hanzi, 1955 of Unicode Han characters not supported, with 27567 possible chars. This is after rejecting about 500 compatibility ideographs, so it is more than it looks compared to the above numbers.

Sorry, this makes no sense. It should say that qhanzi.com now recognises 25612 hanzi out of 27,567 possible Unicode zero plane han characters. The remaining 1,955 characters are not yet supported.
 
It's now 12:20 am so I think I should call it a day.

I think I'll blame the error on the time!

Ben Bullock

Jan 1, 2020, 7:41:20 PM
to sljfaq.org
Another rebuild. This adds:

* Some more data points (unusual kanji)

The actual numbers from the program look like this:

25677 OK, 1890 failed, total 27567.

Here OK is the number of kanji I have data for, 1890 is the number of Unicode plane 0 kanji I have no data for, and the total is the sum of these two numbers.

This increase in the number of kanji actually decreases the recognition accuracy slightly, since more false positives occur.

* Better server error messages

There have been a few problems with the server involving a "panic" caused by some kind of input, and the typical problem with web programming occurred: trying to reconstruct the input which caused the error. There are about 100,000 inputs a day to qhanzi.com, so just searching through them is an issue in itself. You can't even run command-line "grep" on them directly, because there are too many to fit on one command line. Up to now I was just using the Go language http server handler, but what I've done here is to add more detail to the error message produced when a panic occurs, and also to mark the log file generated as one which caused an error. I've never used this facility of the language before, so I'm bracing myself for it to go wrong somehow when put into action.

* A slight improvement in matching

I've added a matching improvement for some kinds of input.
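As a rough illustration of the panic handling described under "Better server error messages": Go's net/http server already recovers from a panic in a handler and logs a stack trace, but that log says nothing about the input which caused it. A common pattern is to wrap the handler and call `recover` in a deferred function, logging the request details with a marker which is easy to search for. The sketch below shows that pattern only. The `withRecover` wrapper, the "PANIC" marker and the `/match` URL are invented for the example and are not taken from the qhanzi server.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"runtime/debug"
	"strings"
)

// withRecover wraps a handler so that a panic is logged together with
// the request that triggered it, instead of only the bare panic message.
func withRecover(next http.Handler, logger *log.Logger) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				// "PANIC" marks the entry so failing inputs are easy to find.
				logger.Printf("PANIC %v: %s %s\n%s",
					err, r.Method, r.URL.String(), debug.Stack())
				http.Error(w, "internal error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

// demo sends one request to a handler that always panics and returns
// the HTTP status seen by the client and the text written to the log.
func demo() (int, string) {
	var buf strings.Builder
	logger := log.New(&buf, "", 0)
	bad := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic("bad stroke data")
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "/match?input=hypothetical", nil)
	withRecover(bad, logger).ServeHTTP(rec, req)
	return rec.Code, buf.String()
}

func main() {
	code, logged := demo()
	fmt.Println(code)
	fmt.Print(logged)
}
```

With a marker like this in the log, finding the inputs which caused a panic becomes a search for one fixed string rather than a hunt through a whole day of inputs.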

I wish all members of this group a happy new year.




Ben Bullock

Jan 2, 2020, 11:54:51 PM
to sljfaq.org
On Thu, 2 Jan 2020 at 09:41, Ben Bullock <benkasmi...@gmail.com> wrote:
This increase in the number of kanji actually decreases the recognition accuracy slightly, since more false positives occur.

Just to illustrate this, one of the new characters I added was 冋, which is a very obscure character but it now tends to crop up in searches. Here is a user input which illustrates the problem:

[Attached image: obscure.png]

The user is clearly trying to write 句, but a known problem with the recognition causes the obscure character to appear first.

I'm working on improving this.

Ben Bullock

Jan 15, 2020, 6:23:13 PM
to sljfaq.org
I'm doing another rebuild of the qhanzi.com software to reflect improvements in the algorithm. The improvement looks like this:

Before: Total 19012 tests, found: 17980 (94.572%), average position: 2.00578420467186.
After: Total 19292 tests, found: 18282 (94.765%), average position: 1.98112897932393.

It is an incremental improvement, but it comes from a fairly new direction in the matching algorithm, which turned out to work quite well.

Ben Bullock

Jan 23, 2020, 3:59:35 AM
to sljfaq.org


On Thursday, 16 January 2020 08:23:13 UTC+9, Ben Bullock wrote:
I'm doing another rebuild of the qhanzi.com software to reflect improvements in the algorithm. The improvement looks like this:

Before: Total 19012 tests, found: 17980 (94.572%), average position: 2.00578420467186.
After: Total 19292 tests, found: 18282 (94.765%), average position: 1.98112897932393.

Another rebuild with more improvements now gives the following accuracy:

Total 19942 tests, found: 18946 (95.006%), average position: 1.9503853056054.

I've also added a small amount of new hanzi/kanji data, based on some missing things I noticed when looking through the user data. There are still a number of things not covered. At some point I think what I'll do is to generate a page describing the matching status of each hanzi/kanji.

 

Ben Bullock

Jan 24, 2020, 7:10:21 PM
to sljfaq.org
I've done another rebuild which was necessitated by the fact that there were some stray print statements in the binary, leading to about 5M of logs this morning. The new method of matching which I'm using is actually something I had tried and given up on a few months ago due to lots of bad results, but I worked out a trick to make it useful by discarding the bad results.

There is also an increase in accuracy of matching in this rebuild. The accuracy improvement meant that actually I found some errors in the test file itself. After removing the errors from the test file (here an error means a picture of a kanji matched with a text kanji which didn't actually represent the kanji drawn, but something else), it turned out that the test numbers don't really reflect the improvements made, so I won't post the summary results. It's still running at just over 95% accuracy. 

By comparison the stroke order dependent version of kanji.sljfaq.org is about 99.9% accurate on my test data. The stroke order independent matching at kanji.sljfaq.org has never been tested properly but it is based on the old version of qhanzi.com's software anyway.

The new approach to matching is working quite well, so I'll be working on more improvements in this direction.



Ben Bullock

Jan 28, 2020, 8:30:53 AM
to sljfaq.org
I've done another rebuild after I found a radical improvement to the matching algorithm:

Before: Total 19993 tests, found: 19001 (95.038%), average position: 1.94589758433767.
After: Total 20228 tests, found: 19391 (95.862%), average position: 1.88040843690372.

This takes us from about 95% accuracy to nearly 96%, with the matching position also improved radically.


Ben Bullock

Jan 29, 2020, 1:28:24 AM
to sljfaq.org
A series of programmer errors caused the computer code to do unnecessary calculations. This has now been fixed. A very small improvement in accuracy was also made. The results are almost identical to the previous version, but the CPU cost of the calculations should be significantly reduced.





Ben Bullock

Feb 6, 2020, 12:53:15 AM
to sljfaq.org
Another rebuild. This is the first time the measured accuracy has gone over 96%:

Total 21237 tests, found: 20393 (96.026%), average position: 1.86868042955916.