As far as I can tell, that release only includes 2019 tweaks to the
model files that are config fixes, not retraining.
The idea that retraining stopped because it was no longer necessary
seems a bit of a stretch to me, given the hundreds of languages involved -
for example, the Traditional Chinese training data seems to be
missing quite a few of the standard characters, if I'm interpreting
https://github.com/tesseract-ocr/langdata_lstm/blob/main/chi_tra/chi_tra.unicharset
correctly. (I am not a Chinese speaker, but according to Wikipedia
there are 4808 very common characters, plus 6329 less-common standard
characters, and 18,319 rarely used but still standard characters -
and that file only has 4591 lines, including a bunch of non-Chinese
characters.) Although perhaps languages with simpler character sets
and/or better training data have reached the point where further
training stops helping.
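
For anyone who wants to check, something like this counts how many
entries are actually CJK ideographs (a rough Python sketch; the format
assumption - a leading entry-count line, then one glyph per line with
the character as the first whitespace-separated field - is mine, from
eyeballing the file):

    # Rough count of CJK ideographs in a unicharset file. Assumes the
    # langdata_lstm format: a leading entry-count line, then one glyph
    # per line with the character as the first whitespace-separated field.
    def count_cjk(path):
        total = cjk = 0
        with open(path, encoding="utf-8") as f:
            next(f)  # skip the entry-count line
            for line in f:
                fields = line.split()
                if not fields:
                    continue
                glyph = fields[0]
                total += 1
                # CJK Unified Ideographs (U+4E00-U+9FFF) plus Extension A
                # (U+3400-U+4DBF); the rarer extension blocks are ignored.
                if any(0x4E00 <= ord(c) <= 0x9FFF or
                       0x3400 <= ord(c) <= 0x4DBF for c in glyph):
                    cjk += 1
        return total, cjk

    total, cjk = count_cjk("chi_tra.unicharset")
    print(cjk, "CJK entries out of", total)
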
My naive assumption when I originally encountered issues with
tesseract was that there would be some central repository of training
data that we would collaborate on extending and improving in an
open-source way, including examples of bad results on fairly clean
inputs. Given that tesseract is focused on OCR of machine-printed
text in the first place, creating synthetic datasets also seems very
viable - see the sketch below.
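
To illustrate how cheap synthetic line images are to produce, here is
a minimal Pillow sketch (the font path and sample text are
placeholders, and in practice Tesseract's own text2image tool and the
tesstrain pipeline would be the proper route - each image also needs a
matching ground-truth transcription):

    from PIL import Image, ImageDraw, ImageFont

    def render_line(text, font_path, out_path, size=32, pad=8):
        # Measure the text, then render it black-on-white with a little
        # padding - roughly the kind of line image OCR training consumes.
        font = ImageFont.truetype(font_path, size)
        probe = ImageDraw.Draw(Image.new("L", (1, 1)))
        left, top, right, bottom = probe.textbbox((0, 0), text, font=font)
        img = Image.new("L", (right - left + 2 * pad,
                              bottom - top + 2 * pad), 255)
        ImageDraw.Draw(img).text((pad - left, pad - top), text,
                                 font=font, fill=0)
        img.save(out_path)

    # Placeholder font path and text; pair each image with a .gt.txt
    # transcription file if feeding it to the tesstrain pipeline.
    render_line("synthetic training text",
                "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
                "line_0001.png")
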
Just to be clear, none of this is intended as a criticism of the
contributors to this project - just an attempt to understand the
situation.