why are there no new trained models since 2018?


Liam Doherty

Feb 20, 2024, 12:43:36 AM
to tesseract-ocr
Is this an issue of access to compute resources? access to training data? Are the current models considered as good as they can be?

Thanks,
Liam

W.t

Mar 15, 2024, 2:15:09 AM
to tesseract-ocr
https://github.com/tesseract-ocr/tessdata_best/releases/tag/4.1.0 has models uploaded in 2021. There may be newer ones for 5, but I don't know where they are. 2021 is still a pretty long time ago, though. I suppose they achieved as much as they could for general application, and anything more requires training.

Liam Doherty

Mar 15, 2024, 11:13:15 PM
to tesser...@googlegroups.com
As far as I can tell, that release includes tweaks made in 2019 to
the model files, which are just config fixes, not retraining.

The idea that retraining stopped because it was no longer necessary
seems a bit of a stretch to me, given the 100s of languages involved -
for example, the Traditional Chinese training data seems to indicate
it's missing quite a few of the standard characters, if I'm
interpreting https://github.com/tesseract-ocr/langdata_lstm/blob/main/chi_tra/chi_tra.unicharset
correctly. (I am not a Chinese speaker, but according to Wikipedia
there are 4808 very common characters, plus 6329 less-common standard
characters, and 18,319 rarely used but still standard ones - 29,456
in total - yet that file only has 4591 lines, including a bunch of
non-Chinese characters.) Although perhaps languages with simpler
character sets and/or better training data have hit that ceiling.
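
(If anyone wants to double-check my interpretation, here is roughly
how I counted - a quick sketch, which assumes the first
whitespace-separated field of each unicharset line is the character
itself, which is how the file looks to me:)

# Count CJK Unified Ideographs (U+4E00..U+9FFF) in a unicharset file.
# Assumes chi_tra.unicharset has been downloaded locally.
def count_cjk(path):
    cjk = total = 0
    with open(path, encoding="utf-8") as f:
        next(f)  # the first line is the entry count, not a character
        for line in f:
            fields = line.split()
            if not fields:
                continue
            total += 1
            ch = fields[0]
            if len(ch) == 1 and 0x4E00 <= ord(ch) <= 0x9FFF:
                cjk += 1
    print(f"{cjk} CJK ideographs out of {total} entries")

count_cjk("chi_tra.unicharset")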

My naive assumption when I originally encountered issues with
tesseract was that there would be some central repository of training
data which we would collaborate on extending and improving in an
open-source way, including with examples of bad results on fairly
clean inputs. Given that tesseract is focused on OCR of
machine-created text in the first place, creating synthetic datasets
also seems very viable.
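
(For example, with any freely licensed font, rendering synthetic line
images takes only a few lines of Python - a minimal sketch; the font
path and training-text file are placeholders, and the .gt.txt pairing
follows the convention I have seen in the tesstrain scripts, if I
understand them correctly:)

# Render each line of training text to a .tif image plus a matching
# .gt.txt transcription (the pairing the tesstrain scripts appear to use).
from PIL import Image, ImageDraw, ImageFont

font = ImageFont.truetype("DejaVuSans.ttf", 32)  # placeholder: any free font

with open("training_text.txt", encoding="utf-8") as f:
    for i, text in enumerate(line.strip() for line in f):
        left, top, right, bottom = font.getbbox(text)
        # size the canvas to the rendered text plus a 10px margin
        img = Image.new("L", (right - left + 20, bottom - top + 20), 255)
        ImageDraw.Draw(img).text((10 - left, 10 - top), text, font=font, fill=0)
        img.save(f"line_{i:05d}.tif")
        with open(f"line_{i:05d}.gt.txt", "w", encoding="utf-8") as gt:
            gt.write(text + "\n")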

Just to be clear, none of this is intended as a criticism of the
contributors to this project - just an attempt to understand the
situation.

Tom Morris

Mar 16, 2024, 3:50:56 PM
to tesseract-ocr
On Friday, March 15, 2024 at 11:13:15 PM UTC-4 lfdo...@gmail.com wrote:
My naive assumption when I originally encountered issues with
tesseract was that there would be some central repository of training
data which we would collaborate on extending and improving in an
open-source way, including with examples of bad results on fairly
clean inputs.

Ray Smith has been very generous with his time and Google's resources, but it's a bit of an asymmetric situation, and the open source community, by and large, has not organized around wide-scale retraining. The work that has been done is typically isolated one-offs, with the results not captured and used to improve the state of play. The groups that have put significant resources into training typically have a very focused goal, such as early German blackletter, early modern printing, etc.
 
Given that tesseract is focused on OCR of
machine-created text in the first place, creating synthetic datasets
also seems very viable.

I think one issue with creating synthetic datasets is access to commercially licensed fonts. Google has the resources to purchase licenses for hundreds of commercial fonts and use them to render a great variety of line images, but there's no economical way for them to provide those fonts to the open source community for reuse. 

Training also requires a non-trivial amount of computing resources as well as some specialized knowledge. 

Tom

Liam Doherty

Mar 19, 2024, 1:38:27 AM
to tesser...@googlegroups.com
Thanks, that's helpful. Is the collaboration with Google ongoing then?
Can you give me a sense of what magnitude of computing resources
training on the full dataset involves? Is it simply the days-to-weeks
per model described in the documentation? Would it be reasonable to
continually retrain existing models with additional
community-contributed data, rather than starting from scratch each
time?
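
(For concreteness, by "continually retrain" I mean something like the
fine-tuning flow in the TrainingTesseract documentation, continuing
from an existing checkpoint rather than from scratch - a sketch with
placeholder paths, and the flags only as I understand them from the
docs:)

# Continue training from an existing chi_tra checkpoint on new data.
import subprocess

subprocess.run([
    "lstmtraining",
    "--continue_from", "chi_tra.lstm",        # extracted from the existing .traineddata
    "--traineddata", "chi_tra/chi_tra.traineddata",
    "--model_output", "output/chi_tra_tuned",
    "--train_listfile", "train/list.train",   # list of new community-contributed .lstmf files
    "--max_iterations", "10000",
], check=True)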

Danny

Aug 2, 2024, 9:47:57 PM
to tesseract-ocr
I recently retrained the chi_tra model with a new font. The existing model would confuse certain characters. In addition, the source images (I'm decoding TV subtitles) had a weirdly shaped question mark. In the sample below, the last two characters output as the number "7".

[attachment: chi_tra_7_0_QM.png]

I managed to find and buy a font that was very close to the subtitle font, but its question mark didn't match. So I rendered the text to images without question marks, duplicated the data set, and appended the question-mark image to each line, then merged the two data sets as input to training.
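
The appending step itself was just image pasting, roughly like this
(a simplified sketch - file names are placeholders, and my real script
also appended "?" to each ground-truth transcription):

# Paste the cropped question-mark image onto the right edge of each line.
from PIL import Image
import glob, os

qm = Image.open("qm.png").convert("L")  # the odd question mark, cropped from a frame
os.makedirs("lines_qm", exist_ok=True)

for path in glob.glob("lines/*.tif"):
    line = Image.open(path).convert("L")
    h = max(line.height, qm.height)
    out = Image.new("L", (line.width + qm.width, h), 255)
    out.paste(line, (0, (h - line.height) // 2))
    out.paste(qm, (line.width, (h - qm.height) // 2))
    out.save(os.path.join("lines_qm", os.path.basename(path)))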

Training on my poky old 2.5 GHz i5 CPU takes about 6 to 18 hours of unattended operation. Getting the ground truth sorted out took about 1-2 days of my direct work.

But despite all this training, it still fails with no output at all if the input has an ellipsis or three dots appended:
[attachment: bad_sub_243.png]
This seems to be a problem with the image preprocessing tesseract does when identifying blocks or glyphs or something, rather than a problem with the model. I'm debugging it now, but it is tough going. The code is exactly what you'd expect from a massive C program from 1985 that has been worked on by multiple researcher-types over the past 40 years...
