Guide me on training or better/practical pre-processing?


John Roxton

Jun 17, 2024, 12:16:51 PM
to tesseract-ocr
I'm using Tesseract 5.3.3

My use-case is to perform OCR on username strings captured from various ROIs of screenshots. These strings are 5-12 characters long and draw from an allowed character set of A-Za-z0-9._

In general, Tesseract already does a pretty good job on my images, but with the particular font in use (I believe it is "Droid Sans"), it often struggles with certain characters or character combinations.

The most common mistake it makes is confusing "O" (capital o) and "0" (zero). Another particularly tricky character is "J" in either case, as the hook of this letter hangs below the baseline in this font. It may also misread an "I" (capital i) as an "l" (lowercase L).

I've found that `--psm 6` usually works best for my use-case.
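For reference, this is roughly how I'm invoking it (a minimal sketch via pytesseract, which just wraps the CLI; the ROI filename is only an example, and I believe the whitelist variable works with the LSTM engine again since 4.1):

```python
# Minimal sketch: --psm 6 plus a character whitelist restricted to my
# allowed set. "username_roi.png" is a placeholder for a cropped ROI.
from PIL import Image
import pytesseract

ALLOWED = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._"

img = Image.open("username_roi.png")
text = pytesseract.image_to_string(
    img, config=f"--psm 6 -c tessedit_char_whitelist={ALLOWED}"
)
print(text.strip())
```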

Reading through the `tesseract-ocr` and `tesstrain` documentation, and learning from what I can find elsewhere online, it seems:
- pre-processing images is recommended over training
- if training is necessary, fine-tuning should be preferred over training from scratch

Even so, I am having great trouble training my own model. I have generated 10,000 `.tif` images of text of assorted string lengths from 5-12 characters, using random combinations of my restricted character set rendered in the "Droid Sans" font, along with associated ground-truth files with matching file names and a `.gt.txt` extension. Additionally, I have many "in-the-field" images (such as those seen below) for which I can provide ground-truth text.
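In case it helps anyone reproduce this, the generation looked roughly like the following (a hedged sketch with Pillow; the font path and output directory stand in for my actual setup):

```python
# Sketch: render random 5-12 character strings from the allowed set as
# .tif images plus matching .gt.txt ground-truth files, as tesstrain
# expects. "DroidSans.ttf" is a placeholder for the real font file.
import random
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont

ALLOWED = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._"
FONT = ImageFont.truetype("DroidSans.ttf", 32)
OUT = Path("ground-truth")
OUT.mkdir(exist_ok=True)

for i in range(10_000):
    text = "".join(random.choices(ALLOWED, k=random.randint(5, 12)))
    left, top, right, bottom = FONT.getbbox(text)
    img = Image.new("L", (right - left + 20, bottom - top + 20), 255)
    ImageDraw.Draw(img).text((10 - left, 10 - top), text, font=FONT, fill=0)
    img.save(OUT / f"sample_{i:05d}.tif")
    (OUT / f"sample_{i:05d}.gt.txt").write_text(text + "\n")
```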


Here are some particularly tricky images I've encountered:

"CJR21" - often misinterpreted as "R21", "QR21", or "gR21"
CJR21.png

"WPJ777" - Interpreted correctly using `--psm 6`
WPJ777.png

"SenorC0le" - A common case of a "0" (zero) misinterpreted as a capital "O"
SeenorC0le.png

"Iamagod" - capital i misinterpreted as a lowercase LIamagod.png

Example of Tesseract's "internal" pre-processing:
Olympic-seat_4-25-3503-screenshot.processed.png

John Roxton

Jun 17, 2024, 12:39:23 PM
to tesseract-ocr
I should clarify my issues with training my own model:
I can generate all the needed data, but I simply cannot find a consistent source to guide me through the LSTM training process. So, in case anyone is wondering: I have not yet successfully trained and tried my own model. I have produced some `.traineddata` files that are larger than the default `eng.traineddata`, but they fail to solve even the few images above. Furthermore, I cannot seem to replicate the training process!

I will also mention that my post-processing solutions, which use a fuzzy-matching process, can be useful with longer strings, but they fail miserably with the shortest strings, where a single misinterpreted character has a far greater impact.
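For illustration, my fuzzy matching is along these lines (a sketch using the stdlib's difflib; the username list and cutoff are made up), and the numbers show exactly why short strings suffer:

```python
# Sketch: match OCR output against a list of known usernames.
import difflib

KNOWN_USERS = ["SenorC0le", "WPJ777", "CJR21", "Iamagod"]

def best_match(ocr_text, cutoff=0.8):
    matches = difflib.get_close_matches(ocr_text, KNOWN_USERS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# One wrong character in a 5-char string gives a ratio of 2*4/10 = 0.8,
# right at the cutoff; the same single error in a 12-char string still
# scores 2*11/24 ~ 0.92, so longer strings survive fuzzy matching.
print(best_match("SenorCOle"))  # -> "SenorC0le" (ratio ~0.89)
print(best_match("QR21"))       # -> None (ratio vs "CJR21" only ~0.67)
```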

John Roxton

Jun 17, 2024, 9:13:10 PM
to tesseract-ocr
Update:
After searching all the threads/discussions and reading posts, I decided to try the example 'ocrd-testset' that comes with `tesstrain`. Following a recommendation @zednop made to another user, I ran the command `make training MODEL_NAME=ocrd START_MODEL=deu_latf TESSDATA=~/tessdata_best MAX_ITERATIONS=10000` and saw significant improvement, which I verified against the default model.

Inspired, I tried training my own model (again) using the "Droid Sans" font with random ground-truth text generated from my limited character set (A-Za-z0-9._) at variable lengths of 5-12 characters, with the tessdata_best eng.traineddata as the starting model. Initially, for the first ~35,000 iterations, training showed signs of improvement, with the BCER decreasing to about 92%. Then I noticed the BCER beginning to rise, so I ended the training. Soon after, I resumed it, hoping the rise wasn't abnormal, but the BCER continued climbing all the way back to 99.99%, at which point I ended the run and haven't restarted it since.

The AIs tell me it's likely due to "over-fitting", something I don't quite understand yet. I am wondering if the arbitrary nature of my generated text might be "short-circuiting" the prediction, and whether I should disable the dictionary.

Any suggestions?

Ger Hobbelt

Jun 19, 2024, 6:05:11 AM
to tesseract-ocr
Couple of general notes, some of which I'm sure you already tried:

- all input images: convert to black text on a white background. Think greyscale, rather than pure binarization. (Pixel values are fed straight into the neural net, so it MAY help if the lighter pixels near the edge of a glyph are not hard pure black; they then "weigh in" differently than they would with the whole thing binarized, as in your "tesseract internal" sample. I now know I messed that one up in a pull request I did some time ago, where Stefan Weil got slightly worse results in his tests versus spot improvements in my own. 😰)
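To make that concrete, a rough sketch with Pillow (the polarity heuristic and the filename are my assumptions, not gospel):

```python
# Sketch: greyscale, dark text on white, contrast-stretched but NOT
# binarized, so soft glyph edges survive. Filename is a placeholder.
from PIL import Image, ImageOps, ImageStat

img = Image.open("username_roi.png").convert("L")  # greyscale

# Heuristic: a mostly-dark image probably has light text on a dark
# background, so invert it to get dark text on white.
if ImageStat.Stat(img).mean[0] < 128:
    img = ImageOps.invert(img)

# Push the background toward white and glyph cores toward black while
# keeping the grey edge pixels intact.
img = ImageOps.autocontrast(img)
img.save("username_roi.prep.png")
```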

- preprocessing: font size is a big factor. See also the tesseract docs (responding from mobile here, sorry, no link; also search the mailing list if you can: the original research behind the "30px" measure comes with a chart). Bottom line: scale/resize your input images and observe the changing confidence numbers output by tesseract (TSV and hOCR output; confidence per character is what's relevant here).
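In code, that experiment could look like this sketch (pytesseract's image_to_data wraps the TSV output; note the TSV reports word-level confidences, which serve as a rough proxy here):

```python
# Sketch: OCR the same image at several scales and compare the mean
# word confidence from the TSV data. Filename is a placeholder.
from PIL import Image
import pytesseract

img = Image.open("username_roi.png").convert("L")

for scale in (1.0, 1.5, 2.0, 3.0):
    resized = img.resize(
        (int(img.width * scale), int(img.height * scale)), Image.LANCZOS
    )
    data = pytesseract.image_to_data(
        resized, config="--psm 6", output_type=pytesseract.Output.DICT
    )
    confs = [float(c) for c in data["conf"] if float(c) >= 0]  # -1 = no text
    avg = sum(confs) / len(confs) if confs else 0.0
    print(f"scale {scale}: mean word confidence {avg:.1f}")
```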

- your input is, by your own definition, essentially fully random, with an alphabet size of 26+26+10+2 = 64 characters. At "word lengths" of 5-12, the acceptable word set ('dictionary') has a size of sum[i=5..12]{64^i}, which is roughly 4.8e21: huge. Even if we reduce the alphabet to capitals only, as a very rough lower estimate to account for human behaviour ("nobody starts their login name with a . dot", ...), you're still looking at upwards of 26^5 ≈ 12 million words. Hence the obvious 😉 conclusion: disable the dictionary for scenarios such as yours.
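Putting numbers and knobs on that (the sum checked in Python, plus the run-time parameters that switch the dictionaries off; load_system_dawg and load_freq_dawg are real tesseract parameters, the filename is an example):

```python
# Sketch: size of the "dictionary" of random 5-12 char words over a
# 64-character alphabet, and a dictionary-free OCR call.
import pytesseract

word_space = sum(64**i for i in range(5, 13))
print(f"{word_space:.1e}")  # ~4.8e21 possible "words"

text = pytesseract.image_to_string(
    "username_roi.png",
    config="--psm 6 -c load_system_dawg=0 -c load_freq_dawg=0",
)
```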


- LSTM looks at the input one VERTICAL SCANLINE at a time, while keeping a kind of "memory" of what came before. While I'm still somewhat vague on the precise internal workings, this implies that the LSTM also "remembers" the previous characters in the input.
Sounds like a Markov chain 🤔...
How far back does that memory go?

 #1: I haven't looked at the tesseract code deeply enough (while still grokking what I'm seeing) to know whether it actually does a bidirectional LSTM scan (probably it does, as that would be more reasonable for RTL language support such as Arabic; besides, published papers mention that bidirectional usually has slightly better prediction performance than unidirectional LSTM), so we must reckon with a left-to-right Markov chain plus a right-to-left one influencing each character's estimate...

 #2: "Markov chain" here means the LSTM engine's predictions are not solely based on looking at the current (vertical) SCANLINE, not only at the couple of scanlines that constitute the current character, but previous and subsequent characters' pixels will influence the current prediction! Think "expect u after q in English" (e.g. "query"), that sort of thing. IIRC there was the mention somewhere of '50 scanlines', which, for 30px fonts, would imply at least 2 'historic' characters. Immediately we complicate matters as such simple numbers are only a very rough and shoddy indication of reality: proportional fonts mean '50 lines' span a varying number of characters ('w' is about as wide as 'iii', f.e.), plus, more importantly, at least as far as I understand LSTM today, that memory is not hard, as in: "looking at the previous 50 as well", no, it's more like, what in finance models is called an EMA (exponential moving average): an LSTM keeps a kind of running summary of what came before, and thus the "history" has a tail into infinity (in both directions), where farther away previously seen characters/scanlines have much less impact than the ones that just came before, i.e. are very close to the current scanline.

All that theory leads to the conclusion that an LSTM neural net shares characteristics with a Markov chain. 

For you, that's relevant because tesseract was trained on a dictionary (constructed by Ray Smith at Google), and if my numbers are anywhere near reality, this means (a) you must unlearn some of the teachings the LSTM encoded from that original training set, since it also encoded the "q is always followed by u" stuff thanks to being trained on a very large dictionary of human-language words (plus some extras like numbers etc.), and (b) a first rough estimate for your own training set, given your alphabet, the near-pure-random sequencing of your "words", and the above very crude Markov-chain impact length of 2 historic characters (both ways if the LSTM is bidirectional, thus a 2+1(current char)+2 = 5 character wide word segment), is alphabetSize ^ relevantSegmentSizeEstimate = 64^5 ≈ 1 billion (1e9) words (!) in your training set, to ensure the LSTM won't be negatively surprised by unexpected inputs afterwards.

That's a heck of a lot of training to do (thanks to the "login name can be anything" rule), unless someone can show me the errors in my reasoning and calculus. (I'm here to learn 🙏)

So ... declaring that "unfeasible", we're going for less safety, less accuracy, less work...

Since your font, as you already noted yourself, has shapes that, like ligatures, blend and overlap when considered from the perspective of the current vertical scanline and its first adjacent neighbour (your "CJ" examples!), the absolute minimum context is 2 characters instead of 5. Slightly better is 3, accounting for the alleged bidirectionality. That means your minimum training set covers 64^2 = 4096 words (all 2-character permutations of your alphabet), or 64^3 = 262,144 combos of 3 characters.

Any combo not trained is (with regrettably high probability) a combo that won't be recognized.

I don't know how much it matters how you combine those 256K 3-char combos into your actual training words (I have yet to tackle tesseract training myself); all I've seen is that combos missing from the training set are detrimental to (tesseract) prediction confidence. An example was a gentleman earlier this year who was OCR-ing Dutch VAT numbers: oops, those start with "NL" and then, often enough, "000" followed by your personal tax identification number (which is a VERY blatant security flaw/leak in Dutch VAT, but I digress). Tesseract very probably never saw that particular "NL000" sequence during training, so recognition was shot to hell: by the time it hit the first zero, the current character's confidence dropped below 70%, which is close to some internal threshold where the machine decides "nah... can't be real. Forget it!", and the result you get is garbage. Ouch!

Ergo: make sure at least those 64^2 combos are in your training set.
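One way (certainly not the only way) to pack all those bigrams into training words of legal lengths, sketched in Python:

```python
# Sketch: enumerate all 64^2 = 4096 bigrams and concatenate them into
# 10-character words, so every bigram is guaranteed to appear at least
# once somewhere in the training text.
import itertools
import random

ALLOWED = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._"

bigrams = ["".join(p) for p in itertools.product(ALLOWED, repeat=2)]
random.shuffle(bigrams)

CHUNK = 5  # 5 bigrams -> a 10-character word, inside the 5-12 range
words = []
for i in range(0, len(bigrams), CHUNK):
    w = "".join(bigrams[i:i + CHUNK])
    while len(w) < 5:  # pad the final short word up to the minimum
        w += random.choice(ALLOWED)
    words.append(w)

print(len(words), "words guarantee all", len(bigrams), "bigrams appear")
```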

Thus it was also a very smart move to take the OCRD training as a base: it shares the certainly-not-a-human-word-in-here characteristic of your usage scenario; not the near-random input you've got, but much closer to it than the base set everyone uses for OCR-ing human-language texts.




Re: overfitting


The way I read it until recently was "the net learns to match the quirks of your particular training set too well, and now it expects those very same quirks everywhere", but that doesn't explain increasing error rates during training.

🤦 What I (think I) missed is the TIMELINE in the training process: overfitting is what happens when the following chain of events occurs. This is how any training is done:

One sample from the training set is taken and fed to the net; the result is observed and compared against the desired output. (German has beautiful control-engineering words for this: "Istwert", the value that is, and "Sollwert", the value that MUST be.) The difference between Istwert and Sollwert is used as a factor and a direction for correcting/adjusting the net (down that rabbit hole: backwards propagation, transfer-function differentiability, gradient descent, ...): the larger the difference, the stronger the adjustment. As the net ages, adjustment factors are reduced to help stabilise the thing. Lots more is involved, but these are the crude training basics.
Now take any training sample, run a single cycle like that, and observe a small error: still not perfect! Hence a tiny adjustment follows, aiming to improve the future outcome for that sample. However, the edge weights being adjusted in a net are used all the time, for every sample: once such a tiny adjustment due to backprop for a single sample impacts OTHER samples' predictions negatively, it can be argued that overfitting is starting to occur. While we may match sample X slightly better, we happen to have (accidentally, but as a consequence of how a net works, fundamentally) decreased the confidence, and thus the prediction quality, for some other samples in the training set. When the next training cycle for those now-worsened samples Y and Z cannot compensate any more through their own net adjustments, the human monitoring outside this inner training loop may start to notice, and that's when we call this behaviour overfitting. It's a gradual thing, and the "magic touch" is knowing when to stop training and/or pull other tricks out of your hat, such as switching training schedule/mechanics. (Fun aside: I see Stefan Weil is coding dropout in an experimental branch this month; I'm very curious what results that will produce. 😁 Dropout is one of the many ideas out there to counteract overfitting; philosophically speaking from my armchair, I'd say the better phrase might be "postponing the moment you cross that pain threshold and call this fitting an overfitting".)
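In toy form, that inner loop looks something like this (a single-weight "net", nothing like tesseract's real LSTM, just the Istwert/Sollwert mechanics):

```python
# Sketch: error-driven adjustment with a learning rate that shrinks as
# the net "ages", per the description above.
weight = 0.0
learning_rate = 0.5

for step, (x, target) in enumerate([(1.0, 0.8)] * 10):
    output = weight * x                  # Istwert: what the net produces
    error = target - output             # Sollwert minus Istwert
    weight += learning_rate * error * x  # adjust toward the target
    learning_rate *= 0.9                 # stabilise as training ages
    print(f"step {step}: output {output:.3f}, error {error:.3f}")
```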

If we take this description of the training timeline into account, overfitting is a human-chosen spot along the timeline where the BCER starts to bend (the knee in the curve), or, when you consider a training set to be a subsample of reality, where the BCER (or whatever other metric you use as your KPI) starts to level off: if further training no longer improves results for the known knowns, you're probably worsening them for the known and unknown unknowns. (Was that Rumsfeld I'm paraphrasing? 🤔 Anyhoo...)
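In monitoring terms, that amounts to something like this sketch (evaluate_bcer is a hypothetical callback you would supply, e.g. running lstmeval on a held-out set at each checkpoint):

```python
# Sketch: stop training once the validation BCER has not improved for
# `patience` consecutive checkpoints, i.e. once the curve levels off.
def train_with_early_stopping(evaluate_bcer, max_checkpoints=100, patience=5):
    best = float("inf")
    stale = 0
    for step in range(max_checkpoints):
        bcer = evaluate_bcer(step)  # hypothetical: returns BCER in percent
        if bcer < best:
            best, stale = bcer, 0   # still improving: keep going
        else:
            stale += 1              # the curve is bending
            if stale >= patience:
                print(f"stopping at checkpoint {step}; best BCER {best:.2f}%")
                break
    return best
```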


Hope that helps give you a bit better feel for what overfitting might constitute.




And to anyone else who sees minor or major errors in my blathering: please do correct me. Thank you! (Nobody moves forward if I fill the ML and the interwebz with more faulty intel about tesseract et al. than is already extant.)






