Tesseract config for simple single-word text, and questions about learning


Lorenzo Blz

Apr 28, 2018, 8:46:48 AM4/28/18
to tesseract-ocr

Hi, I'm using tesseract to recognize small fragments of text like this (actual images I'm using):

[inline sample images]
Numbers are fixed length (7 digits) and letters are always 2 uppercase characters. I'm using a whitelist (a different one depending on whether the fragment is text or digits; I know this in advance), and it works reasonably well. The size of these fragments is fixed: I rescale them to the same height (54 pixels; I could change this or add borders). They are extracted from smartphone pictures, so the original resolution varies a lot.

I'm using lang "eng+ita" because this gives better results. I'm also using user patterns, but they are not helping much. I'm using the API through the tesserocr Python bindings.
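For reference, the per-fragment whitelist switching described above can be sketched roughly like this (a sketch only: the constants and helper names are mine, and the `api` object is assumed to behave like a `tesserocr.PyTessBaseAPI` created by the caller with `lang='eng+ita'`):

```python
# Sketch of per-fragment whitelist switching. The api object is expected to
# expose the tesserocr PyTessBaseAPI surface used here:
# SetVariable / SetImage / GetUTF8Text / MeanTextConf.

DIGIT_WHITELIST = "0123456789"                    # 7-digit numeric fragments
LETTER_WHITELIST = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # 2-char uppercase fragments

def whitelist_for(kind):
    """Return the character whitelist for a 'digits' or 'letters' fragment."""
    return DIGIT_WHITELIST if kind == "digits" else LETTER_WHITELIST

def read_fragment(api, image, kind):
    """OCR one fragment with the matching whitelist; return (text, confidence)."""
    api.SetVariable("tessedit_char_whitelist", whitelist_for(kind))
    api.SetImage(image)
    return api.GetUTF8Text().strip(), api.MeanTextConf()
```

With tesserocr this would be driven as something like `with tesserocr.PyTessBaseAPI(lang='eng+ita') as api: read_fragment(api, pil_image, 'digits')`, reusing one api instance across the batch.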

I think there are many parameters I can fine-tune. I tried a few (load_system_dawg, load_freq_dawg, textord_min_linesize), but none of them improved the results (a very small textord_min_linesize=0.2 made them worse, so they are being used). I've read the FAQ and the docs, but there are really too many parameters to understand what to change and how.

In particular, my current problem is adaptive learning: when I process a large batch of pictures, the result varies depending on the other fragments. Fragments that are perfectly readable and correctly classified when processed individually give different, wrong results when processed in a batch (that is, when reusing the same API instance for multiple images).

I tried to disable it, but it looks like it cannot be disabled when using multiple languages(?).
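For what it's worth, the parameters usually suggested for turning adaptation off are `classify_enable_learning` and `classify_enable_adaptive_matcher`; whether they actually take effect with multiple languages is exactly what is in question here. A sketch of setting them through a tesserocr-like api (the helper name and dict are mine; verify the parameter names against your build with `tesseract --print-parameters`):

```python
# Assumed parameter names for disabling Tesseract's adaptive classifier;
# check them with `tesseract --print-parameters` before relying on them.
ADAPTATION_OFF = {
    "classify_enable_learning": "0",
    "classify_enable_adaptive_matcher": "0",
}

def apply_params(api, params):
    """Set each variable on a tesserocr-like api; fail loudly on rejected names."""
    for name, value in params.items():
        if not api.SetVariable(name, value):
            raise ValueError("parameter not accepted: %s" % name)
```

`SetVariable` returning False is the only feedback the API gives about an unknown or read-only parameter, so checking it is the cheapest way to catch a typo in a parameter name.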

If I use only "ita" (no whitelist, no learning), the first image in this post is recognized as (text [confidence]):

('5748788\n\n', [81])
('5748788\n\n', [81])
('5748788\n\n', [81])
('5748788\n\n', [81])

With learning (multiple calls, no whitelist, lang: ita):

('5748788\n\n', [81])
('5748788\n\n', [81])
('5748788\n\n', [90])
('5748788\n\n', [90])

so it improves to a higher confidence (I don't know how much the confidence value matters in practice). It looks like learning is doing something good even with no whitelist (I could use the whitelist anyway, just to be sure, but the starting point looks better).

I'm wondering if I can do some kind of "warm-up" with learning enabled and turn it off later (I'll try this today). But how many samples do I need? And it seems a little hacky.
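The warm-up idea could be sketched like this (entirely a sketch: `warmup_then_freeze` and `ocr_one` are my names, `classify_enable_learning` is the parameter I would try to flip, and whether Tesseract honours it mid-session with "eng+ita" is the open question):

```python
def warmup_then_freeze(api, warmup_images, ocr_one):
    """Run the warm-up samples with adaptive learning on, then freeze it.

    api      -- a tesserocr-like object exposing SetVariable
    ocr_one  -- callable(api, image) -> text, used for the warm-up pass
    """
    api.SetVariable("classify_enable_learning", "1")
    for image in warmup_images:
        ocr_one(api, image)          # adaptation happens as a side effect
    # Freeze: later fragments should no longer shift the adapted templates.
    api.SetVariable("classify_enable_learning", "0")
```

Even if the flip works, the adapted templates are per-instance state, so the warm-up would have to be repeated (with the same samples, for reproducibility) every time a fresh api instance is created.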

Or maybe there is some way to print debug information from the learning part, to see which parameters are changed and set them manually later (I tried a few debug params but got no output).

Or maybe it is quite easy to manually find good parameters for this kind of regular text and get close to 90 confidence.

On the "AT" fragment I get 89 confidence, which seems quite low for such simple, clean text.

What I need are consistently good results for the same image in all situations. What do you think?


Thanks, bye

Lorenzo


ShreeDevi Kumar

Apr 30, 2018, 1:50:12 AM4/30/18
to tesser...@googlegroups.com
Try tesseract-4.0.0-beta.

I get correct results with it from the command line:


# tesseract numbers-test.png numbers-test --tessdata-dir ./tessdata_fast -l eng  --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

# tesseract numbers-test2.png numbers-test2 --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

# tesseract letters-test.png letters-test  --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
#




ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com





letters-test.txt
letters-test.png
numbers-test2.txt
numbers-test2.png
numbers-test.txt
numbers-test.png

Lorenzo Bolzani

Apr 30, 2018, 4:20:46 AM4/30/18
to tesser...@googlegroups.com

Hello ShreeDevi,
thanks for your answer. I tried the 4.0 version but I get a different kind of error. And, as far as I know, the whitelist is not yet supported in 4.0, so I decided to go with 3.05, because I think this feature can be important in my case.

I updated and built the 4.0 version just now, and this is what I get (using the command line you provided) on some of the problematic samples:
tesseract 4.0.0-beta.1-163-gd3f6
 leptonica-1.75.3
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : zlib 1.2.8

 Found AVX2
 Found AVX
 Found SSE

Please notice how a small change in the binarization threshold parameters influences the result on numbers-test2.png and d.numbers.png. I realize the error on numbers-test2.png is something you do not see.

Using the 3.05 version with learning: after two "warm-up" "epochs" on 50 samples (with 2 or 3 errors each), I get 100% accuracy for three epochs on those same samples. After that, I sometimes still get one wrong sample. The result is very good, but the fact that it changes "randomly" depending on the provided data is not something I'm comfortable with in a production environment.

It also makes it very difficult to fine-tune and to evaluate the impact of changes and the final performance.

This is why I'm considering disabling learning as soon as I'm happy with the improvements I get: I'd like to reproduce them in a controlled way. Maybe I can print all the parameters after each epoch and see if there are changes, but I suspect the fine-tuning is internal and does not affect user-provided values directly.
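Dumping and diffing parameter values between epochs could look roughly like this (a sketch: it assumes the api exposes `GetVariableAsString` as the tesserocr bindings do, and `PARAMS_TO_WATCH` is just an illustrative list, not an exhaustive one):

```python
# Illustrative subset of parameters to watch between epochs; extend with
# whatever `tesseract --print-parameters` lists for your build.
PARAMS_TO_WATCH = [
    "classify_enable_learning",
    "classify_enable_adaptive_matcher",
    "textord_min_linesize",
]

def snapshot(api, names):
    """Read the current values of the given variables from a tesserocr-like api."""
    return {name: api.GetVariableAsString(name) for name in names}

def diff_snapshots(before, after):
    """Return {name: (old, new)} for every variable whose value changed."""
    return {name: (before[name], after[name])
            for name in before if before[name] != after[name]}
```

If the diff always comes back empty, that would support the suspicion above: the adaptation lives in internal per-instance templates, not in user-visible parameters, so there is nothing to copy out and set manually.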


Thanks

Lorenzo
 

b.numbers.png.txt
b.numbers.png
d.numbers.png
d.numbers.png.txt
f.letters.png
f.letters.png.txt
numbers-test2.png
numbers-test2.png.txt