tesseract failing on extremely simple example

Marvin Thielk

Mar 27, 2021, 10:40:46 AM

to tesseract-ocr

I've tried a variety of pre-processing steps and different configs, but this feels like it should be an easy detection task.

I've tried several different psm and oem settings, and even restricted the character set to numerals. Nothing seems to help.
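For anyone wanting to reproduce this kind of search systematically, here is a minimal Python sketch that builds the tesseract command lines for a sweep over page segmentation and engine modes with a character whitelist. The helper names (`build_tesseract_cmd`, `sweep`, `run`), the chosen psm/oem values, and the whitelist `0123456789.%` are assumptions for illustration, not the settings actually tried above; only the standard `--psm`, `--oem`, `-l`, `--tessdata-dir` and `-c tessedit_char_whitelist=` options are used.

```python
import subprocess
from itertools import product


def build_tesseract_cmd(image, psm, oem=None, whitelist=None,
                        lang=None, tessdata_dir=None):
    """Return the argv list for one tesseract run that prints text to stdout."""
    cmd = ["tesseract", image, "-"]  # "-" as output base means stdout
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    if lang:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    cmd += ["--psm", str(psm)]
    if whitelist:
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def sweep(image, psms=(6, 7, 8, 13), oems=(1, 3), whitelist="0123456789.%"):
    """Build one command per (psm, oem) pair; the values here are assumed."""
    return [build_tesseract_cmd(image, psm, oem, whitelist)
            for psm, oem in product(psms, oems)]


def run(cmd):
    """Run a command built above and return the recognized text (needs tesseract on PATH)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout.strip()
```

Calling `run(cmd)` for each entry of `sweep("717.png")` and printing the results side by side makes it easier to see whether any configuration gets close before deciding to retrain.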

Is the next step to re-train it?

Version info, if it helps:

tesseract v5.0.0-alpha.20201127

leptonica-1.78.0

libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX2

Found AVX

Found FMA

Found SSE

Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5

Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0

717.png

Shree Devi Kumar

Mar 27, 2021, 1:50:46 PM

to tesseract-ocr

Do you have the font used in the sample?

Do you only need to recognise numbers in it?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1bb67d51-2bd3-4d4e-9ba1-8b39b7f3ee43n%40googlegroups.com.

Marvin Thielk

Mar 27, 2021, 4:44:35 PM

to tesseract-ocr

I do have the font available as a ttf file. It is probably copyright protected, but I could post it if it would be useful.

No, I need to recognize both letters and numbers. I've been able to extract text from other regions of the images; it's just this region of numbers, periods and percent signs that fails.

Thanks,

~Marvin

Shree Devi Kumar

Mar 28, 2021, 2:16:46 PM

to tesseract-ocr

Fine-tuning with the font will help.

I retrained using the "Oleo Script Swash Caps Bold" font, whose numerals are similar to those in the test image, and the numbers are now recognized.

(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 717-300.png -
V7
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 717-300.png - --tessdata-dir /home/ubuntu/tesstrain/data/ -l engtuned
Failed to load any lstm-specific dictionaries for lang engtuned!!
717

The fine-tuned traineddata file is attached.
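The difference between the two runs above can be put into numbers with the character error rate (CER), the edit distance between the OCR output and the expected text divided by the length of the expected text. The sketch below is a plain Python illustration; it assumes the ground truth for the sample is "717" (inferred from the file name and the fine-tuned output), and the function names are made up for this example.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Assumed ground truth "717"; outputs taken from the two runs in this post.
print(cer("717", "V7"))   # default model: 2 edits over 3 characters
print(cer("717", "717"))  # fine-tuned model: 0.0
```

With the default model the output "V7" needs two edits to become "717", so the CER is 2/3; the fine-tuned model scores 0 on this sample.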


engtuned.traineddata

marvin thielk

Mar 30, 2021, 10:07:34 PM

to tesser...@googlegroups.com

Oops, I missed this delivery failure. The ttf file is too large to attach because it contains Asian characters. I can upload it somewhere if you're interested, but I plan on training a model myself for my own edification. Original message below:

This is awesome, thank you so much!

What hyperparameters did you use for training? How many pages and epochs?

Which model did you start with? Your file seems smaller than other eng.traineddata files.

Thanks,

~Marvin


--

Marvin Thielk

Neuroscience PhD candidate at UCSD


Shree Devi Kumar

Mar 31, 2021, 2:56:47 AM

to tesseract-ocr

I fine-tuned from eng.traineddata, using about 200 text lines from the training text and 1100 iterations, with a CER of 0.01. The resulting model is small because it does not include the dictionary files and is compressed to a fast (integer) model.
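For readers who want to try a similar run, here is a rough Python sketch that assembles the two commands such a workflow typically involves. It assumes the tesstrain Makefile was used (the tessdata path in the earlier transcript suggests this, but it is not stated) and that the integer conversion is done with `lstmtraining --stop_training --convert_to_int`. All paths are hypothetical placeholders, and the exact settings used for engtuned.traineddata beyond "eng" as the start model and 1100 iterations are not known.

```python
def build_finetune_cmd(model_name, start_model, tessdata, max_iterations):
    """argv for a tesstrain fine-tuning run (to be executed inside the tesstrain checkout)."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",    # name of the new model, e.g. engtuned
        f"START_MODEL={start_model}",  # model to continue from, e.g. eng
        f"TESSDATA={tessdata}",        # directory holding the start model's traineddata
        f"MAX_ITERATIONS={max_iterations}",
    ]


def build_convert_to_int_cmd(checkpoint, traineddata, output):
    """argv that turns a float checkpoint into a smaller integer ("fast") traineddata."""
    return [
        "lstmtraining",
        "--stop_training",
        "--convert_to_int",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", output,
    ]


# Hypothetical paths, shown only to illustrate the shape of the commands.
finetune = build_finetune_cmd("engtuned", "eng", "/path/to/tessdata_best", 1100)
convert = build_convert_to_int_cmd(
    "data/engtuned/checkpoints/engtuned_checkpoint",
    "data/engtuned/engtuned.traineddata",
    "data/engtuned_fast.traineddata",
)
print(" ".join(finetune))
print(" ".join(convert))
```

Fine-tuning needs a float start model (the tessdata_best variant), since the integer models cannot be trained further. tesstrain also documents a `TARGET_ERROR_RATE` variable that stops training once the error rate drops below a threshold; whether the 0.01 above refers to that setting is a guess.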

