tesseract failing on extremely simple example


Marvin Thielk

Mar 27, 2021, 10:40:46 AM
to tesseract-ocr
I've tried a variety of pre-processing steps and different configs without success, even though this feels like it should be an easy recognition task.

I've tried several different psm and oem settings, and even restricting the character set to numerals. Nothing seems to help.
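
For readers who want to reproduce this kind of sweep, here is a minimal sketch. It assumes the attached image is saved as 717.png in the current directory. The list of psm values and the whitelist string are illustrative choices, not the settings actually used in the post. The `TESSERACT` variable and the `ocr` helper are hypothetical names introduced only for this example.

```shell
#!/bin/sh
# Sketch: run the same image through several page segmentation modes
# with the character set restricted to digits, '.' and '%'.
# Assumptions: IMAGE, WHITELIST and the psm list are illustrative.
TESSERACT=tesseract
IMAGE="717.png"
WHITELIST="0123456789.%"

# Run one OCR pass with the LSTM engine (--oem 1) and the given psm.
# "-" sends the recognized text to standard output.
ocr() {
    "$TESSERACT" "$IMAGE" - --oem 1 --psm "$1" \
        -c "tessedit_char_whitelist=$WHITELIST"
}

# 6 = uniform block, 7 = single line, 8 = single word,
# 10 = single character, 13 = raw line.
for psm in 6 7 8 10 13; do
    echo "== psm $psm"
    # Only run when tesseract and the image are actually present.
    if command -v "$TESSERACT" >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
        ocr "$psm" 2>/dev/null || echo "(tesseract returned an error)"
    fi
done
```

If every mode returns the same wrong text, segmentation is probably not the problem and the model itself is misreading the glyphs, which points toward fine-tuning.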

Is the next step to re-train it?

version info if it helps:
tesseract v5.0.0-alpha.20201127
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
 Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0
717.png

Shree Devi Kumar

Mar 27, 2021, 1:50:46 PM
to tesseract-ocr
Do you have the font used in the sample?
Do you only need to recognise numbers in it?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1bb67d51-2bd3-4d4e-9ba1-8b39b7f3ee43n%40googlegroups.com.

Marvin Thielk

Mar 27, 2021, 4:44:35 PM
to tesseract-ocr
I do have the font available as a ttf file. It is probably copyright protected, but I could post it if it would be useful.
No, I need to recognize letters and numbers. I've been able to extract text from other regions of the images; it's just this region of numbers, '.' and '%' characters that fails.

Thanks,
~Marvin

Shree Devi Kumar

Mar 28, 2021, 2:16:46 PM
to tesseract-ocr
Fine-tuning with the font will help.

I retrained using the "Oleo Script Swash Caps Bold" font, which has numerals similar to those in the test image. The numbers are recognized now.
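
The thread does not say how the training lines were rendered. One possible way, sketched here under stated assumptions, is to write each line of a training text to its own .gt.txt file and render it with text2image. The font name is taken from the message above. The file names, the fonts directory, the page size and the `prepare_lines` helper are hypothetical.

```shell
#!/bin/sh
# Sketch: one ground-truth text file (and, if text2image is installed,
# one rendered image) per training line.
# Assumptions: paths and page size are illustrative, not from the thread.
FONT="Oleo Script Swash Caps Bold"
FONTS_DIR="$HOME/.fonts"

# prepare_lines TEXT_FILE OUT_DIR MAX_LINES
prepare_lines() {
    text_file=$1; out_dir=$2; max=$3
    mkdir -p "$out_dir"
    n=0
    # The "|| [ -n ... ]" keeps a final line that lacks a newline.
    while { IFS= read -r line || [ -n "$line" ]; } && [ "$n" -lt "$max" ]; do
        [ -n "$line" ] || continue          # skip empty lines
        n=$((n + 1))
        base="$out_dir/line_$n"
        printf '%s\n' "$line" > "$base.gt.txt"
        if command -v text2image >/dev/null 2>&1; then
            # Writes $base.tif and $base.box next to the .gt.txt file.
            text2image --text="$base.gt.txt" --outputbase="$base" \
                --font="$FONT" --fonts_dir="$FONTS_DIR" \
                --xsize=3600 --ysize=480 --max_pages=1 </dev/null \
                || echo "text2image failed for $base" >&2
        fi
    done < "$text_file"
    echo "prepared $n lines in $out_dir"
}

# Example call, guarded so the script is harmless without the input file.
if [ -f training_text.txt ]; then
    prepare_lines training_text.txt data/engtuned-ground-truth 200
fi
```

The output directory name follows the `data/MODEL_NAME-ground-truth` layout that tesstrain uses by default, so the lines can be picked up by a later training run.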

(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 717-300.png -
V7
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 717-300.png - --tessdata-dir /home/ubuntu/tesstrain/data/   -l engtuned
Failed to load any lstm-specific dictionaries for lang engtuned!!
717

The fine-tuned traineddata file is attached.

engtuned.traineddata

marvin thielk

Mar 30, 2021, 10:07:34 PM
to tesser...@googlegroups.com
Oops, I missed this delivery failure. The ttf file is too large to attach because it contains Asian characters. I can upload it somewhere if you're interested, but I plan on training a model myself for my own edification. Original message below:

This is awesome, thank you so much!

What hyperparameters did you use for training? Number of pages? Epochs?

Which model did you start with? Your file seems smaller than other eng.traineddata files.

Thanks,
~Marvin



--
Marvin Thielk
Neuroscience PhD candidate at UCSD

Shree Devi Kumar

Mar 31, 2021, 2:56:47 AM
to tesseract-ocr
I fine-tuned from eng.traineddata, using about 200 text lines from the training text and 1100 iterations, reaching a CER of 0.01. The resulting model is small because it does not include the dictionary files and is compressed to a fast (integer) model.
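
A fine-tuning run along these lines could be started from a tesstrain checkout roughly as follows. This is a sketch, not the exact command used above. The model names and the iteration count come from the thread. The `TESSDATA` location and the `MAKE` and `fine_tune` names are assumptions. It also assumes the start model is the float eng.traineddata from tessdata_best, which tesstrain expects for fine-tuning. Converting the result to an integer model is not covered here.

```shell
#!/bin/sh
# Sketch: fine-tune eng into engtuned with tesstrain's Makefile.
# Assumptions: run inside a tesstrain checkout, with ground truth in
# data/engtuned-ground-truth and a tessdata_best eng.traineddata
# available under $TESSDATA (hypothetical path).
MAKE=make
MODEL_NAME="engtuned"
START_MODEL="eng"
TESSDATA="$HOME/tessdata_best"
MAX_ITERATIONS=1100

# Invoke the tesstrain "training" target with the settings above.
fine_tune() {
    "$MAKE" training \
        "MODEL_NAME=$MODEL_NAME" \
        "START_MODEL=$START_MODEL" \
        "TESSDATA=$TESSDATA" \
        "MAX_ITERATIONS=$MAX_ITERATIONS"
}

# Only start training when the expected layout is present.
if [ -f Makefile ] && [ -d "data/$MODEL_NAME-ground-truth" ]; then
    fine_tune
else
    echo "tesstrain Makefile or ground-truth directory not found; nothing run"
fi
```

On success tesstrain writes `data/engtuned.traineddata`, which matches the `--tessdata-dir /home/ubuntu/tesstrain/data/ -l engtuned` invocation shown earlier in the thread.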
