Question regarding Telugu: character missing in OCR output; how to fix?

ravi kumar

Aug 11, 2023, 1:07:38 PM

to tesseract-ocr

Hi,

I am new to this program and not sure how or where to start fixing this.

I have attached an image that I used for testing Tesseract, along with the hOCR output file, to help trace a missing character. Can someone interpret them and guide me to a fix?
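For readers who want to inspect an hOCR file themselves, here is a minimal sketch in Python (standard library only). It assumes the usual Tesseract hOCR layout, where each word is a `span` with class `ocrx_word` and a `title` attribute that carries an `x_wconf` confidence value. The class name `HocrWordParser`, the function `low_confidence_words`, and the threshold of 60 are illustrative choices, not part of Tesseract.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, confidence) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []        # list of (text, confidence or None)
        self._in_word = False  # True while inside an ocrx_word span
        self._depth = 0        # nested <span> depth inside the word
        self._conf = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._in_word:
            if tag == "span":
                self._depth += 1
            return
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = re.search(r"x_wconf\s+([\d.]+)", attrs.get("title") or "")
            self._conf = float(match.group(1)) if match else None
            self._in_word = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_word:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if not self._in_word or tag != "span":
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._chunks).strip(), self._conf))
        self._in_word = False


def low_confidence_words(hocr_text, threshold=60):
    """Return the words whose confidence is below `threshold` (assumed cutoff)."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return [(w, c) for w, c in parser.words if c is not None and c < threshold]
```

Calling `low_confidence_words(open("untitle1.hocr", encoding="utf-8").read())` would list the words Tesseract was least sure about, which is often where a missing or substituted character shows up.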

TIA,

Ravi Kumar.

untitle1.hocr

Untitled1.jpg

Zdenko Podobny

Aug 17, 2023, 4:55:47 AM

to tesser...@googlegroups.com

Please provide details of what you are doing, including the Tesseract version, the OS, and which tessdata you used.

Make sure you read the Tesseract documentation, and please also say which suggested solution you tried and which character is missing (not everybody is familiar with Telugu).
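One way to collect the details requested above is a small Python script that calls the `tesseract` command line. This is only a sketch: it assumes `tesseract` is on the PATH and uses its `--version` and `--list-langs` options; the function names `run_tesseract` and `environment_report` are made up for this example.

```python
import platform
import shutil
import subprocess


def run_tesseract(args):
    """Run `tesseract` with the given arguments and return its output."""
    exe = shutil.which("tesseract")
    if exe is None:
        return "tesseract not found on PATH"
    result = subprocess.run([exe, *args], capture_output=True, text=True)
    # Depending on the build, output may go to stdout or stderr.
    return (result.stdout + result.stderr).strip()


def environment_report():
    """Build a plain-text report suitable for pasting into a forum post."""
    return "\n".join([
        f"OS: {platform.platform()}",
        "--- tesseract --version ---",
        run_tesseract(["--version"]),
        "--- tesseract --list-langs ---",
        run_tesseract(["--list-langs"]),
    ])


if __name__ == "__main__":
    print(environment_report())
```

The report does not say where the traineddata files came from (for example tessdata, tessdata_best, or tessdata_fast), so that still has to be stated by hand.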

Zdenko

On Fri, Aug 11, 2023 at 19:07, ravi kumar <rev...@gmail.com> wrote:

Des Bw

Sep 8, 2023, 1:23:02 PM

to tesseract-ocr

I am in the same boat. I am using the latest version of Tesseract (5.3) on a Mac. The guide describes a way to add missing characters by fine-tuning, but it has many steps and is very difficult to follow. I couldn't wrap my head around it and gave up after a couple of attempts.

The guide is "How to train Tesseract 4.00 | tessdoc" (tesseract-ocr.github.io); the section is "Fine Tuning for ± a few characters".

- Fine-tuning from the existing .traineddata with the usual methods does not add the missing characters.

- I have tried different ways to fine-tune: increasing and decreasing the iterations, increasing and decreasing the number of lines, feeding many lines containing the missing characters, etc., to no avail.
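A possible first diagnostic, before more fine-tuning attempts, is to check whether the missing character is in the model's character set at all. The sketch below assumes the model was unpacked with the training tool `combine_tessdata -u`, which writes an `lstm-unicharset` component, and that this file has the usual text layout (a count on the first line, then one entry per line whose first field is the character). The function names and the file path in the usage note are hypothetical.

```python
def load_unichars(path):
    """Return the set of entries listed in a Tesseract unicharset file.

    Assumed layout: an entry count on the first line, then one entry per
    line whose first space-separated field is the character itself.
    """
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def chars_not_in_model(text, unichars):
    """List characters of `text` that are not single entries in `unichars`."""
    wanted = {ch for ch in text if not ch.isspace()}
    return sorted(wanted - unichars)
```

For example, `chars_not_in_model("ሡ", load_unichars("amh.lstm-unicharset"))` returns a non-empty list if the character is absent. In that case the model cannot output it, however many lines are fed to ordinary fine-tuning, and the character set itself has to be extended, which is what the "± a few characters" section is about. Comparing single code points is a simplification: for scripts such as Telugu, entries may span several code points.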

So, dear Zdenko, can you please tell us how to fine-tune for new characters, in simple (layman's) terms?

Des Bw

Sep 8, 2023, 1:42:31 PM

to tesseract-ocr

If you want to see a practical example, you can look at the image I attached here and the text outputs.

I am working on the Amharic language. The best traineddata produces quite accurate results, except that a couple of characters are completely missing from the output.

In the attached image, you can see the character ሡ (which I circled in red so you can see it). This character never occurs in the OCR output across all documents. It is either dropped completely or replaced by other, slightly similar characters.
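A claim like "never occurs in the output" can be backed with numbers by counting the character in the OCR text and reporting its code point, which also helps readers who do not know the script. A minimal sketch; the function names `describe` and `count_targets` are illustrative only:

```python
import unicodedata
from collections import Counter


def describe(char):
    """Return a label of the form 'U+XXXX <Unicode name>' for one character."""
    return f"U+{ord(char):04X} {unicodedata.name(char, 'UNKNOWN')}"


def count_targets(text, targets):
    """Count how often each target character occurs in OCR output text."""
    counts = Counter(text)
    return {ch: counts.get(ch, 0) for ch in targets}
```

Running `count_targets(ocr_text, ["ሡ", "ሠ"])` on the output for each page shows whether the character is absent everywhere and which look-alike tends to take its place.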

p8.tif
