Question regarding Telugu: character missing in OCR output; how to fix?


ravi kumar

Aug 11, 2023, 1:07:38 PM
to tesseract-ocr
Hi,
I am new to this program and not sure how or where to start fixing this.
I have attached the image I used for testing Tesseract and the hOCR output file, for tracing the missing character. Can someone interpret them and guide me to a fix?

TIA,
Ravi Kumar.
untitle1.hocr
Untitled1.jpg

Zdenko Podobny

Aug 17, 2023, 4:55:47 AM
to tesser...@googlegroups.com
Please provide details of what you are doing, including the Tesseract version, OS, and which tessdata you used.

Make sure you read the Tesseract documentation, and please also say which suggested solution you tried and which character is missing (not everybody is familiar with Telugu).

Zdenko
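
The details requested above can be gathered in one go. The following is a minimal Python sketch, assuming the `tesseract` binary is on the PATH; the helper names `run_tesseract` and `collect_report` are hypothetical and only illustrate the idea. It relies on the standard `--version` and `--list-langs` command-line options.

```python
import platform
import shutil
import subprocess


def run_tesseract(*args: str) -> str:
    """Run `tesseract` with the given arguments and return its output as text."""
    exe = shutil.which("tesseract")
    if exe is None:
        return "tesseract not found on PATH"
    result = subprocess.run([exe, *args], capture_output=True, text=True)
    # Keep both streams, since some builds write this information to stderr.
    return (result.stdout + result.stderr).strip()


def collect_report() -> dict:
    """Collect the OS, the Tesseract version, and the installed languages."""
    return {
        "os": platform.platform(),
        "version": run_tesseract("--version"),
        "languages": run_tesseract("--list-langs"),
    }


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}:\n{value}\n")
```

Pasting the printed report into a reply, together with the name of the missing character, covers most of what is asked for here. Note that the language list does not say whether the files come from tessdata, tessdata_fast, or tessdata_best; that still has to be stated by hand.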


On Fri, Aug 11, 2023 at 19:07, ravi kumar <rev...@gmail.com> wrote:

Des Bw

Sep 8, 2023, 1:23:02 PM
to tesseract-ocr
I am in the same boat. I am using the latest version of Tesseract (5.3) on a Mac. The guide describes a way to add missing characters by fine tuning, but it has many steps and is very difficult to follow; I couldn't wrap my head around it and gave up after a couple of attempts.

The guide is "How to train Tesseract 4.00 | tessdoc" (tesseract-ocr.github.io); the section is "Fine Tuning for ± a few characters".

- Fine tuning from the existing .traineddata using the usual methods does not add the missing characters.
- I have tried different ways to fine tune: increasing and decreasing the iterations, increasing and decreasing the number of lines, feeding many lines containing the missing characters, etc., to no avail.

So, dear Zdenko, can you please tell us how to fine tune for new characters, in simple (layman's) terms?
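
For orientation only, here is a rough Python sketch of the two commands that the "Fine Tuning for ± a few characters" section of the guide appears to revolve around: extracting the LSTM model from the existing .traineddata with `combine_tessdata -e`, then running `lstmtraining` with both `--traineddata` (the new file whose unicharset includes the added characters) and `--old_traineddata` (the original file). All paths and the function names `build_extract_command` and `build_finetune_command` are hypothetical, and the default iteration count is only a placeholder. The sketch only assembles the argument lists; it does not create the new traineddata or the training list file, which the guide covers in its earlier steps.

```python
from pathlib import Path
from typing import List


def build_extract_command(old_traineddata: Path, extracted_lstm: Path) -> List[str]:
    """Command that extracts the LSTM model from an existing .traineddata."""
    return ["combine_tessdata", "-e", str(old_traineddata), str(extracted_lstm)]


def build_finetune_command(
    old_traineddata: Path,
    extracted_lstm: Path,
    new_traineddata: Path,
    train_listfile: Path,
    model_output: Path,
    max_iterations: int = 3600,  # placeholder value, tune for your data
) -> List[str]:
    """Command that continues training with a changed character set."""
    return [
        "lstmtraining",
        "--model_output", str(model_output),
        "--continue_from", str(extracted_lstm),
        # New traineddata: its unicharset must contain the added characters.
        "--traineddata", str(new_traineddata),
        # Original traineddata: lets the trainer map old characters to new ones.
        "--old_traineddata", str(old_traineddata),
        "--train_listfile", str(train_listfile),
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the shape of the calls.
    old = Path("tessdata_best/amh.traineddata")
    lstm = Path("work/amh.lstm")
    print(" ".join(build_extract_command(old, lstm)))
    print(" ".join(build_finetune_command(
        old, lstm, Path("work/amh/amh.traineddata"),
        Path("work/amh.training_files.txt"), Path("work/output/amh_plus"))))
```

If `--traineddata` points at the unchanged original file, the character set stays the same, which would be one possible reason why fine tuning "the usual way" never produces the new characters.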

Des Bw

Sep 8, 2023, 1:42:31 PM
to tesseract-ocr
If you want to see a practical example, you can look at the image I attached here and the text outputs.

I am working on the Amharic language. The best traineddata produces quite accurate results, except that a couple of characters are totally missing from the outputs.

In the attached image, you can see the character ሡ (which I circled in red so you can see it). This character never occurs in the OCR output across all documents. It is either dropped completely or replaced by other, slightly similar characters.
p8.tif
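
One way to confirm which characters are never produced is to compare ground-truth text against the OCR output, character by character. The sketch below is a minimal illustration, assuming you have matching lists of ground-truth strings and OCR strings; the function name `never_recognized` and the sample strings are made up for the example. It counts Unicode code points after NFC normalization, so for scripts with combining marks (such as Telugu) it reports individual code points rather than whole syllables.

```python
import unicodedata
from collections import Counter
from typing import Dict, Iterable


def count_chars(texts: Iterable[str]) -> Counter:
    """Count non-whitespace code points across all texts, after NFC normalization."""
    counts: Counter = Counter()
    for text in texts:
        normalized = unicodedata.normalize("NFC", text)
        counts.update(ch for ch in normalized if not ch.isspace())
    return counts


def never_recognized(ground_truth: Iterable[str], ocr_output: Iterable[str]) -> Dict[str, int]:
    """Return characters that occur in the ground truth but never in the OCR output.

    The value is how often each such character occurs in the ground truth.
    """
    expected = count_chars(ground_truth)
    produced = count_chars(ocr_output)
    return {ch: n for ch, n in expected.items() if produced[ch] == 0}


if __name__ == "__main__":
    # Made-up sample strings, not taken from the attached image.
    truth = ["ሡ ሀ", "ሀ ሡ"]
    ocr = ["ሠ ሀ", "ሀ"]
    for ch, n in never_recognized(truth, ocr).items():
        print(f"U+{ord(ch):04X} {ch}: {n} occurrence(s) in ground truth, 0 in OCR output")
```

The resulting list gives the exact characters and code points to mention when asking for help, and the characters that a fine-tuning run would need to add.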