Training Tesseract for new fonts


Umanda Dikwatta

Oct 6, 2022, 1:37:29 AM
to tesseract-ocr
Hello,

I've been using Tesseract 4.1 with the Sinhala language for some time, and I got good results on most of the images I tried. I trained Tesseract on several fonts, but, as the documentation says, I had to preprocess my images to get good results.

Then I tried Tesseract 5, training on line images (.tif) with their labels (.gt.txt), and used the generated .traineddata file to extract text. That did not give me good results. I obtained the line images with image-processing segmentation in Python. Is it wrong to obtain line images that way?
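
For anyone reproducing this setup, one cheap thing to rule out before blaming the segmentation is a mismatch between line images and labels. The following is a minimal sketch, assuming every line image `NAME.tif` is meant to sit in the same folder as a UTF-8 label file `NAME.gt.txt`; the function name `find_unpaired` is hypothetical and not part of Tesseract.

```python
from pathlib import Path


def find_unpaired(gt_dir):
    """Return (images_without_labels, labels_without_images, empty_labels).

    Assumes the layout described above: NAME.tif next to NAME.gt.txt.
    """
    gt_dir = Path(gt_dir)
    # Strip the suffixes by hand because ".gt.txt" is a double extension.
    images = {p.name[: -len(".tif")] for p in gt_dir.glob("*.tif")}
    labels = {p.name[: -len(".gt.txt")] for p in gt_dir.glob("*.gt.txt")}
    # A label that contains only whitespace gives the trainer nothing to learn.
    empty = sorted(
        stem
        for stem in labels
        if not (gt_dir / f"{stem}.gt.txt").read_text(encoding="utf-8").strip()
    )
    return sorted(images - labels), sorted(labels - images), empty
```

Calling `find_unpaired` on the ground-truth folder should return three empty lists; any name that shows up points to a line that was segmented but never labelled, or the other way round.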

Could someone please explain the possible reason?

Thank you very much

Saman Kurdi

Oct 6, 2022, 1:41:03 AM
to tesser...@googlegroups.com
Hello,

This might help.

Regards.


Umanda Dikwatta

Oct 6, 2022, 3:05:59 AM
to tesser...@googlegroups.com
Thank you very much for the link. Can we use non-Unicode fonts as well? I have attached a Sinhala font that I'm struggling to train.
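
One way to make the non-Unicode question concrete is to look at what the label text actually contains. The sketch below assumes that text typed with a legacy (non-Unicode) Sinhala font is stored as Latin code points, which is an assumption about the attached font and not something this thread confirms. It measures how much of a string falls in the Sinhala Unicode block (U+0D80 to U+0DFF); the function name `sinhala_ratio` is hypothetical.

```python
def sinhala_ratio(text):
    """Fraction of visible characters that lie in the Sinhala Unicode block."""
    # Ignore whitespace and the zero-width (non-)joiners used in conjuncts.
    ignored = {"\u200c", "\u200d"}
    chars = [c for c in text if not c.isspace() and c not in ignored]
    if not chars:
        return 0.0
    in_block = sum(0x0D80 <= ord(c) <= 0x0DFF for c in chars)
    return in_block / len(chars)
```

A label line that is genuinely Unicode Sinhala should score close to 1.0 (digits and Latin punctuation lower it slightly), while a line that only looks like Sinhala because of the font should score close to 0.0.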

Thank you very much

apex_a.pura-042.ttf