Training Tesseract for new fonts

Umanda Dikwatta

Oct 6, 2022, 1:37:29 AM

to tesseract-ocr

Hello,

I have been using Tesseract 4.1 with the Sinhala language for some time, and I have trained it with different fonts. I got good results for most of the images I tried, but, as the documentation says, I had to preprocess the images to get them.

Then I tried Tesseract 5, training on line images (.tif) with their transcriptions as labels (.gt.txt), and used the generated .traineddata file to extract text. That did not give me good results. I obtained the line images with image-processing segmentation in Python. Is it wrong to obtain line images that way?
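One thing that may be worth ruling out in a setup like this is a mismatch between the segmented line images and their labels. The sketch below assumes that each line image name.tif sits next to its transcription name.gt.txt in a single folder; the function name check_ground_truth and the folder layout are illustrative assumptions, not part of Tesseract.

```python
from pathlib import Path


def check_ground_truth(data_dir):
    """Return (missing, empty) for a folder of line images.

    missing: .tif files that have no matching .gt.txt label
    empty:   .gt.txt labels that contain only whitespace
    """
    data_dir = Path(data_dir)
    missing, empty = [], []
    for image in sorted(data_dir.glob("*.tif")):
        # line_001.tif is expected to pair with line_001.gt.txt (assumed layout)
        label = image.with_name(image.stem + ".gt.txt")
        if not label.exists():
            missing.append(image.name)
        elif not label.read_text(encoding="utf-8").strip():
            empty.append(label.name)
    return missing, empty
```

Calling check_ground_truth on the training folder returns two lists. If either is non-empty, some line images would be trained without a usable transcription. This checks only the pairing, not whether each label matches what is actually in its image.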

Could someone please explain the possible reason?

Thank you very much

Saman Kurdi

Oct 6, 2022, 1:41:03 AM

to tesser...@googlegroups.com

Hello,

This might help.

https://www.mdpi.com/2076-3417/11/20/9752

Regards.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/40a95c6f-b459-4937-930f-1eb103bc4f82n%40googlegroups.com.

Umanda Dikwatta

Oct 6, 2022, 3:05:59 AM

to tesser...@googlegroups.com

Thank you very much for the link. Can non-Unicode fonts be used as well? I have attached a Sinhala font that I am struggling to train with.

Thank you very much


apex_a.pura-042.ttf
