Training Tesseract for custom data -- Urgent Help Required

avinash singh

Mar 13, 2021, 2:11:24 PM

to tesseract-ocr

Hello,

We are working on a project for underprivileged kids, and we need to build an OCR for the Malayalam language.

We downloaded some online training data available for the language Malayalam. The current accuracy is around 60%, and we found that few special characters in the language are not picked up by the training data properly.

So we wanted to fine-tune the current training data. We did some research and downloaded jTessBoxEditor to create training data, but we couldn't edit the incorrect characters.

Then we tried QT Box Editor. We were able to edit the incorrect letters, but we couldn't generate the training data through the software.

Finally, we tried Cygwin with the command line to generate the custom data, but we failed to combine the training data.

As this is for an NGO, our company wants to close this project at the currently achieved 60% accuracy. I really wish to complete it, as the English translation is completely wrong. Can someone please guide us on how to train the data?

Any help would be much appreciated.

Thanks in advance.

Shree Devi Kumar

Mar 13, 2021, 7:39:17 PM

to tesseract-ocr

You have not stated the version of tesseract that you are using.
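For reference, the version can be read from the first line printed by `tesseract --version`. Below is a minimal Python sketch for capturing it. The function names are hypothetical, and the parser assumes the banner starts with `tesseract <version>` (some builds prefix the version with `v`).

```python
import re
import subprocess
from typing import Optional


def parse_tesseract_version(banner: str) -> Optional[str]:
    """Return the version token from the first 'tesseract <version>' line, if any."""
    for line in banner.splitlines():
        match = re.match(r"\s*tesseract\s+v?(\S+)", line)
        if match:
            return match.group(1)
    return None


def installed_tesseract_version(executable: str = "tesseract") -> Optional[str]:
    """Run `<executable> --version` and parse the banner; None if it cannot be run."""
    try:
        result = subprocess.run(
            [executable, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # some builds print the banner on stderr
            universal_newlines=True,
            check=False,
        )
    except OSError:
        # The executable is not installed or not on PATH.
        return None
    return parse_tesseract_version(result.stdout)
```

Calling `installed_tesseract_version()` on the machine that produced the 60% result gives the exact string to quote in a report.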

>We downloaded some online training data available for the language Malayalam

You have not mentioned from where you got it. Are these the official traineddata files?

>we found that few special characters in the language are not picked up by the training data properly.

Which characters?

>Current achieved 60% accuracy

With the LSTM engine, better results are expected.

Please share a sample image with its expected result.

You can also try

https://ocr.sanskritdictionary.com/

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/84a6fc1f-300a-4aac-85b8-99c47a7d88f4n%40googlegroups.com.

shree

Mar 15, 2021, 4:49:42 AM

to tesseract-ocr

See the attached image, taken from a screenshot of Malayalam wiki, and the OCRed text produced using the traineddata from tessdata_best, tessdata_fast and tessdata.

To me it seems that recognition is 90+% correct.

malayalam-test-fast.txt

malayalam-test-best-int.txt

malayalam-test.gt.txt

malayalam-test.png

malayalam-test-best.txt
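Figures such as "60%" or "90+%" are easier to compare when they are computed the same way. The following is a minimal Python sketch of one common measure: character accuracy derived from the Levenshtein edit distance between a ground-truth file and an OCR output file. It is an assumption that this matches how either figure in this thread was obtained, and the function names are hypothetical. Both texts are NFC-normalized first, and the distance is counted per Unicode code point, so a wrong Malayalam vowel sign counts as one error.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """1 - (edit distance / ground-truth length), clamped at 0."""
    gt = unicodedata.normalize("NFC", ground_truth)
    ocr = unicodedata.normalize("NFC", ocr_text)
    if not gt:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(gt, ocr) / len(gt))
```

With the attachments above, this would be called on the contents of `malayalam-test.gt.txt` and, for example, `malayalam-test-best.txt`, both read with `encoding="utf-8"`.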

avinash singh

Mar 15, 2021, 10:44:54 AM

to tesseract-ocr

Hello Shree,

Thank you for your reply.

We used Tesseract 4.0 alpha.

The traineddata was taken from the following:

https://github.com/tesseract-ocr/tessdata_best

https://tesseract-ocr.github.io/tessdoc/Data-Files.html

I am sharing a doc with the results from Tesseract 4.0 alpha for the same image you shared, along with the expected results.

Also, please let us know if there is any method to fine-tune for the incorrect characters.

Required Accuracy and current data.docx
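To answer the earlier question of which characters are going wrong, the expected text and the OCR output can be aligned and the mismatches counted. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment. This is an illustrative approach rather than anything used in the thread, and `confusion_counts` is a hypothetical name.

```python
import difflib
import unicodedata
from collections import Counter


def confusion_counts(ground_truth: str, ocr_text: str) -> Counter:
    """Count (expected, recognized) pairs for every span where the texts differ.

    An empty string on the right means the expected text was dropped;
    an empty string on the left means the OCR output added extra text.
    """
    gt = unicodedata.normalize("NFC", ground_truth)
    ocr = unicodedata.normalize("NFC", ocr_text)
    matcher = difflib.SequenceMatcher(None, gt, ocr, autojunk=False)
    counts = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            counts[(gt[i1:i2], ocr[j1:j2])] += 1
    return counts
```

`confusion_counts(expected, recognized).most_common(20)` then lists the most frequent confusions, which is a compact way to report the problem characters and to decide what the fine-tuning text should contain.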

Shree Devi Kumar

Mar 20, 2021, 4:57:57 AM

to tesseract-ocr

Yes, fine-tuning can be done. Please see https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#tutorial-guide-to-lstmtraining
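As a rough outline of what that tutorial describes, fine-tuning has three steps: extract the LSTM model from the existing traineddata, continue training from it, and pack the final checkpoint into a new traineddata. The Python sketch below only builds the three command lines and does not run them. The helper name, the `_finetuned` output names and the iteration count are illustrative assumptions. It also assumes a list file of `.lstmf` training files has already been generated as the tutorial explains.

```python
from pathlib import Path
from typing import List


def finetune_commands(
    base_traineddata: str,
    work_dir: str,
    train_listfile: str,
    max_iterations: int = 400,
) -> List[List[str]]:
    """Build (not run) the command lines for a fine-tuning pass."""
    base = Path(base_traineddata)
    work = Path(work_dir)
    lang = base.stem                       # "mal" for mal.traineddata
    lstm = work / f"{lang}.lstm"
    prefix = work / f"{lang}_finetuned"    # hypothetical output prefix

    # 1. Extract the LSTM model from the existing traineddata.
    extract = ["combine_tessdata", "-e", str(base), str(lstm)]

    # 2. Continue training from the extracted model on the new data.
    train = [
        "lstmtraining",
        "--model_output", str(prefix),
        "--continue_from", str(lstm),
        "--traineddata", str(base),
        "--train_listfile", str(train_listfile),
        "--max_iterations", str(max_iterations),
    ]

    # 3. Convert the last checkpoint into a usable traineddata file.
    finalize = [
        "lstmtraining",
        "--stop_training",
        "--continue_from", f"{prefix}_checkpoint",
        "--traineddata", str(base),
        "--model_output", str(work / f"{lang}_finetuned.traineddata"),
    ]
    return [extract, train, finalize]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths have been checked. The base file should be the `mal.traineddata` from tessdata_best, as that tutorial uses the tessdata_best models as the starting point.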

If you already have scanned images and their box files, you can also try Makefile-based training using the tesstrain repo.
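For the tesstrain route, here is a small Python sketch that checks a ground-truth folder before training and builds the `make` invocation. It assumes the layout described in the tesstrain README: line images paired with `.gt.txt` transcriptions under `data/<MODEL_NAME>-ground-truth`. The function names, the model name `mal_custom` and the restriction to `.png` and `.tif` images are illustrative assumptions, not part of tesstrain.

```python
from pathlib import Path
from typing import List, Tuple

# Assumption: only these image types are checked; adjust to match your data.
IMAGE_SUFFIXES = (".png", ".tif")


def check_ground_truth(gt_dir: str) -> Tuple[List[str], List[str]]:
    """Return (images with a matching .gt.txt, images whose .gt.txt is missing)."""
    paired: List[str] = []
    missing: List[str] = []
    for image in sorted(Path(gt_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        transcript = image.with_name(image.stem + ".gt.txt")
        if transcript.is_file():
            paired.append(image.name)
        else:
            missing.append(image.name)
    return paired, missing


def tesstrain_command(
    model_name: str, start_model: str, tessdata_dir: str, max_iterations: int
) -> List[str]:
    """Build (not run) a `make training` call that fine-tunes from START_MODEL."""
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata_dir}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
```

For example, `check_ground_truth("data/mal_custom-ground-truth")` should report no missing transcriptions before `tesstrain_command("mal_custom", "mal", "/path/to/tessdata_best", 1000)` is run from the tesstrain checkout.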
