What is the langdata repository provided for retraining Tesseract?


Venkatapathy S

Apr 15, 2021, 12:52:32 AM
to tesseract-ocr
Hi,
I want to retrain Tesseract from scratch for a particular language. (I have read as many resources as possible, including the warnings, from the tutorial, GitHub, and this forum.) To begin, and to get familiar with the process, I am starting with English. While going through the langdata files (https://github.com/tesseract-ocr/langdata) for English, I found that the training text contains only 72 lines. Is the training text in the langdata repository just a sample, or is it exactly the set used to train the default eng.traineddata model provided by Tesseract? Can someone help me with this, please?

Regards,
Venkat

Shree Devi Kumar

Apr 15, 2021, 6:16:07 AM
to tesseract-ocr
Use the langdata_lstm repo for LSTM training. It has a larger training text.
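
For anyone who wants to check the size difference locally, here is a minimal Python sketch. It assumes both repositories have been cloned side by side into one directory and that each keeps its text at `<repo>/<lang>/<lang>.training_text`; that layout and the helper names `count_nonempty_lines` and `compare_training_texts` are assumptions for illustration, not part of any Tesseract tooling.

```python
from pathlib import Path


def count_nonempty_lines(path):
    """Return the number of non-blank lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def compare_training_texts(root, lang="eng"):
    """Map each repo name to its training-text line count (None if the file is missing)."""
    counts = {}
    for repo in ("langdata", "langdata_lstm"):
        # Assumed layout: <root>/<repo>/<lang>/<lang>.training_text
        text_file = Path(root) / repo / lang / f"{lang}.training_text"
        counts[repo] = count_nonempty_lines(text_file) if text_file.is_file() else None
    return counts


if __name__ == "__main__":
    # Assumes both repositories were cloned into the current directory.
    print(compare_training_texts("."))
```

A count of `None` means the file was not found at the assumed path, so the layout should be checked before drawing any conclusion from the numbers.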


Venkatapathy S

Apr 16, 2021, 1:27:57 AM
to tesseract-ocr
Thank you, that was helpful. So is it the same training set that was used to create the default traineddata files available in the repo?

Duy Khanh

Jun 20, 2023, 11:08:56 PM
to tesseract-ocr
Hi! Do you have the answer yet? Cause I am currently looking for it :D

On Friday, April 16, 2021 at 12:27:57 UTC+7, venkat...@gmail.com wrote: