What is the langdata repository provided for retraining Tesseract?

Venkatapathy S

Apr 15, 2021, 12:52:32 AM

to tesseract-ocr

Hi,

I want to retrain Tesseract from scratch for a particular language. (I have read as many resources as possible, including the warnings, from the tutorial, GitHub, and this forum.) To begin, and to get familiar with the process, I am starting with English. While going through the langdata files (https://github.com/tesseract-ocr/langdata) for English, I found that the training text contains only 72 lines. Is the training text in the langdata repository provided only as a sample, or is it exactly the set used to train the default eng.traineddata model that ships with Tesseract? Can someone help me with this, please?

Regards,

Venkat

https://sites.google.com/view/venkatapathy/home

Shree Devi Kumar

Apr 15, 2021, 6:16:07 AM

to tesseract-ocr

Use the langdata_lstm repo for LSTM training. It has a larger training text.
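To check the size difference yourself, a minimal sketch along these lines may help. It assumes both repositories are cloned side by side in the current directory and that the English training text sits at eng/eng.training_text in each; the paths and the helper name count_lines are illustrative assumptions, not part of any Tesseract tooling. It counts non-empty lines, so the figure can differ slightly from a raw line count.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of non-empty lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


# Assumed locations of local clones; adjust to wherever you cloned the repos.
candidates = [
    Path("langdata/eng/eng.training_text"),
    Path("langdata_lstm/eng/eng.training_text"),
]

for path in candidates:
    if path.exists():
        print(f"{path}: {count_lines(path)} non-empty lines")
    else:
        print(f"{path}: not found (clone the repo first)")
```

Comparing the two counts only shows which training text is larger; it does not by itself say which text was used to build the released traineddata files.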

Venkatapathy S

Apr 16, 2021, 1:27:57 AM

to tesseract-ocr

Thank you, that was helpful. So is it the same training set that was used to create the default traineddata files available in the repo?

Duy Khanh

Jun 20, 2023, 11:08:56 PM

to tesseract-ocr

Hi! Did you ever get an answer? I am currently looking for it too :D

On Friday, April 16, 2021 at 12:27:57 UTC+7, venkat...@gmail.com wrote:
