I want to retrain Tesseract from scratch for a particular language. (I have read as many resources as I could find, including the warnings, in the tutorial, on GitHub, and on this forum.) To begin, and to get familiar with the process, I am starting with English. While going through the langdata files (https://github.com/tesseract-ocr/langdata) for English, I found that the training text contains only 72 lines. Is the training text in the langdata repository provided only as a sample, or is it exactly the same text that was used to train the default eng.traineddata model shipped with Tesseract? Can someone help me with this, please?