Hi all,
I am working on a project with Tesseract v4.00, and the traineddata file always comes out the same size after training on my own data.
I suspect that I did not follow the steps correctly.
The only data that I provided were:
1. training_text
2. puncs (a reduced version of the general punc file provided in the Tesseract GitHub repository)
3. numbers
4. wordlists (I made several wordlists for different training runs, ranging from 100,000 to 2,000,000 words)
5. font names (I also varied the fonts between training runs, from 1 to 20 fonts)
The steps I followed were:
1. Generated the TIFF files, unicharset and other supporting data using tesstrain.sh
2. Generated the TIFF files, unicharset and other supporting data for evaluation, again using tesstrain.sh
3. Combined the unicharset, wordlists, puncs, numbers and version_str into a starter traineddata using combine_lang_data (I am still not sure what value version_str should have, though)
4. Trained the model using lstmtraining
5. Combined the output files into the final traineddata using lstmtraining --continue_from ...
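For reference, the five steps above can be written down as explicit command lines. The following is only a minimal sketch, not the exact commands used: it builds the argument lists in Python and prints them without running anything. Every path, the language code "eng", the font name and the version string are hypothetical placeholders. The flags shown are the standard ones accepted by tesstrain.sh, combine_lang_data and lstmtraining in Tesseract 4; a run from scratch would additionally need --net_spec, which is left to the caller here via `extra`.

```python
import shlex
from typing import List, Sequence


def tesstrain_cmd(lang: str, langdata_dir: str, tessdata_dir: str,
                  fonts: Sequence[str], output_dir: str) -> List[str]:
    """Steps 1-2: render line images and build the unicharset.

    Called once for the training set and once for the evaluation set.
    """
    return ["tesstrain.sh", "--lang", lang, "--linedata_only",
            "--langdata_dir", langdata_dir, "--tessdata_dir", tessdata_dir,
            "--fontlist", *fonts, "--output_dir", output_dir]


def combine_lang_data_cmd(lang: str, unicharset: str, script_dir: str,
                          words: str, puncs: str, numbers: str,
                          output_dir: str, version_str: str) -> List[str]:
    """Step 3: build the starter traineddata from unicharset and word lists."""
    return ["combine_lang_data", "--input_unicharset", unicharset,
            "--script_dir", script_dir, "--words", words, "--puncs", puncs,
            "--numbers", numbers, "--output_dir", output_dir,
            "--lang", lang, "--version_str", version_str]


def train_cmd(traineddata: str, train_listfile: str, eval_listfile: str,
              model_output: str, extra: Sequence[str] = ()) -> List[str]:
    """Step 4: LSTM training; `extra` carries e.g. --net_spec or --max_iterations."""
    return ["lstmtraining", "--traineddata", traineddata,
            "--train_listfile", train_listfile,
            "--eval_listfile", eval_listfile,
            "--model_output", model_output, *extra]


def finalize_cmd(checkpoint: str, traineddata: str, model_output: str) -> List[str]:
    """Step 5: convert a training checkpoint into the final traineddata file."""
    return ["lstmtraining", "--stop_training", "--continue_from", checkpoint,
            "--traineddata", traineddata, "--model_output", model_output]


if __name__ == "__main__":
    # All names below are placeholders chosen for illustration only.
    cmds = [
        tesstrain_cmd("eng", "langdata", "tessdata", ["Arial"], "train_out"),
        tesstrain_cmd("eng", "langdata", "tessdata", ["Arial"], "eval_out"),
        combine_lang_data_cmd("eng", "train_out/eng/eng.unicharset", "langdata",
                              "my.wordlist", "my.punc", "my.numbers",
                              "starter", "my-version-string"),
        train_cmd("starter/eng/eng.traineddata", "train_out/eng.training_files.txt",
                  "eval_out/eng.training_files.txt", "model/base"),
        finalize_cmd("model/base_checkpoint", "starter/eng/eng.traineddata",
                     "model/eng.traineddata"),
    ]
    for cmd in cmds:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands rather than executing them makes it easy to compare them side by side with what was actually run, in particular which traineddata file is passed in step 5.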
Yet every training run ended with a traineddata file of the same size, 10.5 MB.
Did I do all my steps correctly?
I also tried training with WORD_DAWG_FACTOR in language-specific.sh set to 0 and to 1, because I want the recognized text to match my wordlists 100%. The result was still unsatisfactory: some output words, such as "USISUSISU", are not in my wordlists.
Do you know what the cause is?
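One way to measure how far the output is from a 100% match is to list every recognized token that is missing from the wordlist. The sketch below is only an illustration under stated assumptions: the wordlist has one word per line, matching is case-sensitive, tokens are split on non-word characters, and the sample words other than "USISUSISU" are made up.

```python
import re
from typing import Iterable, List, Set


def load_wordlist(lines: Iterable[str]) -> Set[str]:
    """Build a lookup set from wordlist lines (one word per line, blanks skipped)."""
    return {line.strip() for line in lines if line.strip()}


def out_of_wordlist(ocr_text: str, wordlist: Set[str]) -> List[str]:
    """Return each distinct OCR token not found in the wordlist, in first-seen order."""
    missing: List[str] = []
    seen: Set[str] = set()
    for token in re.findall(r"\w+", ocr_text):
        if token not in wordlist and token not in seen:
            seen.add(token)
            missing.append(token)
    return missing


if __name__ == "__main__":
    # Hypothetical wordlist and OCR output, for illustration only.
    words = load_wordlist(["USA\n", "SUSI\n", "\n"])
    print(out_of_wordlist("USA USISUSISU SUSI USISUSISU", words))
```

Running this over the OCR output of the evaluation images would give a concrete list of out-of-wordlist words to attach to a report like this one.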
I would really appreciate it if anyone could help or suggest a solution.
Thank you!