training/tesstrain.sh \
  --fonts_dir /usr/share/fonts \
  --training_text ../langdata/ara/ara.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang ara \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --fontlist "Arial" \
  --output_dir ~/tesstutorial/aratest

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
./langdata
./langdata/eng
./langdata/ara
./tessdata
./tesseract
./tesseract/tessdata
./tesseract/tessdata/configs/
./tesseract/training
etc
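For what it's worth, here is a minimal sketch of where the two relative paths end up under this layout when commands are run from the tesseract directory. It rebuilds the tree in a scratch directory; the `resolve_from_tesseract` helper and the scratch location are made up for illustration and are not part of the tutorial.

```shell
#!/bin/sh
# Recreate the tutorial layout in a throwaway directory (hypothetical location).
# pwd -P gives the physical path so the prefix stripping below is reliable.
root=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$root/langdata/eng" "$root/langdata/ara" \
         "$root/tessdata" \
         "$root/tesseract/tessdata/configs" "$root/tesseract/training"

# Hypothetical helper: print the absolute directory that a relative path
# points to when the current directory is $root/tesseract.
resolve_from_tesseract() {
    (cd "$root/tesseract" && cd "$1" && pwd -P)
}

sibling=$(resolve_from_tesseract ../tessdata)   # the separate tessdata checkout
nested=$(resolve_from_tesseract ./tessdata)     # tessdata inside the tesseract tree

# Show both results relative to the scratch root.
echo "../tessdata -> ${sibling#"$root"}"
echo "./tessdata  -> ${nested#"$root"}"
```

So the two spellings name two different directories, and which one a given example needs depends on what it expects to find there.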
combine_tessdata -e ../tessdata/ara.traineddata \
  ~/tesstutorial/aratuned_from_ara/ara.lstm
2) In the example above, I can't see why it takes --tessdata_dir, since that seems irrelevant to generating training data.
3) It says the reader should lay out the projects like this:
./langdata
./langdata/eng
./langdata/ara
./tessdata
./tesseract
./tesseract/tessdata
./tesseract/tessdata/configs/
./tesseract/training
etc
and that all the following examples are run from the tesseract directory. In that case I think the examples should pass ../tessdata as --tessdata_dir, not ./tessdata. I mean the examples should be fixed.
4) combine_tessdata -e ../tessdata/ara.traineddata \
  ~/tesstutorial/aratuned_from_ara/ara.lstm
This is explained as extracting the existing LSTM model for Arabic from tessdata, but how? Does combine_tessdata extract the LSTM model because the extension of the second parameter is .lstm?
Another question here: why is the LSTM model bundled into the traineddata? I think the traineddata file mixes the legacy trained model and the LSTM model, and I wonder why they aren't separated. Even if the user only uses LSTM, are both trained models read? (Is it lightweight? Then it might be OK.)
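On 4): as far as I know, combine_tessdata -e picks the component to extract from the suffix of each output path, so an output name ending in .lstm asks for the LSTM model. A rough sketch of that idea in plain shell follows; `component_for` is a made-up helper that only mimics the suffix lookup, and its list is limited to the LSTM-related component names, not the full table in the Tesseract source.

```shell
#!/bin/sh
# Hypothetical helper: mimic how the suffix of an output file name selects
# a traineddata component. Not part of Tesseract; the list is partial.
component_for() {
    suffix=${1##*.}          # text after the last dot in the path
    case "$suffix" in
        lstm|lstm-unicharset|lstm-recoder|lstm-punc-dawg|lstm-word-dawg|lstm-number-dawg)
            echo "$suffix" ;;
        *)
            echo "unknown" ;;
    esac
}

component_for /tmp/aratuned_from_ara/ara.lstm
component_for /tmp/aratuned_from_ara/ara.lstm-unicharset
component_for /tmp/aratuned_from_ara/ara.txt
```

With the real tool, more than one output path can be given after the traineddata file in a single -e call, and each one is matched by its suffix in the same way.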
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2a55760b-371b-483d-b5e2-731110bc83a4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
NOTE: Tesseract 4.00 will now run happily with a traineddata file that contains just lang.lstm. The lstm-*-dawgs are optional, and none of the other files are required or used with OEM_LSTM_ONLY as the OCR engine mode. No bigrams, unichar ambigs or any of the other files are needed or even have any effect if present.
Fine Tune will work if all you want to change is a font, with the same unicharset. This works well for Latin script based languages but not complex scripts.
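A quick way to check the "same unicharset" condition before trying Fine Tune might look like the sketch below. It assumes the two unicharset files have already been extracted from the traineddata files (for example with combine_tessdata -e and an output name ending in .lstm-unicharset). `same_unicharset` and the file names are hypothetical, the demo files are toy content rather than the real unicharset format, and a byte-for-byte comparison is only a conservative check.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the two unicharset files are identical.
same_unicharset() {
    cmp -s "$1" "$2"
}

# Demo with stand-in files (toy content, not the real unicharset format).
tmp=$(mktemp -d)
printf '3\nNULL 0\na 3\nb 3\n' > "$tmp/base.unicharset"
cp "$tmp/base.unicharset" "$tmp/same.unicharset"
printf '4\nNULL 0\na 3\nb 3\nc 3\n' > "$tmp/extended.unicharset"

if same_unicharset "$tmp/base.unicharset" "$tmp/same.unicharset"; then
    echo "identical: Fine Tune is a candidate"
fi
if ! same_unicharset "$tmp/base.unicharset" "$tmp/extended.unicharset"; then
    echo "different: Fine Tune alone is probably not enough"
fi
```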