The setup for running tesstrain.sh is the same as for base Tesseract. Use the --linedata_only option for LSTM training. Note that it is beneficial to have more training text and to make more pages, though, as neural nets don't generalize as well and need to train on something similar to what they will be running on. If the target domain is severely limited, then all the dire warnings about needing a lot of training data may not apply, but the network specification may need to be changed.
Training data is created using tesstrain.sh as follows (note that your fonts location may vary):
training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
Thank you very much. I want to reply to everybody.
training/tesstrain.sh --fonts_dir /usr/share/fonts \
--lang eng \
--linedata_only \
--noextract_font_properties \
--langdata_dir ../langdata \
--tessdata_dir ./tessdata \
--output_dir ~/tesstutorial/engtrain
You should try to follow the above tutorial for training eng.
You need to make sure the correct paths are given for the various directories.
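One way to check the paths before starting a long training run is a small shell function like the sketch below. Note that check_dirs is a hypothetical helper written for this reply, not something that ships with tesstrain.sh, and the output directory is left out of the usage line on the assumption that it may not exist yet.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of tesstrain.sh): print every argument
# that is not an existing directory and return non-zero if any is missing.
check_dirs() {
  local missing=0
  local dir
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "missing directory: $dir" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage with the input paths from the command above;
# adjust them to your own setup:
#   check_dirs /usr/share/fonts ../langdata ./tessdata && echo "paths look OK"
```

If any path is reported as missing, fix it first and only then run tesstrain.sh.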
You should know that tesseract will recognise Korean without training, using existing traineddata.
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-kor
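After installing, you may want to confirm that the Korean data is visible to tesseract. The sketch below is only an illustration: lang_listed is a hypothetical helper, image.png is a placeholder file name, and the 2>&1 is there on the assumption that some versions print the language list to stderr.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the language code given as $1 appears
# as a whole line in the text read from stdin.
lang_listed() {
  grep -qx -- "$1"
}

# Example usage (requires tesseract to be installed):
#   tesseract --list-langs 2>&1 | lang_listed kor && echo "kor is installed"
#
# Then run recognition on your own image (image.png is a placeholder):
#   tesseract image.png out -l kor
```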
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f1de33d0-e0c4-4d65-88b4-57c92562ea8a%40googlegroups.com.
Hi, I'm studying this passage, but I cannot understand the meaning of the flag "--noextract_font_properties". So I looked at the file /tesseract/training/tesstrain.sh.