I am following this guide to retrain Tesseract on my own image dataset (I am
training from real images, not from a text file via tesstrain.sh):
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata
These are my steps to retrain the Tesseract LSTM model:
Step 1: I create my training data (.tif image + .box file) from my images.
I generated the box files with this command: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
Step 2: I correct the box files manually with Qt-box-editor (following this guide: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files).
So now I have these files:
.tif file
.box file
.lstmf file (generated by the command: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] lstm.train)
unicharset file
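Before going further it can help to confirm that every image really has its companion files. The following is only a sketch: the function name check_training_set and the assumption that all files sit in one directory are illustrative and not part of the original steps.

```shell
#!/bin/sh
# Report every .tif that lacks a non-empty matching .box or .lstmf file.
# Assumes (hypothetically) that all training files live in one directory.
check_training_set() {
    data_dir=$1
    missing=0
    for tif in "$data_dir"/*.tif; do
        [ -e "$tif" ] || continue          # glob matched nothing
        base=${tif%.tif}
        for ext in box lstmf; do
            if [ ! -s "$base.$ext" ]; then
                echo "missing or empty: $base.$ext"
                missing=$((missing + 1))
            fi
        done
    done
    # Exit status is 0 only when nothing was reported.
    [ "$missing" -eq 0 ]
}
```

Calling check_training_set . in the training directory prints one line per missing or empty file and returns a non-zero status if any were found.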
Step 3: I create the starter .traineddata with this command:
combine_lang_model --input_unicharset unicharset --script_dir langdata --output_dir output --lang "eng"
I downloaded langdata from here: https://github.com/tesseract-ocr/langdata
Step 4: I extract the LSTM model from the existing traineddata with this command:
combine_tessdata -e /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata eng.lstm
Step 5: I retrain Tesseract (Fine Tuning for ± a few characters: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters) with this command:
lstmtraining --model_output output_model --continue_from eng.lstm \
  --traineddata output_basic/eng/eng.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata \
  --train_listfile eng.training_files.txt \
  --debug_interval -1 --max_iterations 400
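Since the lstmtraining command takes several input paths, a quick check that each one exists can rule out simple path mistakes before training starts. This is a minimal sketch; require_files is a hypothetical helper name, not a Tesseract tool.

```shell
#!/bin/sh
# Name every path that does not exist and return non-zero if any is missing.
require_files() {
    status=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "not found: $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage with the input paths from the command above:
# require_files eng.lstm output_basic/eng/eng.traineddata \
#     /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata \
#     eng.training_files.txt
```

Note that the check only covers existence; it does not validate the contents of any file.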
To repeat, I am trying to retrain Tesseract from real images (not from a text file via tesstrain.sh).
Please let me know if you have any idea how to fix this.
Thank you in advance!
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d08df2e0-ccc3-49bc-90ab-6588f9ab6ef3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Can you share the content of the "eng.training_files.txt" file that the --train_listfile argument refers to? Thanks.
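For reference, the file given to --train_listfile is a plain text list of .lstmf paths, one per line. A sketch of how such a list could be built is shown here; the function name make_train_list and the single-directory layout are assumptions for illustration.

```shell
#!/bin/sh
# Write the path of every .lstmf file in a directory into a list file,
# one path per line, suitable for lstmtraining --train_listfile.
make_train_list() {
    data_dir=$1    # directory holding the .lstmf files (assumed layout)
    list_file=$2   # e.g. eng.training_files.txt
    : > "$list_file"
    for f in "$data_dir"/*.lstmf; do
        [ -e "$f" ] || continue            # glob matched nothing
        printf '%s\n' "$f" >> "$list_file"
    done
    # Succeed only if at least one path was written.
    [ -s "$list_file" ]
}
```

For example, make_train_list . eng.training_files.txt lists the .lstmf files in the current directory; an empty result means no .lstmf files were found there.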