Doubt on "--eval_listfile"


Fanatico

Apr 10, 2018, 10:45:59 AM
to tesseract-ocr
Platform: Mac OS X
Tesseract: 4.0.0-beta.1-69-g10f4

When I execute a command like:

SCROLLVIEW_PATH=~/projects/tesseract/java \
  ~/projects/tesseract/training/lstmtraining \
    --debug_interval 100 \
    --continue_from ~/projects/ocr/training/kortrain/kor_from_full/kor.lstm \
    --traineddata ~/projects/ocr/training/kortrain/new_train/kor/kor.traineddata \
    --append_index 5 \
    --net_spec '[Lfx256 O1c111]' \
    --model_output ~/projects/ocr/training/kortrain/kor_from_full/base \
    --train_listfile ~/projects/ocr/training/kortrain/new_train/kor.training_files.txt \
    --eval_listfile ~/projects/ocr/training/kortrain/eval/kor.training_files.txt \
    --target_error_rate 1 &>~/projects/ocr/training/kortrain/kor_from_full/basetrain.log

I have "--train_listfile", which gives the location of my training files for each font, and "--eval_listfile", which I assume gives the location of the training files used to evaluate the result of the training.

So my questions are:
1 - Why would I train with the fonts "A", "B" and "C" but test with the fonts "D", "E" and "F"?
2 - And if I need to test with the same fonts, why do I need to pass the same file twice?

ShreeDevi Kumar

Apr 10, 2018, 10:52:17 AM
to tesser...@googlegroups.com
To make sure that the model is not overfitted to the training data, your eval set should be different from the training set.

You can use a different text file and different fonts from the training set to check that the model performs well on text and fonts it has not seen before.
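
As a rough illustration of keeping the two sets apart, here is a minimal sketch that writes one list file for training and one for eval from a directory of .lstmf files, holding out the named fonts for eval. It assumes the files are named like kor.FontName.exp0.lstmf; the function name split_lists, the font names and the temporary directory are hypothetical and only there to make the example runnable.

```shell
#!/usr/bin/env bash
# Sketch: build disjoint train/eval list files from .lstmf files.
# Assumption: file names follow <lang>.<Font>.exp<N>.lstmf.
# The font names and the demo directory are hypothetical placeholders.
set -eu

# split_lists DIR TRAIN_LIST EVAL_LIST EVAL_FONT...
# Files whose font is one of EVAL_FONT go to EVAL_LIST, all others to TRAIN_LIST.
split_lists() {
  local dir="$1" train_list="$2" eval_list="$3"
  shift 3
  : > "$train_list"
  : > "$eval_list"
  local f font held_out e
  for f in "$dir"/*.lstmf; do
    [ -e "$f" ] || continue
    # The second dot-separated field of the base name is taken as the font.
    font=$(basename "$f" | cut -d. -f2)
    held_out=0
    for e in "$@"; do
      if [ "$font" = "$e" ]; then held_out=1; fi
    done
    if [ "$held_out" -eq 1 ]; then
      echo "$f" >> "$eval_list"
    else
      echo "$f" >> "$train_list"
    fi
  done
}

# Demo with empty placeholder files instead of real training data.
demo_dir=$(mktemp -d)
for font in FontA FontB FontC FontD; do
  touch "$demo_dir/kor.$font.exp0.lstmf"
done
split_lists "$demo_dir" "$demo_dir/train.txt" "$demo_dir/eval.txt" FontD
echo "train: $(wc -l < "$demo_dir/train.txt" | tr -d ' ') files"
echo "eval: $(wc -l < "$demo_dir/eval.txt" | tr -d ' ') files"
```

The two resulting files could then be passed to --train_listfile and --eval_listfile respectively, so no font appears in both.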

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/532b2514-ff7d-4c2c-998a-d61a2aee653a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Fanatico

Apr 10, 2018, 11:39:54 AM
to tesseract-ocr
I see, thanks for the reply.

Fanatico

Apr 10, 2018, 12:30:05 PM
to tesseract-ocr
One more thought: can I pass only the ".training_text" file as a parameter, like --training_text?

Fanatico

Apr 10, 2018, 12:31:32 PM
to tesseract-ocr
When I asked about passing the ".training_text" file as a parameter, I meant when creating the training data with "training/tesstrain.sh".

ShreeDevi Kumar

Apr 10, 2018, 12:41:55 PM
to tesser...@googlegroups.com
Yes, and you can use different text files for training and eval.
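
A minimal sketch of what that could look like, assuming tesstrain.sh is called once per data set with its own --training_text, --fontlist and --output_dir. Every path, the eval text file name and the font names are hypothetical placeholders, and the run/make_data helpers are only for illustration. By default the script just prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch: generate training and eval data from different text files and fonts.
# All paths, font names and text file names below are hypothetical placeholders.
set -eu

# With DRY_RUN=1 (the default here) commands are printed instead of executed.
# The printed form is for inspection only; argument quoting is not preserved.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" -eq 1 ]; then
    printf '%s ' "$@"
    printf '\n'
  else
    "$@"
  fi
}

TESSTRAIN=~/projects/tesseract/training/tesstrain.sh
FONTS_DIR=/Library/Fonts                      # assumed font location
LANGDATA_DIR=~/projects/langdata              # assumed langdata checkout
TESSDATA_DIR=~/projects/tesseract/tessdata    # assumed tessdata directory
WORK=~/projects/ocr/training/kortrain

# make_data TEXT_FILE OUTPUT_DIR FONT...
make_data() {
  local text="$1" out_dir="$2"
  shift 2
  run "$TESSTRAIN" \
    --fonts_dir "$FONTS_DIR" \
    --lang kor \
    --linedata_only \
    --noextract_font_properties \
    --langdata_dir "$LANGDATA_DIR" \
    --tessdata_dir "$TESSDATA_DIR" \
    --training_text "$text" \
    --fontlist "$@" \
    --output_dir "$out_dir"
}

# Training set: one text file rendered with fonts A, B and C.
make_data "$WORK/kor.training_text" "$WORK/new_train" "Font A" "Font B" "Font C"
# Eval set: a different text file rendered with a font not used for training.
make_data "$WORK/kor.eval_text" "$WORK/eval" "Font D"
```

Each output directory should then contain its own kor.training_files.txt, matching the two list files used in the lstmtraining command at the top of the thread.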



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
