Can it be overtrained?


Jerry Deng

Oct 16, 2017, 1:47:23 PM
to tesseract-ocr
Newbie here; any help is appreciated. I'm using handwriting data to fine-tune the English language model that I extracted from the eng.traineddata file. I prepared the training data as box files and lstmf files. It works on a small test. When I run it on 1500+ lstmf files, it works fine as long as I keep max_iteration under 2200 or so. As soon as I go past some threshold, the model suddenly becomes unusable and outputs only CAPITAL letters with some odd punctuation (and the error rate shoots over 100%). One time it even failed with a segmentation fault. Does this sound like it's running out of memory, or what are the possible causes?

ShreeDevi Kumar

Oct 17, 2017, 10:37:10 AM
to tesser...@googlegroups.com
Yes. As mentioned in the wiki page on 4.0 training, it is very easy to overtrain when fine-tuning with a large number of iterations.

Please read the wiki page for more details.
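
One way to guard against this is to save checkpoints at intervals, measure the error of each on held-out data, and keep the one with the lowest error instead of the last one. Below is a minimal Python sketch of that selection step. It is not part of Tesseract: the function name `pick_checkpoint`, the `blowup_factor` threshold, and the error figures are all hypothetical, chosen only to mirror the "fine until about 2200 iterations, then over 100%" pattern described in the question.

```python
from typing import List, Optional, Tuple

Checkpoint = Tuple[int, float]  # (iteration, eval error in percent)


def pick_checkpoint(
    history: List[Checkpoint], blowup_factor: float = 2.0
) -> Tuple[Checkpoint, Optional[int]]:
    """Return the lowest-error checkpoint and the first iteration that diverged.

    `history` must be in training order. A checkpoint counts as diverged when
    its error exceeds `blowup_factor` times the best error seen so far
    (an assumed heuristic, not a Tesseract rule).
    """
    if not history:
        raise ValueError("history is empty")

    best = history[0]
    diverged_at: Optional[int] = None
    for iteration, error in history[1:]:
        if error < best[1]:
            best = (iteration, error)
        elif diverged_at is None and error > blowup_factor * best[1]:
            diverged_at = iteration
    return best, diverged_at


# Made-up numbers for illustration only; real values would come from
# evaluating each saved checkpoint on data that was not used for training.
history = [(500, 12.0), (1000, 9.5), (1500, 8.1), (2000, 7.9), (2500, 104.0)]
best, diverged_at = pick_checkpoint(history)
print("keep checkpoint:", best)
print("diverged at iteration:", diverged_at)
```

With a record like this, you would stop fine-tuning at or before the best checkpoint rather than raising the iteration count further.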
