lstmeval shows good result but visualized result looks bad

Phuc

Jun 13, 2019, 11:05:40 PM

to tesseract-ocr

Hi

I am training a model using Tesseract's lstmtraining and am confused by the results I get. I wonder if I did anything wrong in the steps below:

1. I created the training data (.box and .tif) following https://github.com/tesseract-ocr/tesseract/issues/2357. Note that each (.box, .tif) pair includes multiple text lines.
2. I ran the training using https://github.com/OCR-D/ocrd-train. Since I already have the .box files, I simply commented out the `generate_line_box.py` line in the Makefile.
3. After training, I used lstmeval to evaluate the model on an evaluation dataset, and the error was not too bad.

But when I take the exact same image from the evaluation dataset and run prediction with the .traineddata, the result is totally different.
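One way to narrow this down is to run both tools explicitly and compare what each one is given. The sketch below only assembles the two command lines; every file and directory name in it is a hypothetical placeholder. `lstmeval` scores the pre-segmented line images stored in .lstmf files, whereas the `tesseract` command performs page segmentation on the whole image first, so the `--psm` value may be worth varying. This is a guess at one possible cause, not something confirmed in this thread.

```python
from typing import List


def lstmeval_cmd(traineddata: str, eval_listfile: str) -> List[str]:
    """Command that evaluates a .traineddata on the .lstmf files in a list file."""
    return ["lstmeval", "--model", traineddata, "--eval_listfile", eval_listfile]


def tesseract_cmd(image: str, tessdata_dir: str, lang: str, psm: int) -> List[str]:
    """Command that OCRs a full image and writes the text to stdout."""
    return [
        "tesseract", image, "stdout",
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--psm", str(psm),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(lstmeval_cmd("data/jpn.traineddata", "data/list.eval")))
    # psm 6 = single uniform block of text, psm 7 = single text line.
    for psm in (6, 7):
        print(" ".join(tesseract_cmd("eval/sample.tif", "data", "jpn", psm)))
```

Each list can be passed to `subprocess.run` as-is. If the output changes a lot between `--psm` values, the difference is more likely in layout analysis than in the trained model.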

I have also attached some of my training data and the visualized result in case anyone wants to take a look.

I would appreciate it if someone could tell me what I did wrong.

Thanks

hoge.zip

shree

Jun 17, 2019, 4:19:17 AM

to tesseract-ocr

Your files have the prefix jpn, so I assume you are training for Japanese, but the image in question has only numbers in it.

Getting good results on eval data but bad results on OCR could be the result of overfitting the model, if you used a small sample and trained for a large number of iterations.

phuc...@gmail.com

Jun 17, 2019, 8:37:51 AM

to tesser...@googlegroups.com

Thanks, shree, for your reply. I see that you are very busy answering a lot of questions here. Thanks again for taking the time to help me.

> Your files have prefix of jpn, so I assume you are training for Japanese, but the image in question has only numbers in it.

Well, I forgot to mention that my model only needs to recognize digits, not all Japanese characters. I used the jpn prefix only because I am working with Japanese documents.

Anyway, from your answer I understand that I am most likely dealing with an overfitting problem, not a problem with how I convert the checkpoint file to a .traineddata file. Am I right? If so, I guess the first thing I should try is to fine-tune your digits model (which I found you shared on GitHub: https://github.com/Shreeshrii/tessdata_shreetest). Correct me if I am wrong.
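For reference, fine-tuning an existing model and then converting the resulting checkpoint to a .traineddata file usually takes three commands. The sketch below assembles them as argument lists. The file names, the output prefix, and the iteration count are hypothetical, and it assumes the character set is unchanged from the base model, so the base .traineddata is reused throughout.

```python
from typing import List


def finetune_cmds(base: str, out_prefix: str, train_list: str,
                  max_iterations: int) -> List[List[str]]:
    """Commands to fine-tune `base` (a .traineddata) on the files in `train_list`."""
    lstm = out_prefix + "_base.lstm"
    return [
        # 1. Extract the LSTM network from the base model.
        ["combine_tessdata", "-e", base, lstm],
        # 2. Continue training from that network on the new data.
        ["lstmtraining",
         "--continue_from", lstm,
         "--traineddata", base,
         "--model_output", out_prefix,
         "--train_listfile", train_list,
         "--max_iterations", str(max_iterations)],
        # 3. Convert the latest checkpoint into a .traineddata file.
        ["lstmtraining", "--stop_training",
         "--continue_from", out_prefix + "_checkpoint",
         "--traineddata", base,
         "--model_output", out_prefix + ".traineddata"],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in finetune_cmds("digits.traineddata", "out/digits_ft",
                             "data/list.train", 400):
        print(" ".join(cmd))
```

The third command is the checkpoint conversion step asked about above; running it on a checkpoint and then evaluating the resulting file with lstmeval is one way to check that the conversion itself is not what changes the output.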

By the way, I have two more questions:

1. About how I generate the training data: since I could not find the right font for my document, I cropped digit images from the data I have and randomly picked cropped digits to compose each training image. Do you think this is a reasonable way to do data augmentation?

2. I generated 2000 samples for training. Is that enough?
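To make the generation procedure in question 1 concrete, here is a minimal sketch of the layout step only: it picks random digit crops, places them left to right on one line, and emits box entries in the `<symbol> <left> <bottom> <right> <top> <page>` form with a bottom-left origin. The crop sizes, gap, and margin are made-up illustration values, the pixel pasting is not shown, and the line-termination conventions that LSTM training expects (see the issue linked in the first post) are not handled.

```python
import random
from typing import Dict, List, Tuple

Size = Tuple[int, int]  # (width, height) of one cropped digit, in pixels


def layout_line(crops: Dict[str, List[Size]], n_digits: int,
                rng: random.Random, gap: int = 4,
                margin: int = 8) -> Tuple[str, Size, List[str]]:
    """Pick n_digits random crops (n_digits >= 1) and lay them out on one line.

    Returns the ground-truth text, the (width, height) of the line image,
    and one box entry per digit.
    """
    picks = []
    for _ in range(n_digits):
        digit = rng.choice(sorted(crops))   # which digit to draw
        size = rng.choice(crops[digit])     # which crop of that digit
        picks.append((digit, size))

    width = sum(w for _, (w, _) in picks) + gap * (n_digits - 1) + 2 * margin
    height = max(h for _, (_, h) in picks) + 2 * margin

    boxes = []
    x = margin
    for digit, (w, h) in picks:
        # All crops sit on a common baseline `margin` pixels above the bottom.
        boxes.append(f"{digit} {x} {margin} {x + w} {margin + h} 0")
        x += w + gap
    return "".join(d for d, _ in picks), (width, height), boxes


if __name__ == "__main__":
    # Hypothetical crop sizes: two crops each for the digits 0, 1 and 2.
    crops = {"0": [(18, 28), (20, 30)], "1": [(10, 28), (12, 30)],
             "2": [(18, 29), (19, 30)]}
    text, size, boxes = layout_line(crops, 6, random.Random(0))
    print(text, size)
    print("\n".join(boxes))
```

Because the text and boxes come from the same picks, the ground truth always matches the image. What the sketch cannot show is whether lines composed this way resemble the real document closely enough in spacing, alignment, and background.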

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a6090eb0-6803-4242-b2e9-9cf27ca65126%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Jun 17, 2019, 12:33:11 PM

to tesser...@googlegroups.com

I don't think you need training to improve the results.

You need to pre-process the image and straighten it. Then use a separate tool to identify each cell of data and OCR each cell on its own. You will get the best results that way.
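A rough sketch of the per-cell approach, assuming the table grid has already been found by some other tool: given the column and row edges in pixels, it computes a crop rectangle for each cell and builds a `tesseract` command that treats a cropped cell as a single text line (`--psm 7`). The edge values, padding, and file names are hypothetical, and the straightening and cropping steps themselves are not shown.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def cell_boxes(col_edges: Sequence[int], row_edges: Sequence[int],
               pad: int = 2) -> List[Box]:
    """Crop rectangles for every cell, shrunk by `pad` to skip the ruling lines."""
    boxes = []
    for top, bottom in zip(row_edges, row_edges[1:]):
        for left, right in zip(col_edges, col_edges[1:]):
            boxes.append((left + pad, top + pad, right - pad, bottom - pad))
    return boxes


def cell_ocr_cmd(cell_image: str, lang: str) -> List[str]:
    """Command that OCRs one cropped cell as a single text line."""
    return ["tesseract", cell_image, "stdout", "-l", lang, "--psm", "7"]


if __name__ == "__main__":
    # Hypothetical grid: two columns and two rows.
    for i, box in enumerate(cell_boxes([0, 100, 200], [0, 50, 100])):
        print(box, " ".join(cell_ocr_cmd(f"cell_{i}.png", "jpn")))
```

Shrinking each rectangle by a few pixels keeps the table's ruling lines out of the crop, since stray lines at the cell border can be misread as characters.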


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com