Evaluation of model trained with generated text against real-world data


Inductiveload

Jul 26, 2021, 1:59:10 AM7/26/21
to tesseract-ocr
Hi,

I am working on training an LSTM model for old-style English printing (i.e. a font somewhat like Caslon, with long s (ſ) and substantial printing defects). I am hoping eventually to submit it to tessdata_contrib.

I have had quite some success with a script that generates line data using a modified version of Adobe Caslon Pro plus some noise generation, and then training on top of the eng model [1]. This is mostly because I do not want to have to extract lines from thousands of images and correct them all first.
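My script is home-grown, but for anyone curious, Tesseract's text2image tool can render synthetic lines in much the same way. A minimal sketch (the font name, fonts directory and file names below are placeholders):

# render one synthetic line image (placeholder paths and font)
text2image \
--text=line_0001.txt \
--outputbase=data/oldeng-ground-truth/line_0001 \
--font='Adobe Caslon Pro' \
--fonts_dir=./fonts

text2image emits a .tif and a .box file; for tesstrain's ground-truth layout you instead pair each line image with a .gt.txt transcription, which is what my generator produces.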

However, because I am training on artificial data while the actual aim is to OCR real images, I would like to be able to evaluate the effects of various parameters more objectively. I am struggling to figure out how to generate the data that lstmeval requires. The inputs I have are a directory of images and text files, laid out the same way as the directory of generated images I am using as training ground truth.

What is the correct way to generate the required data for running lstmeval manually in this case?

Inductiveload

Jul 27, 2021, 8:46:17 AM7/27/21
to tesser...@googlegroups.com
On Mon, 26 Jul 2021 at 06:59, Inductiveload <induct...@gmail.com> wrote:
> What is the correct way to generate the required data for running lstmeval manually in this case?

I did actually figure this out in the end, so in case anyone else runs
into the same problem, and to save anyone trying to answer a solved
question, here's my solution (cross-posted at Stack Overflow [2]).

You can generate the .lstmf files needed for the evaluation like this,
assuming the evaluation ground truth is in
tesstrain/data/eval-ground-truth:

cd tesstrain
make lists MODEL_NAME=eval

This will generate a file data/eval/all-lstmf, which lists all the
.lstmf files that were generated. The file data/eval/list.eval contains
only a subset, because the ground-truth corpus is partitioned into
training and evaluation sets (according to RATIO_TRAIN).
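
For reference, tesstrain expects a ground-truth directory of
image/transcription pairs, along these lines (file names are just
examples):

data/eval-ground-truth/
    line_0001.png      # line image
    line_0001.gt.txt   # matching transcription
    line_0002.png
    line_0002.gt.txt
    ...

If you wanted every line in list.eval rather than a split, you could
presumably run make lists with RATIO_TRAIN=0, but pointing lstmeval at
all-lstmf (below) sidesteps the split entirely.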

You can then run lstmeval:

lstmeval \
--model data/your_model.traineddata \
--eval_listfile data/eval/all-lstmf
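
To score an intermediate checkpoint instead of a packed model, lstmeval
also takes a checkpoint for --model, in which case you additionally
pass the starting traineddata via --traineddata (these paths are
illustrative, following tesstrain's layout):

lstmeval \
--model data/your_model/checkpoints/your_model_0.123_456_7890.checkpoint \
--traineddata data/your_model/your_model.traineddata \
--eval_listfile data/eval/all-lstmf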

This produces something like the following (the mistake below was
deliberately added to the ground truth of one .gt.txt file to provoke
an error for demonstration purposes):

Warning: LSTMTrainer deserialized an LSTMRecognizer!
Truth:TThoſe hypocrites that live amongſt us,
OCR :Those hypocrites that live amongst us,
At iteration 0, stage 0, Eval Char error rate=1.282051, Word error rate=8.333333

If there are no errors (as was actually the case here before I added the mistake), it looks like:

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=0.000000, Word error rate=0.000000

Cheers!

[2] https://stackoverflow.com/questions/68523440/evaluation-of-a-trained-on-generated-images-tesseract-4-lstm-model-against-real