OCR-D training process - High error rate [Tess 4]

Joe

Jul 4, 2018, 10:50:54 AM

to tesseract-ocr

Hi everybody!

I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without success so far. Tesseract and Leptonica are installed by the scripts.

Inspired by the test set provided in that repo, I created pairs of [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 text lines in total).

The attachment contains a sample of my data set, along with the files created by the training process.

My guess is that something is wrong with my data.

Sometimes the char train value increases instead of decreasing, and the final error rate is still too high (about 60%).

That new training process with LSTM is driving me crazy!

I would appreciate it if anyone with experience could take a look at my data set.

Joe.

data.zip

Joe

Jul 4, 2018, 11:03:27 AM

to tesseract-ocr

I forgot to mention:

The *.box files created by OCR-D are not in the same format as described in https://github.com/tesseract-ocr/tesseract/wiki/Making-Box-Files---4.0
I know Tesseract 4 boxes only need to cover a text line instead of individual chars, but in the example given in that link every character box has different values, while in the *.box files created by OCR-D they all have the same values.

Is that a problem?
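
For illustration, a box file of the kind described above (every character carrying the same whole-line coordinates) could be produced roughly like this. This is a minimal Python sketch, not the actual OCR-D script: the function name `line_box_entries`, the handling of combining marks, and the coordinates used for the closing tab entry are assumptions.

```python
import unicodedata


def line_box_entries(text, width, height):
    """Return box-file lines in which every character shares the whole-line box.

    width and height are the pixel dimensions of the line image.
    """
    entries = []
    cluster = ""
    for ch in text:
        # Keep combining marks attached to the preceding base character.
        if cluster and unicodedata.combining(ch):
            cluster += ch
            continue
        if cluster:
            entries.append(f"{cluster} 0 0 {width} {height} 0")
        cluster = ch
    if cluster:
        entries.append(f"{cluster} 0 0 {width} {height} 0")
    # End-of-line marker: a tab entry placed just past the line box (assumed layout).
    entries.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return entries


for entry in line_box_entries("ab", 100, 20):
    print(entry)
```

With this layout, identical coordinates on every character line are expected rather than a sign of a broken file.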

Lorenzo Bolzani

Jul 4, 2018, 11:33:41 AM

to tesser...@googlegroups.com

I had no problems training with the ocr-d boxes. Looking at the TIFFs, the first thing I'd try is adding some white border on the left and right.

For my training I used non-binarized (grayscale) data, and I think that could work better since more information is available.

Are you training from scratch or fine-tuning a model? How many epochs did you do? How long did it run? Maybe you just need to wait longer.

Please, have a look at this thread too:

https://groups.google.com/forum/#!topic/tesseract-ocr/be4-rjvY2tQ

Bye

Lorenzo

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/601364b4-3ebd-4a04-9f6a-3d418ab728ab%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Joe

Jul 4, 2018, 12:13:56 PM

to tesseract-ocr

Thank you for your answer, Lorenzo!

I was following the sample data provided by ocr-d and I realized every tiff in ocrd-testset.zip has no left or right white border. That's why my tiffs are the same way.

Anyway, I'll give it a try with some border space and with non-binarized data.

I'm training from scratch and I used the 10000 iterations given by default by ocr-d (then I tried with 20K/30K but only with slightly better results). The training process takes about 2-3 hours to complete (4-5h with 20K iterations).

This is the best result I got:

After that, with more iterations, the char train value stays almost the same and sometimes ends up higher.

The thread you mentioned only covers fine-tuning, so I'll probably use it later. Thank you once again!

Lorenzo Bolzani

Jul 4, 2018, 12:39:41 PM

to tesser...@googlegroups.com

I suspect 1800 lines may not be enough data for training from scratch, and you are simply overfitting. I think the 5% refers to the evaluation set, with a default split of 80/20, if I remember correctly.
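
To make the train/eval split concrete, here is a minimal Python sketch of a reproducible shuffle-and-split. It is not how ocr-d does it internally: the function name, the seed, and the file names are hypothetical, and the 0.8 ratio only mirrors the 80/20 guess above.

```python
import random


def split_train_eval(paths, train_ratio=0.8, seed=0):
    """Shuffle paths reproducibly and split them into (train, eval) lists."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 1869 text lines from this thread.
lstmf_files = [f"line_{i:04d}.lstmf" for i in range(1869)]
train_files, eval_files = split_train_eval(lstmf_files)
print(len(train_files), len(eval_files))
```

With 1869 lines this leaves only a few hundred lines for evaluation, so the eval error rate can be noisy as well.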

Try this to check the accuracy on the training set and the eval set:

lstmeval --model your-model.traineddata --eval_listfile data/list.train

lstmeval --model your-model.traineddata --eval_listfile data/list.eval

If the error rate on the training set is much lower than on the eval set, like 0.1% or even 2%, you are overfitting: too little data and/or too large a model.

If so, add more varied data (I'd guess at least 10 times as much) and also try some augmentation, although I think you already do.

Lorenzo

Joe

Jul 7, 2018, 10:35:49 AM

to tesseract-ocr

Hi, Lorenzo!

Thank you for your tips!

When I run those check commands I get this:

I'm gathering more data and as soon as I get any result I will share it here.

Have a nice weekend!

Joe.

Lorenzo Bolzani

Jul 7, 2018, 12:41:14 PM

to tesser...@googlegroups.com

I never had this. It's strange that you are getting this now and not during the training.

First, I would check the directory the command is run from, that is, that data/train/...lstmf exists at the correct relative path.

Second I would check the lstmf file size. Then I would inspect the tiff and gt.txt files the lstmf was generated from to see if they are empty, missing, wrong, etc.
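
These checks can be scripted. The following is a minimal Python sketch, assuming the *.gt.txt, *.tif and *.lstmf files of a line share a base name and sit in the same directory; that layout and the function name `find_suspect_lines` are assumptions, so adjust them to wherever your files actually live.

```python
from pathlib import Path


def find_suspect_lines(data_dir):
    """Return {line name: [problems]} for every *.gt.txt found in data_dir."""
    problems = {}
    for gt in sorted(Path(data_dir).glob("*.gt.txt")):
        stem = gt.name[: -len(".gt.txt")]
        issues = []
        # Ground truth that is empty or whitespace-only cannot be trained on.
        if not gt.read_text(encoding="utf-8").strip():
            issues.append("empty ground truth")
        tif = gt.with_name(stem + ".tif")
        if not tif.is_file() or tif.stat().st_size == 0:
            issues.append("missing or empty .tif")
        lstmf = gt.with_name(stem + ".lstmf")
        if not lstmf.is_file() or lstmf.stat().st_size == 0:
            issues.append("missing or empty .lstmf")
        if issues:
            problems[stem] = issues
    return problems
```

Calling `find_suspect_lines` on the ground-truth directory returns an empty dict when every line has non-empty files of all three kinds.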

When I have these doubts I delete the box, lstmf, etc., and let ocr-d recreate everything.

Or maybe there is something wrong with the training data; that is another possible reason for training to improve for a while and then get stuck.

Lorenzo

Lorenzo Bolzani

Jul 8, 2018, 7:28:43 AM

to tesser...@googlegroups.com

About the white border, maybe my suggestion was not so good.

I've seen that sometimes adding a generous white border during recognition helps a lot (both with character recognition and character splitting).

But I'm also seeing that training with a border and doing recognition with a different sized one gives a lot of errors.

I suppose that the white border may somehow compensate for a mismatch between the real data and the training data (or create one).

So it's probably better to train with a very small border (or none?). In any case, use the same border you will use with your real data, or do a little "border augmentation", like 1px or 2px.
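
As a rough idea of what such border augmentation could look like, here is a minimal Python sketch. To stay dependency-free it treats a grayscale image as a plain list of pixel rows (255 = white); in practice you would do the same with an imaging library. The function names and the independent random draw for each side are assumptions.

```python
import random


def pad_sides(rows, left, right, fill=255):
    """Return a copy of a grayscale image (list of pixel rows) with side borders added."""
    return [[fill] * left + list(row) + [fill] * right for row in rows]


def border_augment(rows, max_extra=2, rng=None):
    """Add a random white border of 0..max_extra pixels on each side."""
    if rng is None:
        rng = random.Random()
    left = rng.randint(0, max_extra)
    right = rng.randint(0, max_extra)
    return pad_sides(rows, left, right)
```

Each training line then gets a slightly different border, which may make the model less sensitive to the exact border used at recognition time.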

Bye

Lorenzo

Ramakant Kushwaha

Jul 17, 2018, 1:10:55 PM

to tesseract-ocr

Hi,

I am also trying to train Tesseract 4.0, in my case for handwritten digits. What is the best way to create pairs of [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 text lines in total)? Are you using any specific tool to generate the *.tif and *.gt.txt files?