How to achieve very high fine-tuning accuracy on a particular English font? (requirement: char error rate < 0.1%)


sai sumanth Kalluri

Jul 9, 2019, 2:22:28 AM
to tesseract-ocr
Hi!

I'm trying to teach tesseract to recognize a particularly tricky English font (I do not know the name of the font, and no online tool could identify it either), and I have a very high accuracy requirement. It is completely okay if my model does not generalize to other fonts and works only on this font. Following are the details of what I've done so far.

- I'm using: tesseract 5.0.0-alpha-174-g60b4c
                leptonica-1.78.0
                libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
                Found AVX2
                Found AVX
                Found SSE
- I have approx. 6000 lines of training data; each line has around 12-15 words. I'm guessing around 1 in 50 lines has a mislabelled character (how much does that affect the result?).
- I'm fine-tuning the 'eng.traineddata' best model using this data.
- The training and testing data are both properly scanned document images in jpg format, so I'm assuming no data preprocessing is required.
- Also, when I apply the final trained model to a document with approx. 50 lines of text, I believe the error rate is definitely higher than what lstmeval is telling me.
- I have trained tesseract on this data incrementally from 300 to 6000 iterations, and the best I could achieve was after 4200 iterations: Eval Char error rate=0.70714604, Word error rate=1.922281
- After that it has more or less saturated, and I even suspect overfitting from the kind of errors it's making.
- I need to achieve ~0.1 char error rate. What can be my next steps? (It is possible for me to create more training data if that's an option, but I would prefer something simpler, changing network parameters perhaps?)

(NOTE: The font is indeed very tricky, sometimes even for the human eye, and I have attached a small sample of it to this post.)
Thanks in advance!

(PROBABLY UNNECESSARY DETAIL: full stops (.) and commas (,) are very frequently mislabelled in the training data, but I really don't care about punctuation for my project; I only want accurate detection of the other characters. Should I be worrying about this?)
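One way to check the error rate on a real document independently of lstmeval, and to see how much of it comes from punctuation, is to compute a character error rate by hand. The following is a minimal sketch, assuming the common definition (edit distance divided by the length of the ground truth, as a percentage); it is not necessarily the exact computation lstmeval performs, and the function names are hypothetical.

```python
import string


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def strip_punct(text: str) -> str:
    """Remove ASCII punctuation so it does not count as an error."""
    return text.translate(str.maketrans("", "", string.punctuation))


def char_error_rate(truth: str, ocr: str, ignore_punct: bool = False) -> float:
    """Character error rate in percent, relative to the ground-truth length."""
    if ignore_punct:
        truth, ocr = strip_punct(truth), strip_punct(ocr)
    if not truth:
        return 0.0 if not ocr else 100.0
    return 100.0 * levenshtein(truth, ocr) / len(truth)


if __name__ == "__main__":
    gt = "hello, world."
    out = "hello world"
    print(char_error_rate(gt, out))                     # punctuation counted
    print(char_error_rate(gt, out, ignore_punct=True))  # punctuation ignored
```

Comparing the two numbers on the same lines gives a rough idea of whether the punctuation labels matter for the target of 0.1%.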
test.jpg

sai sumanth Kalluri

Jul 11, 2019, 2:28:43 AM
to tesseract-ocr
Can somebody please give me some advice regarding this?

Shree Devi Kumar

Jul 11, 2019, 3:04:38 AM
to tesser...@googlegroups.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/35abc1cd-552b-405c-85be-9e0af720b04d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

sai sumanth Kalluri

Jul 11, 2019, 3:30:39 AM
to tesseract-ocr
Thanks for the reply but that link does not lead anywhere. Could you please correct it?

Shree Devi Kumar

Jul 11, 2019, 4:34:01 AM
to tesser...@googlegroups.com
Search the forum for Cursive


Lorenzo Bolzani

Jul 11, 2019, 9:28:14 AM
to tesser...@googlegroups.com
Hi, a few things I would try (I never trained on cursive fonts):

- I would use a stable tesseract version (4.1 right now)
- 0.7 is not a very good score for a text this clean
- I think 6000 lines is not much; it is hard to tell if it is enough, as this is not a classic font
- data preprocessing may help, but the sample looks perfectly clean. This is already processed.
- How much testing data did you use? 20%? Real-world accuracy will always be a little worse than testing accuracy because you pick the model that best fits the test dataset. But do not trust your gut on this difference; it's very hard to estimate informally. Make sure the real document is processed in the same way as the training/test data
- do some data augmentation: bold, noise, stretch, skew, blur, tiny rotations, etc. to generate more data (not too much, maybe 3 to 5 times more), and also keep the original data. If you use Python you can use imgaug.
- if you can find the font (it should be possible), add some synthetic data too (again with augmentation). There are online tools to find fonts by samples.
- small label errors are not a big problem if you have a lot of data and if you do not overfit too much. In this case you can first train one model with the current data, then use it to tell you which samples do not match the gt.txt files according to this model. It will likely find most of the mislabelled data. Fix it and then of course train again on the new data. If this is English text you could even run a spell check on the gt.txt files to find some errors.
- restrict the output charset to only the characters you need
- there is some "noise/dust" around the text; probably it is just the jpeg compression. I would apply a simple threshold and save the files as png. Noise should not be a problem if it is present in both the training data and the prediction data, but maybe you are getting this extra noise because you saved the file to disk, and maybe at runtime you won't have it. Maybe tesseract will remove it for you, but if you want to remove a source of doubt, just threshold them.
- check the boxes of the recognized text to understand what is going on (see ocr_boxes.py or maybe hocr output)
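The idea of using a first model to find mislabelled samples could be sketched roughly as follows. This assumes you have already collected, for each training line, the gt.txt text and the text produced by the first model; how you obtain that output is up to you. The function names are hypothetical, and the similarity measure from Python's difflib is just one convenient choice.

```python
from difflib import SequenceMatcher


def mismatch_ratio(truth: str, ocr: str) -> float:
    """0.0 means identical strings, values near 1.0 mean almost nothing matches."""
    return 1.0 - SequenceMatcher(None, truth, ocr, autojunk=False).ratio()


def suspect_samples(pairs):
    """Return names of samples whose OCR output differs from the ground truth.

    pairs: iterable of (name, ground_truth_text, ocr_text).
    The result is sorted so the largest disagreements come first; those are
    the lines most worth checking by hand for labelling errors.
    """
    scored = [(mismatch_ratio(gt, ocr), name)
              for name, gt, ocr in pairs
              if gt != ocr]
    return [name for _, name in sorted(scored, reverse=True)]


if __name__ == "__main__":
    # Hypothetical sample names and texts, for illustration only.
    samples = [
        ("line_0001", "the quick brown fox", "the quick brown fox"),
        ("line_0002", "the quick brown fox", "the quick brovvn fox"),
        ("line_0003", "the quick brown fox", "a completely wrong label"),
    ]
    for name in suspect_samples(samples):
        print(name)
```

A disagreement does not tell you whether the label or the model is wrong, so the flagged lines still need a manual look.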

- Your text has long/tall legs: the body is 35px but it goes up to 120px with the legs. So I think it is important to understand how your lines are cropped. The input size for the LSTM is 48(*), so if you feed it lines 120px tall these are going to be downscaled a lot and the core part will suffer most. So maybe (just speculating) it is better to crop the "legs" and the top a little (see the example). In any case I'd try to understand what images are fed to the NN at training time and at prediction time.
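To put numbers on the downscaling, here is the arithmetic as a small sketch. The 35px body, 120px line height and 48px input size are the figures above; the 60px cropped height is only a hypothetical example of a tighter crop, and the function name is made up.

```python
def scaled_body_height(body_px: float, line_px: float, net_input_px: float = 48) -> float:
    """Height of the letter bodies after the whole line is scaled to the network input height."""
    return body_px * net_input_px / line_px


if __name__ == "__main__":
    # Full line including the long legs: 120px scaled down to 48px.
    print(scaled_body_height(35, 120))  # 14.0
    # Hypothetical tighter crop of 60px around the same 35px body.
    print(scaled_body_height(35, 60))   # 28.0
```

So with the full 120px lines the letter bodies end up only about 14px tall at the network input, while a tighter crop would leave them with twice as many pixels.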

- your text is aligned extremely well; it does not look like something out of a scanner. Is this real scanned text?
- as this is English text, consider doing a dictionary spell check/fix.
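A dictionary fix can be as simple as replacing each unknown word with its closest dictionary entry. The sketch below uses Python's difflib for the fuzzy match; the tiny vocabulary, the 0.8 cutoff and the function name are assumptions for illustration, and a real word list would be much larger.

```python
from difflib import get_close_matches


def correct_word(word: str, vocabulary: list, cutoff: float = 0.8) -> str:
    """Return the word unchanged if it is known, else its closest vocabulary entry.

    If nothing in the vocabulary is similar enough (below the cutoff),
    the word is left as it is rather than guessed.
    """
    if word.lower() in vocabulary:
        return word
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


if __name__ == "__main__":
    # Hypothetical mini vocabulary, for illustration only.
    vocab = ["font", "tesseract", "training"]
    line = "tesseraet training on a new font"
    print(" ".join(correct_word(w, vocab) for w in line.split()))
```

Note that this kind of correction can also introduce errors on names and rare words, so it is worth measuring the error rate with and without it.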

- maybe also consider training from scratch using only a lot of synthetic data with very similar fonts, then fine-tuning with real data (if you have enough time)



(*) According to this page:


the input size for the "fast" model is 36 or 48; I suppose it is 48 for all the "best" models.



Lorenzo



ocr_boxes.py
cropped.jpg

sai sumanth Kalluri

Nov 22, 2019, 9:34:31 AM
to tesser...@googlegroups.com
Hi,
How do I roll back to Version 4.1?

Sumanth
