Not getting results with numbers and currency symbols in tables


Emiliano Isaza Villamizar

Jul 25, 2018, 10:49:26 AM
to tesseract-ocr
Hello,

I'm trying to train Tesseract to accurately extract information from a table. Initially, I run it with pytesseract like this:

pytesseract.image_to_string(img, lang='eng', config='--psm 11 --oem 1 -c tessedit_char_whitelist=0123456789')

I get these results:

ground truth        Tesseract

CN¥6.94             CN#6.94
¥31660.90           ¥31660.90
Ltd                 Lid


I retrained Tesseract with OCR-D: I extracted each cell and wrote the ground truth for 3 tables, which add up to 300 cells (300 labeled images). I ran the training for 15000 iterations and got an error of 0.5%, but now I get worse results: Tesseract doesn't seem to read numbers and basic acronyms. Attached you may find an example of an image used for training.

ground truth        New Tesseract

000426.China        ooo426.cin

How can I improve Tesseract's reading of these unusual characters? I already tried to improve the image quality by transforming the image with OpenCV (cv2); this is an example:


import cv2

# Convert the table image to greyscale first, then apply adaptive thresholding.
img_grey = cv2.cvtColor(atable, cv2.COLOR_BGR2GRAY)
th3 = cv2.adaptiveThreshold(img_grey, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)


Thanks!!

cells1a.gt.txt
cells1a.tiff

Lorenzo Bolzani

Jul 26, 2018, 6:46:44 AM
to tesser...@googlegroups.com


Then check the data/unicharset file to see if everything is OK, i.e. that it contains all the characters you want.
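
If it helps, here is a quick Python sketch to list every character in it (this assumes the usual unicharset layout, where the first line holds the entry count and each following line starts with the glyph):

# Sketch: print every glyph known to the model, so you can confirm ¥, digits, '.' etc. are present.
with open("data/unicharset", encoding="utf-8") as f:
    entries = f.read().splitlines()
for entry in entries[1:]:  # skip the first line, which only holds the entry count
    print(entry.split(" ")[0])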


Also, 15000 iterations are way too many and 300 samples are far too few; if you train too much you'll get worse results.

I usually get the best fine-tuning results from 400 to 2000 iterations. I go higher, up to 20k iterations, only when I have many sample images: a few thousand, with multiple words each.


I do it like this (this is not a complete guide, just to give you the general idea):

- clean the data and data/checkpoints folders (do NOT add -rf, you do not want to wipe out the training data):

rm data/*
rm data/checkpoints/*


(do this only once, when you start a new training session, not after each training step)

- go into the Makefile and fix this (in the "data/list.eval" block, remove the + before $$no):


     tail -n "$$no" $(ALL_LSTMF) > "$@"


then add somewhere at the top:

ITERATIONS=100

and change the max_iterations line to this (do not change the tabs/spaces at the beginning, just replace the number):

--max_iterations $(ITERATIONS)

- now run the training as normal like this:

make training ITERATIONS=100

- when it finishes run this:

lstmeval --model data/YOUR_MODEL.traineddata --eval_listfile data/list.eval

In the last line you'll get something like this:

At iteration 0, stage 0, Eval Char error rate=0.96153846, Word error rate=3.8461538

These are the only values that matter. Take note of these values and the iteration numbers.

Make a backup of the model:

cp data/YOUR_MODEL.traineddata data/YOUR_MODEL.traineddata_100

- Now start the training again with ITERATIONS=200; it will resume from the previous iteration and continue up to 200:

make training ITERATIONS=200

- Run lstmeval again, take note of the values, make a backup, and so on for 300, 400, 500... (a rough sketch of this loop is shown just below).
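
For illustration only (these are not the attached scripts), the train/eval/backup loop could be automated with a Python sketch like this; YOUR_MODEL is a placeholder, and the paths assume the Makefile setup described above:

import shutil
import subprocess

MODEL = "YOUR_MODEL"  # placeholder: use your model name
for iterations in range(100, 1100, 100):
    # Resume training up to the new iteration count.
    subprocess.run(["make", "training", f"ITERATIONS={iterations}"], check=True)
    # Evaluate; read the Char/Word error rates from the last line of the output.
    subprocess.run(["lstmeval",
                    "--model", f"data/{MODEL}.traineddata",
                    "--eval_listfile", "data/list.eval"], check=True)
    # Keep a copy of the model at this iteration count so you can go back to the best one.
    shutil.copy(f"data/{MODEL}.traineddata", f"data/{MODEL}.traineddata_{iterations}")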

You should see the error rate go down for a while, then slow down, and then start to get worse. Use the model where you got the best score.

You can try this, but 300 samples are likely way too few for this to be meaningful.

I'm attaching my training scripts; they should work, but double-check everything.


About thresholding: you probably do not need it. Just increase the contrast a little, without going binary, though you probably do not need that either. And apply the same processing to the training data that you will apply to your real data.
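
For example, with OpenCV a mild linear contrast boost keeps the greyscale image instead of going binary (just a sketch; "table.png" and the 1.3 factor are placeholders):

import cv2

# out = alpha * in + beta, clipped to 0-255; alpha slightly above 1 increases contrast a little.
img_grey = cv2.cvtColor(cv2.imread("table.png"), cv2.COLOR_BGR2GRAY)
contrasted = cv2.convertScaleAbs(img_grey, alpha=1.3, beta=0)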

Two important things, for both training and recognition: use PSM 13 (PSM.RAW_LINE), trim all the white borders, and upscale the image so that the text is 30-50 pixels tall.
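
A rough sketch of the trim-and-upscale step for a single cell image (it assumes dark text on a light background; "cell.png", the 200 threshold and the 40 px target are placeholders):

import cv2
import pytesseract

img = cv2.imread("cell.png", cv2.IMREAD_GRAYSCALE)

# Trim the white borders: take the bounding box of the darker (ink) pixels.
mask = (img < 200).astype("uint8")
x, y, w, h = cv2.boundingRect(cv2.findNonZero(mask))
cropped = img[y:y + h, x:x + w]

# Upscale so the text is roughly 40 px tall (inside the 30-50 px range above).
scale = 40.0 / cropped.shape[0]
resized = cv2.resize(cropped, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

print(pytesseract.image_to_string(resized, lang='eng', config='--psm 13'))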

Again, train with the same processing you'll use for recognition.


Bye

Lorenzo



train-multi.sh
monitor-eval.sh
train.sh

Emiliano Isaza Villamizar

Jul 30, 2018, 11:19:23 AM
to tesseract-ocr
Lorenzo, thank you so much for your help. I did everything step by step and got a very good result. I think what helped me most was upscaling the images. The code I used is in Python; here it is, in case anyone is following the thread:

import PIL
from PIL import Image

baseheight = 40  # target height in pixels (the 30-50 px text height suggested above)
img = Image.open(imagepath)
# Scale the width by the same factor that brings the height to baseheight.
hpercent = baseheight / float(img.size[1])
wsize = int(float(img.size[0]) * hpercent)
img = img.resize((wsize, baseheight), PIL.Image.ANTIALIAS)

I'm a real newbie in bash, so I didn't use your scripts; I kept getting a permission error. Thank you again, Lorenzo!





Lorenzo Bolzani

Jul 31, 2018, 5:30:49 AM
to tesser...@googlegroups.com
I'm happy to hear that and thank you for letting me know. I was wondering if the instructions were just a mess or too long :)


Bye

Lorenzo


Lorenzo Bolzani

Oct 15, 2018, 7:10:47 AM
to tesser...@googlegroups.com

Just a small note (in case someone lands on this thread): I recently found out that PSM 7 and some others work better than PSM 13.

See: https://github.com/tesseract-ocr/tesseract/issues/1778#issuecomment-429527692
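
In pytesseract that is just a different config string, e.g. (sketch; "cell.png" is a placeholder):

import pytesseract
from PIL import Image

# PSM 7: treat the image as a single text line.
print(pytesseract.image_to_string(Image.open("cell.png"), lang='eng', config='--psm 7'))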

kotom...@gmail.com

Mar 23, 2019, 11:11:58 PM
to tesseract-ocr
Hi, I am confused about why upscaling works. Tesseract itself already has a step that prescales the image to a height of 36 pixels.

On Monday, July 30, 2018 at 11:19:23 PM UTC+8, Emiliano Isaza Villamizar wrote:

Zdenko Podobny

Mar 24, 2019, 3:28:14 AM
to tesser...@googlegroups.com
Tesseract is an OCR library, i.e. the user is responsible for image preprocessing.

Zdenko


On Sun, Mar 24, 2019 at 4:12 <kotom...@gmail.com> wrote: