Recognition accuracy is very sensitive to holes in characters

blues

Jun 21, 2018, 11:50:14 PM

to tesseract-ocr

Hi all,

I'm using Tesseract for number plate recognition (via OpenALPR), which passes single characters to Tesseract for recognition.

I found that recognition accuracy is very sensitive to holes in a character.

If the character in the binary image has one or more small holes in it, then it is likely to get a wrong result.

For example, this "0" is falsely recognized as "Z".

But with just a single pixel of difference, which opens the hole in its upper part, it is correctly recognized as "0".

Some more examples are attached.

I cannot predict where the holes are going to be, because they are caused by noise in the image, so I think they should not be added to the training samples.

Is there a way to fix this, to make recognition robust to small noise?

Thank you.

leu.traineddata

6-1.png

6-1-as-7.png

7-1.png

7-1-as-x.png

0-1.png

0-1-as-z.png

0-2.png

0-2-as-q.png

3-1.jpg

3-1-as-h.jpg

5-1.jpg

5-1-as-7.jpg

Lorenzo Bolzani

Jun 22, 2018, 4:06:01 AM

to tesser...@googlegroups.com

I'd try upscaling the images so that one letter is about 40 to 50 pixels tall and see if that helps.
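A minimal sketch of that upscaling step, assuming the character is held as a 2D list of 0/1 pixels. The function name `upscale_to_height` and the default target of 45 pixels are illustrative choices, not something from OpenALPR or Tesseract. It uses plain nearest-neighbour repetition with an integer factor; on real grayscale crops a smooth interpolation (for example `cv2.resize` with `cv2.INTER_CUBIC`) before binarization would probably be the better choice.

```python
def upscale_to_height(image, target_height=45):
    """Nearest-neighbour upscale of a 2D list-of-rows image by an integer factor.

    The factor is the smallest integer that makes the result at least
    target_height rows tall; images that are already tall enough are copied.
    """
    if not image or not image[0]:
        raise ValueError("image must be a non-empty 2D list")
    height = len(image)
    factor = max(1, -(-target_height // height))  # ceiling division
    scaled = []
    for row in image:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        scaled.extend(list(wide_row) for _ in range(factor))
    return scaled


# Hypothetical 4x4 glyph, far too small for reliable recognition.
glyph = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
big = upscale_to_height(glyph, target_height=45)
print(len(big), len(big[0]))
```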

I'd also try a morphological open/erode operation (or a blur/resharpen) to simply fill the holes.
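One way to fill only the small holes, sketched under assumptions: pixels are 1 for character ink and 0 for background, and a "hole" is a background region that is fully enclosed by ink and no larger than a chosen area. The name `fill_small_holes` and the threshold `max_hole_area` are hypothetical. The area limit matters because characters such as "0" or "6" have a legitimate enclosed counter that must stay open. Which morphological operator has the same effect depends on image polarity; with white ink on a black background in OpenCV, a closing (`cv2.morphologyEx` with `cv2.MORPH_CLOSE`) is the usual hole-filling operation.

```python
from collections import deque


def fill_small_holes(image, max_hole_area=4):
    """Return a copy of a 0/1 image with small enclosed background regions set to 1."""
    height, width = len(image), len(image[0])
    result = [list(row) for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if result[y][x] != 0 or seen[y][x]:
                continue
            # Breadth-first search over one 4-connected background region.
            component = []
            touches_border = False
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                if cy in (0, height - 1) or cx in (0, width - 1):
                    touches_border = True
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and result[ny][nx] == 0):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Only enclosed regions below the area limit count as noise holes.
            if not touches_border and len(component) <= max_hole_area:
                for cy, cx in component:
                    result[cy][cx] = 1
    return result


# Hypothetical "0" with a one-pixel noise hole in the top stroke at row 2, column 4.
noisy_zero = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = fill_small_holes(noisy_zero, max_hole_area=4)
```

Here the single-pixel hole is filled while the nine-pixel counter of the "0" is left open, because it exceeds `max_hole_area`. The threshold would need tuning to the actual character size.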

I do not know if there are any special parameters for this kind of problem (which I've encountered too).

In general, adding noise to training data makes the model more robust. You may use custom code or something like imgaug to generate random variations with random white spots and other corruptions.
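As a rough illustration of the "custom code" route, the sketch below punches random one-pixel holes into a clean binary glyph to produce noisy training variants. It assumes 1 is ink and 0 is background; `add_random_holes` and `make_variants` are hypothetical helper names, and the result would still have to be written out as image files in whatever form the training pipeline expects. It does not use imgaug.

```python
import random


def add_random_holes(image, hole_count=2, rng=None):
    """Return a copy of a 0/1 image with hole_count random ink pixels cleared."""
    rng = rng or random.Random()
    ink = [(y, x) for y, row in enumerate(image)
           for x, pixel in enumerate(row) if pixel == 1]
    result = [list(row) for row in image]
    # Sample without replacement so each variant loses exactly hole_count pixels.
    for y, x in rng.sample(ink, min(hole_count, len(ink))):
        result[y][x] = 0
    return result


def make_variants(image, variant_count=5, hole_count=2, seed=0):
    """Generate several independently corrupted copies of one clean glyph."""
    rng = random.Random(seed)  # fixed seed keeps the generated set reproducible
    return [add_random_holes(image, hole_count, rng) for _ in range(variant_count)]


# Hypothetical clean glyph: a solid 5x5 block of ink.
clean = [[1] * 5 for _ in range(5)]
variants = make_variants(clean, variant_count=5, hole_count=2, seed=0)
```

Each variant differs from the clean glyph by exactly two pixels, which mirrors the single-pixel sensitivity described in the original post.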

Bye

Lorenzo

blues

Jun 22, 2018, 5:41:09 AM

to tesseract-ocr

Thanks for your reply, Lorenzo.

I will test more samples to see if it only happens with holes.

If so, I will probably just do a morphological hole filling before OCR as a workaround for now.

BTW, I'm using version 3.x. Is there a chance 4.x handles this issue better?

Lorenzo Bolzani

Jun 22, 2018, 9:41:52 AM

to tesser...@googlegroups.com

I assumed you were using the 4.x version and that you had attached your trained data file. Yes, I expect 4.x to be more robust on these things, and maybe overall, but I've never seen any side-by-side comparison (I'd like to see one).

It's quite easy to try and see. You may also do custom fine-tuning on your data (if you have labeled data).

Lorenzo
