Improve text extraction


Atef Chatty

Jul 20, 2022, 4:33:16 PM
to tesseract-ocr
Hi,
I want to extract information from unclear images. I tried many filters, but they don't help. Here is an example:
This is the input picture: Example.png
Why do I want to extract this information?
I am working on a project to extract information from driver's licenses.
The extraction works well on some images and badly on others.
I tried different image-processing steps to improve the quality of the extraction. They help, but not on all the images, because the images differ in blur, brightness and so on (I tried all kinds of filters but did not find one that works for all the images).
So I calculated the blur and brightness of all the images to find a criterion for choosing a filter, but the results did not show a clear pattern.
Any suggestions for working out which filter I should use?
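
For readers who want to reproduce the blur and brightness measurement mentioned above, here is a rough sketch. The post does not say which metrics were used, so the two below (mean intensity for brightness, variance of a discrete Laplacian for blur) are assumptions, and the function names are made up for illustration. Any thresholds for choosing a filter would have to be tuned on your own images.

```python
import numpy as np

def brightness(gray):
    """Mean pixel intensity of a grayscale image (0 = black, 255 = white)."""
    return float(gray.mean())

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; lower values suggest more blur."""
    g = gray.astype(np.float64)
    # Discrete Laplacian on interior pixels: up + down + left + right - 4 * centre.
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

# Synthetic stand-ins for real scans (assumption: 8-bit grayscale arrays).
sharp = ((np.indices((32, 32)) // 4).sum(axis=0) % 2 * 255).astype(np.uint8)
flat = np.full((32, 32), 128, dtype=np.uint8)

for name, img in (("sharp", sharp), ("flat", flat)):
    print(name, brightness(img), laplacian_variance(img))
```

The checkerboard has many hard edges, so its Laplacian variance is large; the flat image has none, so its score is zero.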

Lorenzo Bolzani

Jul 22, 2022, 4:15:42 AM
to tesser...@googlegroups.com
Hi Atef,
I think your best option is to generate a lot of images as bad as this one and use them for training.

So you take thousands of good images (with the corresponding text) and ruin or blur them in many different ways. That way, from 1,000 good images, for example, you get 5,000 to 10,000 bad ones.

Then you fine-tune the model on these.
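
The degradation step described above could be sketched roughly like this. It is only an illustration: the box blur, the gain/offset ranges and the noise level are arbitrary assumptions rather than values from this thread, and `degrade` and `make_variants` are hypothetical helper names. Each clean image keeps its ground-truth text, so the degraded copies can be used as training pairs.

```python
import numpy as np

def box_blur(gray, k):
    """Blur by averaging each pixel over a k x k window (k odd, k = 1 is a no-op)."""
    pad = k // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    h, w = gray.shape
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def degrade(gray, rng):
    """Return a randomly blurred, re-lit, noisy copy of a clean grayscale image."""
    k = int(rng.choice([1, 3, 5, 7]))       # blur strength (assumed range)
    gain = rng.uniform(0.5, 1.2)            # contrast change (assumed range)
    offset = rng.uniform(-40, 40)           # brightness shift (assumed range)
    noise = rng.normal(0.0, rng.uniform(0, 10), gray.shape)
    out = box_blur(gray, k) * gain + offset + noise
    return np.clip(out, 0, 255).astype(np.uint8)

def make_variants(gray, text, n, seed=0):
    """Produce n degraded copies of one image, each paired with the same text."""
    rng = np.random.default_rng(seed)
    return [(degrade(gray, rng), text) for _ in range(n)]

# Placeholder for a real clean scan and its transcription.
clean = np.full((16, 16), 200, dtype=np.uint8)
pairs = make_variants(clean, "ABC 123", n=5)
print(len(pairs), pairs[0][0].shape, pairs[0][1])
```

With 5 to 10 variants per image, 1,000 clean images give the 5,000 to 10,000 degraded ones mentioned above.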

Pre-processing is a lot of work, and you are not going to get reliably good results from it anyway. If you still want to try, you can use the white border of the image as a reference for the average image brightness and correct the brightness accordingly. Or try CLAHE (contrast-limited adaptive histogram equalization).
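
One possible reading of the white-border idea, as a minimal sketch: measure the mean intensity of the outer frame and rescale the whole image so that the frame becomes white. The border width of 5 pixels, the target value of 255 and the name `normalize_by_border` are assumptions for illustration, and the image is assumed to be grayscale with a border that really is white paper.

```python
import numpy as np

def normalize_by_border(gray, border=5, target=255.0):
    """Rescale intensities so the mean of the outer frame maps to `target`."""
    g = gray.astype(np.float64)
    mask = np.ones(g.shape, dtype=bool)
    mask[border:-border, border:-border] = False   # True only on the frame
    ref = g[mask].mean()
    if ref <= 0:
        return gray.copy()                         # nothing to scale against
    scaled = np.rint(g * (target / ref))
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Synthetic under-exposed card: greyish border (180) with a dark patch (40).
img = np.full((20, 20), 180, dtype=np.uint8)
img[8:12, 8:12] = 40
out = normalize_by_border(img)
print(out[0, 0], out[10, 10])
```

A single global gain like this only corrects overall exposure; uneven lighting across the card is the case where CLAHE (available in OpenCV as `cv2.createCLAHE`) may be the better thing to try.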

First process this image manually in GIMP to see whether this helps and which brightness/contrast settings work best. It's faster to do this by hand. Once you have a reliable pre-processing recipe, try to replicate it in code.


Bye

Lorenzo
