OCR on photos of IDs?

Tom Apeltauer

Feb 21, 2020, 12:12:42 AM

to tesseract-ocr

Greetings everyone,

I am facing quite a challenging task: optical character recognition of data from photos of IDs taken with a smartphone camera. I have tried Tesseract as-is, but the accuracy is only around 40%.

I have started tweaking things: disabling the dictionaries, converting the images to grayscale, and trying different page segmentation modes. However, each setting gives a different accuracy on different photos, so no single configuration works well across the board.
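For readers who want to compare such settings systematically, here is a minimal sketch of how the runs could be scripted. It assumes the `tesseract` command-line tool is installed and on the PATH. The helper names `build_tesseract_cmd` and `sweep_psm` and the chosen modes (4, 6, 11) are illustrative assumptions, not the poster's actual setup. `load_system_dawg` and `load_freq_dawg` are the Tesseract config variables that control the built-in word lists.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, psm=6, use_dictionaries=False, lang="eng"):
    """Return the argv list for one Tesseract run that prints text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    if not use_dictionaries:
        # Switch off the word lists so names and ID numbers are not
        # pulled toward dictionary words.
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd


def sweep_psm(image_path, psm_modes=(4, 6, 11)):
    """Run each page segmentation mode and collect the recognized text.

    Hypothetical helper: requires the tesseract executable to be installed.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    results = {}
    for psm in psm_modes:
        done = subprocess.run(
            build_tesseract_cmd(image_path, psm),
            capture_output=True, text=True, check=True,
        )
        results[psm] = done.stdout
    return results
```

Running `sweep_psm` over a small set of photos with known ground truth would show whether any one mode is consistently better, or whether the variation comes mostly from the photos themselves.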

I am asking you, as experts in the field, whether you have any tips for me. See the example here:

https://drive.google.com/open?id=14PDZlbJ-HNFcHsPlE28cBT5VIxV9ceqW

Don't mind the red parts. I have applied at least some basic "protection". You have seen nothing, obviously.

Thanks!

Tom A.

Ajinkya Bobade

Feb 21, 2020, 3:46:21 AM

to tesser...@googlegroups.com

Hello,

I have solved this problem for multiple clients over the past 3 years. I can walk you through the steps.

You can reach out at ajinkya...@gmail.com

Regards

Ajinkya

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4c9e4681-23dd-435b-a6f2-73ab78a122e7%40googlegroups.com.

Adrian Enders

Feb 21, 2020, 11:17:45 AM

to tesseract-ocr

Would you be willing to share some of these steps here in the group? I would be curious to know what some of your techniques are. I have worked on this problem in the past as well, using other technologies.

- Adrian

Tom Apeltauer

Feb 24, 2020, 4:24:31 AM

to tesseract-ocr

Ajinkya and I basically agree that Tesseract has to be retrained for this specific case. What helped me quite a lot is a walkthrough by Guiem: https://medium.com/@guiem/how-to-train-tesseract-4-ebe5881ff3b7

Some image cropping will probably be needed as well, so that Tesseract sees individual text fields rather than the whole card.
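One possible way to do the cropping is sketched below. It assumes the photo has already been cut down and deskewed so that it shows only the card, and that the image is a Pillow `Image` object. The field names and fractional coordinates in `FIELD_REGIONS` are placeholders invented for illustration and would have to be measured for the actual ID layout.

```python
# Hypothetical field layout as fractions of the card's width and height:
# (left, top, right, bottom). Placeholder values, not a real ID format.
FIELD_REGIONS = {
    "surname": (0.35, 0.20, 0.95, 0.30),
    "given_names": (0.35, 0.32, 0.95, 0.42),
    "document_number": (0.60, 0.05, 0.95, 0.15),
}


def to_pixel_box(region, width, height):
    """Convert a fractional region into a pixel box (left, upper, right, lower)."""
    left, top, right, bottom = region
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def crop_fields(card_image, regions=FIELD_REGIONS):
    """Return one grayscale crop per field.

    card_image is assumed to be a PIL.Image.Image containing only the card.
    """
    width, height = card_image.size
    gray = card_image.convert("L")  # "L" is Pillow's 8-bit grayscale mode
    return {
        name: gray.crop(to_pixel_box(region, width, height))
        for name, region in regions.items()
    }
```

Each crop then holds a single line of text, which suits Tesseract's single-line page segmentation mode (`--psm 7`) better than a full-card image does.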

- Tom
