How do I improve my accuracy with a small set of numbers?

Ryan

May 23, 2015, 2:45:25 AM

to tesser...@googlegroups.com

I have to read sets of numbers from a very large number of cards, and I need good accuracy. Each card has 15 digits and a 4-digit PIN. There's no check digit, but some digits are the same on every card, and the font and spacing are the same. I've attached a sample image below. I tested tesseract on that image and several others like it, and I'm getting pretty poor accuracy, sometimes 80% or less. I know better cropping and getting the rotation correct will help, but I'm still getting poor accuracy after manually cropping them. I also thought about processing the PIN part separately and pulling the crop in closer. This is tricky: the spacing between the 15-digit part and the PIN is consistent, but the whole set of numbers is not located in precisely the same place on each card. I'm sure I can write some code that would use the first number as a reference point and crop the PIN separately and much tighter, but I'd rather not write it if it won't help.

I've already written a little program so I can put a card under the camera, press a key, and it will display the cropped image above the tesseract output so I can manually confirm. I just need to figure out what I need to do to improve tesseract's performance, because so far I haven't had a single card recognized accurately. I expected some difficulty with the background noise around the PIN, but I'm surprised at getting poor recognition even on the first 15 digits. I've got a better camera on order, and I'm going to make a little frame to hold the cards so I'll be able to get perfectly cropped and rotated images, and much better image quality. What else can I do to improve my accuracy in this situation? Is this a case where training would help? I'm open to any idea that can be automated.
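Since the cards have no check digit, the constraints described above (15 digits, a 4-digit PIN, some digits identical on every card) can serve as a cheap automatic sanity check before the manual confirmation step. The following is a minimal sketch of that idea, not something from the thread: the function name `parse_card_text` and the positions and values in `FIXED_DIGITS` are hypothetical placeholders and would need to be read off real cards.

```python
import re

# HYPOTHETICAL layout: 0-based positions of digits that are identical on
# every card, mapped to their expected value. Replace with real values.
FIXED_DIGITS = {0: "6", 1: "0"}

NUMBER_LENGTH = 15  # digits in the card number (from the post)
PIN_LENGTH = 4      # digits in the PIN (from the post)


def parse_card_text(text, fixed_digits=FIXED_DIGITS):
    """Return (number, pin) if the OCR text fits the expected shape, else None."""
    # Drop whitespace and any stray non-digit characters from the OCR output.
    digits = re.sub(r"\D", "", text)
    if len(digits) != NUMBER_LENGTH + PIN_LENGTH:
        return None  # a digit was dropped, duplicated, or misread as a non-digit
    number, pin = digits[:NUMBER_LENGTH], digits[NUMBER_LENGTH:]
    for position, expected in fixed_digits.items():
        if number[position] != expected:
            return None  # a digit that never changes was read differently
    return number, pin
```

A result of `None` flags the card for a retake or manual entry. This catches dropped or extra digits and misreads at the fixed positions only; a wrong digit anywhere else still passes, so it complements rather than replaces the visual confirmation.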

Dmitri Silaev

May 27, 2015, 5:22:30 PM

to tesser...@googlegroups.com

Hi Ryan,

I can suggest the following:

- Use higher resolution and don't use JPEG. At such resolution and compression level you are doomed to poor OCR results because character strokes get literally ruined. It's not clear if your former camera is able to do better but I suspect it is; at the very least it can use a lower JPEG compression level (that is, higher quality). So you probably won't need another cam.

- A fixture... Mmm, I don't think it's necessary. Most of your target text is quite well distinguishable and localizable. If you just provide good lighting and focus and position the camera sensibly, that's enough. The rest can be done by the same old ImageMagick. OTOH, if you're required to process thousands of cards, the fixture would simply be convenient.

- Training. No, it won't help at all. Your digits are very similar to what the stock (English) traineddata files already have in them.

- Cropping. If your typical photos contain a lot of complex surroundings, it's necessary to strip that off. If it's just the card itself, Tess would work well.

- Rotation. No need. Tess can handle it well, even for the skew level you have shown in your image. See my results below.

- PIN. Here you'd probably need to work in the color domain. Scratch leftovers would definitely need to be filtered out. Show us a color variant.
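As a rough illustration of what "working in the color domain" for the PIN could look like, here is a minimal sketch. It rests on an assumption the thread does not confirm: that the printed digits are dark and close to neutral gray, while the scratch leftovers are lighter or tinted. The function name and both thresholds are hypothetical and would need tuning against a real color image. It works on plain nested lists of RGB tuples so it stays independent of any imaging library.

```python
def keep_dark_neutral_pixels(pixels, max_spread=40, max_brightness=110):
    """Binarize an RGB image given as rows of (r, g, b) tuples.

    Returns rows of 0 (kept as ink) or 255 (background). A pixel counts as
    ink only if it is both dark and nearly gray. ASSUMPTION: the digits are
    dark and neutral, the scratch residue is lighter or tinted.
    """
    result = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            spread = max(r, g, b) - min(r, g, b)  # crude saturation measure
            brightness = (r + g + b) // 3         # crude luminance measure
            is_ink = spread <= max_spread and brightness <= max_brightness
            out_row.append(0 if is_ink else 255)
        result.append(out_row)
    return result
```

If the residue turns out to be as dark and gray as the digits themselves, a filter of this kind will not separate them, which is one more reason to look at a color sample first.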

In fact, show your entire unedited source, and maybe a couple of other images too. The community can probably help you better then.

What I have achieved so far:

"2rsz.jpg" - Your source image upscaled 4x. This helps mitigate those destructive JPEG compression artifacts a bit.
>tesseract inet012\2rsz.jpg inet012\2rsz.jpg -psm 7 -c tessedit_char_whitelist=0123456789

Result: "2rsz.jpg.txt"

All number digits are perfect. Don't expect it to work well for the PIN part, though - it requires cleaning.

I used the single line PSM and restricted possible characters by a "whitelist".
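For automating this step, the same invocation can be driven from a script. The sketch below builds the command shown above (single-line page segmentation mode 7 plus a digit whitelist) and reads back the `.txt` file Tesseract writes next to the output base. The function names are hypothetical. The upscaling helper assumes ImageMagick's `convert` is used for the 4x enlargement, which the thread does not state. The `-psm` spelling matches the command above; Tesseract 4 and later spell it `--psm`, hence the `legacy_flags` switch.

```python
import subprocess

DIGITS = "0123456789"


def build_upscale_command(src, dst, percent=400, executable="convert"):
    """ImageMagick call that enlarges src and writes dst.

    ASSUMPTION: ImageMagick is the upscaling tool. Write dst in a lossless
    format such as PNG. On ImageMagick 7 the executable is "magick".
    """
    return [executable, src, "-resize", f"{percent}%", dst]


def build_tesseract_command(image, output_base, legacy_flags=True):
    """Command line for one text line restricted to digits."""
    psm_flag = "-psm" if legacy_flags else "--psm"  # 3.x vs 4.x and later
    return [
        "tesseract", image, output_base,
        psm_flag, "7",  # PSM 7: treat the image as a single text line
        "-c", f"tessedit_char_whitelist={DIGITS}",
    ]


def read_digits(image, output_base, legacy_flags=True):
    """Run Tesseract and return only the digits from <output_base>.txt."""
    command = build_tesseract_command(image, output_base, legacy_flags)
    subprocess.run(command, check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return "".join(ch for ch in handle.read() if ch.isdigit())
```

Only `read_digits` needs Tesseract installed; the two builder functions just assemble argument lists, which keeps them easy to check without running anything.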

Best regards,
Dmitri Silaev
www.CustomOCR.com


2rsz.jpg.txt

2rsz.jpg
