Recognize


Marcel Robitaille

May 28, 2022, 12:37:54 AM
to tesseract-ocr
Hello. I am trying to recognize the last 4 digits of credit card numbers in pictures of receipts. Usually, the number is printed as 16 asterisks followed immediately by the last 4 digits, with no spaces. I have attached an example, cropped from a larger receipt. For security it shows only 2 of the 4 digits, but that is enough to see that the digits are reasonably legible.

The problem is that the output from tesseract for this line is `KKKKKKKEKEKERBQIGL`. I thought I could get around this by specifying `--user-patterns`. I created a file `eng.user-patterns` with the contents `\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\d\d\d\d`. I also tried `****************\d\d\d\d` because I am not sure whether I have to escape `*` or only `\`. I ran this with `tesseract image.jpg output.txt -l eng --user-patterns eng.user-patterns`, but the output does not seem to be affected: that line is still the same gibberish. I also tried a user words file containing the exact last 4 digits I am looking for, but got the same result. I am using tesseract 4.1.1.
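
In case it helps anyone reproduce this, here is a minimal Python sketch that generates the two pattern variants and assembles the same command line as above. The helper names `write_patterns` and `build_command` are made up for illustration and are not part of tesseract; the script only builds the arguments and does not run tesseract itself.

```python
import shlex
from pathlib import Path

# The two pattern variants described above: 16 asterisks (escaped or not)
# followed by four \d digit placeholders.
ESCAPED_PATTERN = r"\*" * 16 + r"\d" * 4
UNESCAPED_PATTERN = "*" * 16 + r"\d" * 4


def write_patterns(path, pattern):
    """Write a single pattern line to the given file and return its path."""
    target = Path(path)
    target.write_text(pattern + "\n", encoding="utf-8")
    return target


def build_command(image, output, patterns_file, lang="eng"):
    """Return the tesseract invocation from this post as an argument list."""
    return [
        "tesseract", str(image), str(output),
        "-l", lang,
        "--user-patterns", str(patterns_file),
    ]


# Print the command for the escaped variant without executing it.
print(shlex.join(build_command("image.jpg", "output.txt", "eng.user-patterns")))
```

To actually run it, the list returned by `build_command` can be passed to `subprocess.run`, assuming tesseract is on the PATH and the patterns file has been written first with `write_patterns`.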

Is there anything I can try besides retraining? This does not seem like such a hard case.

Thanks
credit-card.jpg