Unable to identify simple 6 digit numbers

Rob Shanks

Jun 1, 2016, 9:14:21 AM

to tesseract-ocr

I am trying to use Tesseract to recognise a 6-digit number in scanned documents. Because they are scans, the numbers can be faded, but I know that each one has exactly 6 digits. The documents that were scanned were destroyed long ago, so I cannot get them rescanned.

I have tried whitelisting 0-9, using a user patterns file containing \d\d\d\d\d\d, and setting the page segmentation mode to 8 (psm 8, treat the image as a single word), but nothing seems to improve detection.
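
For reference, the setup described above might look roughly like the sketch below when driven from Python. The function names, file paths, and the final 6-digit check are illustrative assumptions, not part of the original setup. The flag spelling depends on the Tesseract version: 3.x releases take `-psm`, while 4.x and later take `--psm`, and `--user-patterns` is assumed to be available in the installed build.

```python
import re
import subprocess


def build_tesseract_cmd(image_path, patterns_path=None, legacy_psm_flag=False):
    """Build a tesseract command line for a single 6-digit word.

    legacy_psm_flag=True emits "-psm" (Tesseract 3.x spelling);
    otherwise "--psm" (4.x and later) is used.
    """
    psm_flag = "-psm" if legacy_psm_flag else "--psm"
    cmd = ["tesseract", image_path, "stdout"]
    if patterns_path:
        # File containing the pattern \d\d\d\d\d\d (assumed flag support).
        cmd += ["--user-patterns", patterns_path]
    # psm 8 = treat the image as a single word; whitelist digits only.
    cmd += [psm_flag, "8", "-c", "tessedit_char_whitelist=0123456789"]
    return cmd


def extract_six_digits(ocr_text):
    """Return the OCR result only if it is exactly six digits, else None."""
    compact = re.sub(r"\s+", "", ocr_text)  # drop spaces and newlines
    return compact if re.fullmatch(r"\d{6}", compact) else None


def read_number(image_path, patterns_path=None):
    """Run tesseract (must be installed) and validate its output."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, patterns_path),
        capture_output=True, text=True, check=True,
    )
    return extract_six_digits(result.stdout)
```

Rejecting anything that is not exactly six digits at least flags the failed reads for manual review, even if it does not improve recognition itself.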

Does anyone have any suggestions?

Thanks

Rob

defectsheet1.pdf

defectsheet2.pdf

Tom Morris

Jun 2, 2016, 6:58:24 PM

to tesseract-ocr

I'd play with modifying the images by hand to see what types of operations improve performance. The first two things I'd try would be:

1. Line removal - see if cropping out or removing the thin horizontal and vertical lines improves performance

2. Dilation or other image operators to "fill in" the dropouts (i.e. white areas) in the numerals
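
To make the two ideas concrete, here is a minimal pure-Python sketch on a binarised image stored as a list of rows, where 1 means ink and 0 means background. The function names and the `min_len` threshold are assumptions for illustration. In practice you would more likely use an image library's morphology functions, and the thresholds would need tuning by eye on the real scans.

```python
def _clear_long_rows(img, min_len):
    """Zero out horizontal ink runs that are at least min_len pixels long."""
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= min_len:  # long run: treat as a ruled line
                    for i in range(start, x):
                        out[y][i] = 0
            else:
                x += 1
    return out


def _transpose(img):
    return [list(col) for col in zip(*img)]


def remove_lines(img, min_len):
    """Remove long horizontal and vertical runs (step 1, line removal)."""
    no_h = _clear_long_rows(img, min_len)
    no_v = _transpose(_clear_long_rows(_transpose(img), min_len))
    # Keep a pixel only if it survived both passes.
    return [[a & b for a, b in zip(r1, r2)] for r1, r2 in zip(no_h, no_v)]


def dilate(img):
    """Grow ink by one pixel in every direction (step 2, 3x3 dilation)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out
```

Note that run-based line removal also erases the parts of a numeral that sit on a line, and too much dilation can merge neighbouring digits, so it is worth checking the intermediate images by hand.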