Q: OCR on document with registerable marks?


Zunair Fayaz

Oct 16, 2014, 11:04:58 AM
to tesser...@googlegroups.com
I need a best practice for running OCR on documents that have + signs used to align the documents.

Any known practice?

See the attached file, which I'm trying to OCR with perfect results.
Currently, I crop the top 18 percent of the document's height and, if needed, remove the border using Accusoft ScanFix.
Then I OCR just that region, so I get some blank lines, then + +, then the numbers.

My problem right now is that when all characters are allowed, 1 becomes i because of a speck in this document. (I can despeckle if there is no other way to improve this.)
If I restrict recognition to digits only (0-9), I get a weird result: an extra 5 just below the speck.

Is there a simple way to find this line and use a constant height to OCR it, so that the speck will not be in that rectangle?
Is there a way to get the positions of those + signs in pixels?

Please advise.


Thank you

00000039_x.tif

Zunair Fayaz

Oct 20, 2014, 8:36:58 AM
to tesser...@googlegroups.com
Anyone?

Tom Morris

Oct 20, 2014, 1:57:35 PM
to tesser...@googlegroups.com
On Thursday, October 16, 2014 11:04:58 AM UTC-4, Zunair Fayaz wrote:
I need a best practice for running OCR on documents that have + signs used to align the documents.

Those are typically referred to as "registration marks".
 
Any known practice?

I would have thought they'd be pretty easy to detect using a simple black/white histogram on the scanned rows of pixels at the top of the page.  Have you tried that?  What approaches have you tried?
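
A minimal sketch of that histogram idea, assuming the scan has already been binarized into rows of 0/1 values (1 = black). The helper names (`black_counts`, `find_runs`, `column_counts`) and the tiny synthetic page are made up for illustration; a real scan would need a noise threshold above 1 and probably a size check to tell a + mark from other ink.

```python
def black_counts(rows):
    """Count black pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in rows]


def find_runs(counts, min_count=1):
    """Return (start, end) pairs, end exclusive, for each run of
    consecutive entries whose count is at least min_count."""
    runs = []
    start = None
    for i, c in enumerate(counts):
        if c >= min_count:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(counts)))
    return runs


def column_counts(rows, top, bottom):
    """Count black pixels per column, looking only at rows top..bottom-1."""
    width = len(rows[0])
    return [sum(rows[y][x] for y in range(top, bottom)) for x in range(width)]


# Synthetic page (hypothetical data): two 3x3 "+" marks on rows 1-3.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

# 1) Row histogram: find the first horizontal band that contains ink.
top, bottom = find_runs(black_counts(page))[0]

# 2) Column histogram inside that band: each run is one mark.
mark_boxes = [(left, top, right, bottom)
              for left, right in find_runs(column_counts(page, top, bottom))]
print(mark_boxes)  # (left, top, right, bottom) in pixels, right/bottom exclusive
```

The boxes give the pixel positions of the marks; everything below the band's `bottom` row is where the text line would be searched for.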
 

See the attached file, which I'm trying to OCR with perfect results.
Currently, I crop the top 18 percent of the document's height and, if needed, remove the border using Accusoft ScanFix.
Then I OCR just that region, so I get some blank lines, then + +, then the numbers.

My problem right now is that when all characters are allowed, 1 becomes i because of a speck in this document. (I can despeckle if there is no other way to improve this.)
If I restrict recognition to digits only (0-9), I get a weird result: an extra 5 just below the speck.

Is there a simple way to find this line and use a constant height to OCR it, so that the speck will not be in that rectangle?
Is there a way to get the positions of those + signs in pixels?

If it's always just a single isolated line, it should be pretty easy to detect even without the registration marks.
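
As a rough sketch of what that could look like, again assuming a binarized image stored as rows of 0/1 values: rows with too little ink are ignored (so a lone speck does not count as a line), the tallest remaining band is taken as the text line, and a fixed-height window is centred on it. The function names and the `min_black` / `min_height` thresholds are assumptions for illustration and would need tuning on real scans.

```python
def find_line_band(rows, min_black=2, min_height=2):
    """Return (top, bottom), bottom exclusive, of the tallest run of rows
    that each contain at least min_black black pixels, or None."""
    best = None
    start = None
    # The empty sentinel row closes a run that reaches the last image row.
    for y, row in enumerate(list(rows) + [[]]):
        if sum(row) >= min_black:
            if start is None:
                start = y
        elif start is not None:
            height = y - start
            if height >= min_height and (best is None or height > best[1] - best[0]):
                best = (start, y)
            start = None
    return best


def fixed_height_crop(rows, band, height):
    """Return (top, cropped_rows) for a window of constant height centred
    on the band and clamped to the image bounds."""
    centre = (band[0] + band[1]) // 2
    top = max(0, min(centre - height // 2, len(rows) - height))
    return top, rows[top:top + height]


# Synthetic strip (hypothetical data): a speck on row 1, text on rows 4-6.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

band = find_line_band(page)
crop_top, crop = fixed_height_crop(page, band, height=3)
print(band, crop_top)
```

Only the cropped rows would then be handed to the OCR engine, so a speck outside the window never reaches it. A speck that touches the line itself would still need despeckling.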

Tom