Unable to recognize numbers within boxes using Tesseract in C#


shripad shirsat

Aug 27, 2016, 11:29:52 AM
to tesseract-ocr

I am having trouble recognizing numbers in a PDF when they are printed inside boxes. I am using Tesseract from C# in my project. Could someone give me a clue, a hint, or a code snippet showing how to approach this? Please find the PDF attached.
Test.pdf

Quan Nguyen

Aug 27, 2016, 12:34:10 PM
to tesseract-ocr
Deskewing, converting to grayscale, removing lines, and binarizing produced the image:


and OCRed text:

l4|0|0l2|1l1>°l0|7l

So if you could also remove the vertical lines, the result would improve further.
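
The vertical-line removal suggested above can be sketched roughly as follows. This is only an illustration, written in Python for brevity (the loops port directly to C#), and it assumes the image has already been binarized into rows of 0 (white) and 1 (black). The names `remove_vertical_lines` and `min_run` are hypothetical and are not part of Tesseract or its C# wrapper. The idea is to clear any vertical run of black pixels that is at least `min_run` pixels tall, which box borders usually are and digit strokes usually are not.

```python
def remove_vertical_lines(grid, min_run):
    """Return a copy of a 0/1 image with long vertical black runs cleared.

    grid    -- list of equal-length rows; 1 = black pixel, 0 = white pixel
    min_run -- vertical runs of at least this many black pixels are erased
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    out = [row[:] for row in grid]  # copy so the input stays untouched

    for x in range(width):
        y = 0
        while y < height:
            if grid[y][x] == 1:
                start = y
                # Walk to the end of this run of black pixels.
                while y < height and grid[y][x] == 1:
                    y += 1
                # Erase the run only if it is long enough to be a border.
                if y - start >= min_run:
                    for yy in range(start, y):
                        out[yy][x] = 0
            else:
                y += 1
    return out


# Tiny example: a 5-pixel-tall border in column 0 next to a short stroke.
image = [
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
]
cleaned = remove_vertical_lines(image, min_run=4)
```

In this sketch `min_run` has to be larger than the tallest glyph, otherwise the stem of a digit such as 1 would be erased along with the box borders.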

shripad shirsat

Aug 30, 2016, 9:09:55 AM
to tesseract-ocr
Thank you very much for your valuable suggestion. Could you help me with how to remove the horizontal lines? I am processing this image in C# code. Is there a tool I can use to remove the horizontal lines, or a code snippet I can refer to?

Quan Nguyen

Sep 1, 2016, 12:17:43 AM
to tesseract-ocr
See Pix.RemoveLines method.
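
For readers who want to see what a horizontal line-removal step does in principle, here is a minimal sketch. It is not the implementation behind `Pix.RemoveLines`; it is a simple run-length approach shown in Python for brevity, assuming a binarized image stored as rows of 0 (white) and 1 (black). The names `remove_horizontal_lines` and `min_run` are hypothetical.

```python
from itertools import groupby


def remove_horizontal_lines(grid, min_run):
    """Return a copy of a 0/1 image with long horizontal black runs cleared.

    grid    -- list of rows; 1 = black pixel, 0 = white pixel
    min_run -- horizontal runs of at least this many black pixels are erased
    """
    cleaned = []
    for row in grid:
        new_row = []
        # groupby splits the row into consecutive runs of equal values.
        for value, run in groupby(row):
            length = len(list(run))
            if value == 1 and length >= min_run:
                new_row.extend([0] * length)      # long black run: erase it
            else:
                new_row.extend([value] * length)  # keep everything else
        cleaned.append(new_row)
    return cleaned


image = [
    [1, 1, 1, 1, 1, 1],  # box border: one long run
    [0, 1, 1, 0, 1, 0],  # parts of digits: short runs
    [1, 1, 1, 1, 1, 1],  # box border: one long run
]
cleaned = remove_horizontal_lines(image, min_run=5)
```

Because only runs of at least `min_run` pixels are erased, short horizontal strokes that belong to characters are left alone, provided the threshold is chosen wider than the widest glyph.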

ble...@gmail.com

Feb 9, 2017, 7:01:50 AM
to tesseract-ocr

Hello and thank you for the useful suggestion.

Would you happen to know the reason why numbers printed within boxes cannot be parsed and are ignored?

I am working on scenarios where numbers within closed boxes are very common, and removing the horizontal lines has various side effects on other pieces of text in my images.

Is there a reason for this, and is there perhaps another way to make Tesseract detect the numbers printed within boxes (for example, by passing a parameter)?

Thank you in advance for your answer.