OCR of FAX Images

farhad khalafi

Dec 4, 2018, 2:39:33 PM
to tesseract-ocr
Hello,

Some older fax machines used different DPI in horizontal and vertical directions. This often resulted in images with jagged lines as in the file that I have attached to this post.

I am trying to find a way to smooth these out to improve OCR accuracy. The images are already in 1-bit monochrome and compressed using CCITT G3 (FAX) format.

My objectives are to smooth the lines and edges and, if possible, to remove the grid lines. These are archived images, and the source paper documents are no longer available.

Any guidance on appropriate image processing algorithms (preferably using Leptonica) would be greatly appreciated.


Thanks,
Farhad
Fax.png
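
For the grid-line part of the question, one possible starting point is to treat any very long horizontal or vertical run of black pixels as a ruling line and clear it. The sketch below is plain Python, not Leptonica's API. It assumes the image has already been decoded into a list of rows of 0/1 values with 1 meaning black, and the names `remove_long_runs`, `remove_grid_lines`, and `min_len` are hypothetical. Jagged or broken grid lines may fall below the threshold, so `min_len` would need tuning on the real scans.

```python
def remove_long_runs(img, min_len):
    """Return a copy of img with every horizontal run of 1s
    that is at least min_len pixels long set to 0."""
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        x, w = 0, len(row)
        while x < w:
            if row[x] == 1:
                start = x
                while x < w and row[x] == 1:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        out[y][i] = 0
            else:
                x += 1
    return out


def transpose(img):
    """Swap rows and columns so vertical runs become horizontal."""
    return [list(col) for col in zip(*img)]


def remove_grid_lines(img, min_len):
    """Clear long runs in both directions.

    Both passes read the original image, so a pixel where two
    grid lines cross is removed regardless of pass order.
    """
    horiz = remove_long_runs(img, min_len)
    vert = transpose(remove_long_runs(transpose(img), min_len))
    return [[h & v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]


# Tiny example: one horizontal line (row 3), one vertical line
# (column 2), and a two-pixel mark that should survive.
sample = [[0] * 7 for _ in range(7)]
for i in range(7):
    sample[3][i] = 1
    sample[i][2] = 1
sample[1][4] = 1
sample[1][5] = 1

cleaned = remove_grid_lines(sample, min_len=6)
```

Note that this also erases the part of any character stroke that lies exactly on a grid line, so text touching the grid may need a small repair step afterwards.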

John Muccigrosso

Dec 6, 2018, 11:50:20 AM
to tesseract-ocr
Can you use ImageMagick to replace some of the more obvious patterns with the correct ones, that is, to unshift the little bars? It might have some false positives, but from the looks of it, I'd guess that it would be an improvement.

farhad khalafi

Dec 6, 2018, 12:57:19 PM
to tesseract-ocr
Thanks for your suggestion. Do you know the specific ImageMagick commands that will achieve this shifting of lines? 

One idea I had was to fill in small holes that were bounded on three sides and also remove speckles that were attached on only one side. But this will probably distort the characters too much.
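
The hole-filling and speckle-removal idea above can be sketched as a single pass over the bitmap that counts each pixel's black 4-neighbours. This is only an illustration in plain Python under stated assumptions: the image is a list of rows of 0/1 values with 1 meaning black, and `count_black_4`, `smooth_once`, `fill_at`, and `drop_at` are hypothetical names, not functions from Leptonica or ImageMagick.

```python
def count_black_4(img, y, x):
    """Count black (1) pixels among the up/down/left/right neighbours.
    Positions outside the image count as white."""
    h, w = len(img), len(img[0])
    n = 0
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        yy, xx = y + dy, x + dx
        if 0 <= yy < h and 0 <= xx < w and img[yy][xx] == 1:
            n += 1
    return n


def smooth_once(img, fill_at=3, drop_at=1):
    """One smoothing pass.

    A white pixel with at least fill_at black neighbours is filled
    (a hole bounded on three or four sides). A black pixel with at
    most drop_at black neighbours is cleared (a speckle attached on
    one side, or isolated). Counts are taken from the input image,
    so the result does not depend on scan order.
    """
    out = [row[:] for row in img]
    for y in range(len(img)):
        for x in range(len(img[0])):
            n = count_black_4(img, y, x)
            if img[y][x] == 0 and n >= fill_at:
                out[y][x] = 1
            elif img[y][x] == 1 and n <= drop_at:
                out[y][x] = 0
    return out


# A solid block with a one-pixel notch in its top edge and a
# one-pixel bump sticking out above it.
jagged = [
    [0, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
smoothed = smooth_once(jagged)
```

The distortion concern is real with this rule: the end pixel of any one-pixel-wide stroke also has a single black neighbour, so thin strokes get shortened by one pixel per pass. Running only one pass, or applying the rule only after upscaling the image, are possible ways to limit that.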