Column splitting failed around fuzzy line


Ewan Mellor

Apr 11, 2018, 2:16:59 AM
to tesseract-ocr

Hi,


I am using Tesseract 4 (git 10f4998a) to process a file with two columns.  A snippet of the image is shown below.  The problem is that there is a fuzzy line between the two columns, and the column detector has got confused.  I've ended up with one block covering the first column up to "The" on the second line, but then a block covering both columns with the "patient has ..." all the way across to "history of low".


I've looked in the debug views, and it looks to me like the line removal hasn't managed to remove that fuzzy line down the middle.  The "good" is then close enough that the column finder is deciding to merge the two blocks on that line.


Looking at the code in linefind.cpp and colfind.cpp, I see lots of constants for various thresholds, but none of them appear to be configurable, and I'm not sure which way to go now.  Would it be better to work on the line detector in linefind.cpp and try to get rid of that vertical line?  Or would I be better off running a columnar histogram and doing the column splitting myself?  Or should I ignore the fact that the line hasn't been removed, and concentrate on tightening up the column finder so that it can separate these two columns correctly?  It seems to me that there's enough of a gap there that it ought to be able to separate the columns (it does a pretty good job on the rest of the document, so it can't be far off).
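
For reference, the columnar-histogram option could look roughly like the sketch below. It is only an illustration under assumptions, not anything Tesseract does internally: the page is assumed to be already binarized into a list of rows of 0/1 values (1 = ink), and the function names and the `max_ink` / `max_rule_width` parameters are made up for this example. It counts ink per pixel column, finds runs of near-empty columns, and merges runs that are separated only by a narrow spike, which is what a thin vertical rule in the gutter would look like in the profile.

```python
def column_profile(rows):
    """Count ink pixels (value 1) in each pixel column of a binary image."""
    return [sum(col) for col in zip(*rows)]


def find_gutter(rows, max_ink=0, max_rule_width=3):
    """Return (start, end) of the widest interior gap between columns, or None.

    rows           -- binary image as a list of equal-length rows of 0/1
    max_ink        -- a pixel column with at most this much ink counts as empty
    max_rule_width -- (assumed) widest spike, in pixels, that is treated as a
                      rule line inside the gutter rather than as text
    """
    profile = column_profile(rows)

    # Collect runs of near-empty columns as [start, end) pairs.
    # The extra sentinel value closes a run that reaches the right edge.
    runs = []
    start = None
    for x, ink in enumerate(profile + [max_ink + 1]):
        if ink <= max_ink:
            if start is None:
                start = x
        elif start is not None:
            runs.append([start, x])
            start = None

    # Merge neighbouring runs separated only by a narrow spike of ink.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_rule_width:
            merged[-1][1] = run[1]
        else:
            merged.append(list(run))

    # Ignore the page margins; keep only gaps strictly inside the page.
    width = len(profile)
    interior = [r for r in merged if r[0] > 0 and r[1] < width]
    if not interior:
        return None
    best = max(interior, key=lambda r: r[1] - r[0])
    return (best[0], best[1])
```

Given a result `(start, end)`, the left column would be `row[:start]` and the right column `row[end:]` for each row, which also drops the rule itself. On a real scan `max_ink` would probably need to be above zero to tolerate noise, and the merge step could in principle join gaps between characters on a very short snippet, so the thresholds are a judgment call.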


Any recommendations would be appreciated.


Thanks,


Ewan.





ShreeDevi Kumar

Apr 11, 2018, 6:28:42 AM
to tesser...@googlegroups.com
Try looking at the Leptonica sample programs for column splitting to see whether you can preprocess the image better before giving it to Tesseract.
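
As a rough idea of what such preprocessing might involve, here is a minimal sketch that erases long vertical runs of ink before OCR. It does not use Leptonica's API and is not taken from its sample programs; it assumes the image is already binarized into a list of rows of 0/1 values (1 = ink), and `erase_long_vertical_runs` and `min_run` are hypothetical names chosen for illustration.

```python
def erase_long_vertical_runs(rows, min_run):
    """Return a copy of rows with long vertical ink runs cleared.

    rows    -- binary image as a list of equal-length rows of 0/1
    min_run -- (assumed) shortest vertical run, in pixels, treated as a
               rule line; it should be well above the text line height
    """
    height = len(rows)
    width = len(rows[0]) if rows else 0
    cleaned = [list(row) for row in rows]  # leave the input untouched

    for x in range(width):
        y = 0
        while y < height:
            if rows[y][x]:
                # Walk to the end of this vertical run of ink.
                run_start = y
                while y < height and rows[y][x]:
                    y += 1
                # Clear the run only if it is long enough to be a rule.
                if y - run_start >= min_run:
                    for yy in range(run_start, y):
                        cleaned[yy][x] = 0
            else:
                y += 1
    return cleaned
```

A fuzzy or broken line will not form one continuous run, so in practice the gaps would probably need closing first (or `min_run` lowering), at the risk of clipping tall glyph strokes.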

