segmentation algorithm


Hovnatan Karapetyan

Jun 23, 2016, 6:11:34 AM
to scantailor-devel
Hi,

I was wondering what algorithm Scan Tailor uses for segmenting text from non-text parts of the page. Can you please help?

Thanks a lot,
Hovnatan

Joseph Artsimovich

Jun 23, 2016, 8:23:12 AM
to scantail...@googlegroups.com
On 23/06/2016 11:11, Hovnatan Karapetyan wrote:
> Hi,
>
> I was wondering what algorithm Scan Tailor uses for segmenting text
> from non-text parts of the page. Can you please help?

It's picture / non-picture segmentation rather than a text / non-text one.
Look at detectPictures() in filters/output/OutputGenerator.cpp:
https://github.com/scantailor/scantailor/blob/master/filters/output/OutputGenerator.cpp#L1335
It's a simple custom algorithm based on grayscale morphology.
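
To illustrate the general idea of picture detection with grayscale
morphology, here is a minimal Python sketch. It is not a port of
detectPictures(): the function names, the square window, the radius and
the threshold are all assumptions made for illustration. A grayscale
closing (max filter, then min filter) wipes out dark strokes thinner
than the window, so whatever is still dark afterwards is treated as
picture content.

```python
# Hypothetical sketch of picture / non-picture masking with grayscale
# morphology. NOT Scan Tailor's detectPictures(); names, window shape,
# radius and threshold are illustrative assumptions.
# Image format: list of rows, 0 = black, 255 = white.

def _window_filter(img, radius, pick):
    """Apply `pick` (min or max) over a square window, clamped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = pick(
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            )
    return out


def picture_mask(img, radius=2, dark_threshold=128):
    """Return a boolean mask that is True where large dark areas survive.

    Grayscale closing = dilation (max) followed by erosion (min).
    Dark features narrower than the window (text strokes) disappear,
    while larger dark areas are restored to their original extent.
    """
    dilated = _window_filter(img, radius, max)
    closed = _window_filter(dilated, radius, min)
    return [[value < dark_threshold for value in row] for row in closed]
```

With a 3x3 window (radius=1), a one-pixel-wide dark stroke is removed
entirely, while a 6x6 dark block keeps its full area in the mask.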

There is also text line segmentation / localization as part of
dewarping. For that one you should look at Scan Tailor Experimental:
https://github.com/Tulon/scantailor/blob/experimental/dewarping/TextLineSegmenter.cpp
The idea is to first emphasize text lines with a filter bank of oriented
Gaussians, similar to what is done in this paper:
https://scholar.google.com/scholar?q=Text-Line+Extraction+Using+a+Convolution+of+Isotropic+Gaussian+Filter+with+a+Set+of+Line+Filters
From there, a custom ridge detector (based on graph algorithms) is applied.
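
The filter-bank step can be sketched roughly as follows. This is a
simplified illustration, not the code in TextLineSegmenter.cpp and not
the exact filters from the paper: it assumes plain anisotropic Gaussian
kernels, a handful of fixed angles, and a per-pixel maximum over the
responses. All names and parameter values are hypothetical. The ridge
detection step is not shown.

```python
import math

# Hypothetical sketch of emphasizing text lines with oriented Gaussians.
# Input: `ink`, a list of rows where 1 = ink and 0 = background.


def oriented_gaussian(radius, sigma_long, sigma_short, theta):
    """Anisotropic Gaussian elongated along angle theta (radians), sum = 1."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for dy in range(-radius, radius + 1):
        row = []
        for dx in range(-radius, radius + 1):
            u = dx * cos_t + dy * sin_t    # coordinate along the long axis
            v = -dx * sin_t + dy * cos_t   # coordinate across the long axis
            row.append(math.exp(-0.5 * ((u / sigma_long) ** 2
                                        + (v / sigma_short) ** 2)))
        kernel.append(row)
    total = sum(map(sum, kernel))
    return [[k / total for k in row] for row in kernel]


def convolve(ink, kernel):
    """Zero-padded filtering; the kernel is point-symmetric, so
    correlation and convolution give the same result here."""
    h, w = len(ink), len(ink[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                yy = y + dy
                if not 0 <= yy < h:
                    continue
                for dx in range(-r, r + 1):
                    xx = x + dx
                    if 0 <= xx < w:
                        acc += ink[yy][xx] * kernel[dy + r][dx + r]
            out[y][x] = acc
    return out


def emphasize_lines(ink, angles, radius=4, sigma_long=3.0, sigma_short=1.0):
    """Per-pixel maximum response over a bank of oriented Gaussians."""
    h, w = len(ink), len(ink[0])
    responses = [
        convolve(ink, oriented_gaussian(radius, sigma_long, sigma_short, a))
        for a in angles
    ]
    return [[max(r[y][x] for r in responses) for x in range(w)]
            for y in range(h)]
```

On a dashed horizontal row of ink, the kernel aligned with the row
responds more strongly than the perpendicular one, and the gaps between
the dashes get a high response too, which is what turns a row of
separate glyphs into a continuous ridge.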

Finally, there is a very crude text / non-text line labelling as part of
content box detection. Look at estimateTextMask()
in filters/select_content/ContentBoxFinder.cpp:
https://github.com/Tulon/scantailor/blob/experimental/filters/select_content/ContentBoxFinder.cpp#L73
This algorithm is not terribly accurate. It labels a binary image region
as text / non-text based on the number of ultimate eroded points
(abbreviated as UEPs in the source code) in that region.
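
For readers unfamiliar with UEPs: ultimate eroded points can be found
as the regional maxima of a distance transform. The Python sketch below
counts them for a binary mask. It is only an illustration under stated
assumptions (city-block distance, 8-connected plateaus, hypothetical
function names) and is not taken from estimateTextMask(). How the count
is turned into a text / non-text label is not covered here, so the
sketch stops at counting.

```python
from collections import deque

# Hypothetical sketch: count ultimate eroded points (UEPs) in a binary
# mask as regional maxima of the distance transform.
# Input: `mask`, a list of rows where truthy = foreground.


def distance_transform(mask):
    """City-block distance to the nearest background pixel (two passes).

    Pixels outside the image are treated as background.
    """
    h, w = len(mask), len(mask[0])
    big = h + w  # upper bound on any city-block distance in the image
    dist = [[big if mask[y][x] else 0 for x in range(w)] for y in range(h)]
    for y in range(h):                       # forward pass: up and left
        for x in range(w):
            if dist[y][x]:
                up = dist[y - 1][x] if y > 0 else 0
                left = dist[y][x - 1] if x > 0 else 0
                dist[y][x] = min(dist[y][x], up + 1, left + 1)
    for y in reversed(range(h)):             # backward pass: down and right
        for x in reversed(range(w)):
            if dist[y][x]:
                down = dist[y + 1][x] if y < h - 1 else 0
                right = dist[y][x + 1] if x < w - 1 else 0
                dist[y][x] = min(dist[y][x], down + 1, right + 1)
    return dist


def count_ueps(mask):
    """Count plateaus of equal distance that have no higher neighbour."""
    dist = distance_transform(mask)
    h, w = len(dist), len(dist[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y0 in range(h):
        for x0 in range(w):
            if dist[y0][x0] == 0 or seen[y0][x0]:
                continue
            level = dist[y0][x0]
            is_max = True
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            while queue:                     # flood-fill the plateau
                y, x = queue.popleft()
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if dist[ny][nx] > level:
                            is_max = False
                        elif dist[ny][nx] == level and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            count += is_max
    return count
```

For example, two 5x5 squares joined by a one-pixel-wide bridge yield
two UEPs (one per square), while a single thin stroke yields one.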

As usual, you can get a lot of insight into the inner workings of Scan
Tailor by enabling debug mode:
Menu -> Tools -> Debug Mode

--
Joseph Artsimovich

Hovnatan Karapetyan

Jun 23, 2016, 2:31:49 PM
to scantailor-devel
Thanks, I will look into the code.
Hovnatan