Edge detection algorithm used by tesseract

shahin youssefi

Jun 23, 2012, 2:43:53 AM
to tesseract-ocr
Hello dear friends,
I wonder if anybody knows which edge detection algorithm Tesseract 3.01
uses when finding connected components.
More specifically, in edgblob.cpp there is a function called
"extract_edges", which calls a function named "block_edges" that is
responsible for extracting edges and finding the outline of a block.
Correct me if I'm wrong, but it seems that "block_edges" doesn't use
well-known edge detection methods such as Canny, Sobel or Prewitt.
Thanks in advance.

Dmitri Silaev

Jun 23, 2012, 5:10:48 AM
to tesser...@googlegroups.com
block_edges() has nothing to do with edge detection. Tesseract does
not use edge detection at all. It first binarizes the entire image, then
extracts connected components (CCs). block_edges() is called to extract
the CCs' outlines from the binarized image.
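
The two steps described above (binarize, then extract connected components) can be sketched roughly as follows. This is only an illustration, not Tesseract's code: the fixed threshold of 128, the 8-connectivity, and the function names binarize and connected_components are all assumptions made for the example.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Return a 0/1 image: 1 (ink) where a pixel is darker than threshold.

    The fixed threshold is an assumption for this sketch; it is not
    how Tesseract chooses its threshold.
    """
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def connected_components(binary):
    """Group foreground pixels into 8-connected components.

    Returns a list of components, each a list of (row, col) pixels.
    """
    h = len(binary)
    w = len(binary[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components
```

Note that no gradient operator (Sobel, Canny, etc.) appears anywhere: once the image is binary, components are found purely by pixel adjacency.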

Warm regards,
Dmitri Silaev
www.CustomOCR.com

Dmitri Silaev

Jun 23, 2012, 10:19:55 AM
to tesser...@googlegroups.com
You were talking about Canny, Sobel, etc., and these do relate to edge detection in its common sense (http://en.wikipedia.org/wiki/Edge_detection). In this sense Tesseract does not do any edge detection.

Yes, one might call the process of finding CC contours in a binary image edge detection. But treating it as conventional edge detection would make the task completely degenerate, so applying conventional edge detection approaches would be totally unreasonable, and some of them would not be usable at all. Why is there a comment in the source code saying it's an "edge detector" when this notion has another common meaning? That question should be addressed to the developers. I suppose it is because internally they referred to CC contours as "edges" and called their method of contour extraction "crack edges".

I would refrain from considering myself an authority in all that's related to naming and notions, though.

What you have shown in your image is not what is produced by extract_edges() or block_edges(). Those build completely different structures, similar to what is commonly known as crack-coded CC boundaries.
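
To illustrate the general idea of crack boundaries (not the actual structures built by extract_edges() or block_edges()), here is a minimal sketch. A "crack" is taken here to be the unit-length border between a foreground pixel and a 4-neighbour background pixel, expressed in pixel-corner coordinates. The function name crack_edges and the set-of-segments representation are assumptions made for the example; a real crack code would chain these segments into ordered outlines.

```python
def crack_edges(binary):
    """Return the set of unit cracks between foreground (1) and background.

    Each crack is ((x0, y0), (x1, y1)) in pixel-corner coordinates,
    where pixel (row y, col x) spans corners (x, y) to (x + 1, y + 1).
    Pixels outside the image count as background.
    """
    h = len(binary)
    w = len(binary[0]) if h else 0

    def fg(y, x):
        return 0 <= y < h and 0 <= x < w and binary[y][x] == 1

    cracks = set()
    for y in range(h):
        for x in range(w):
            if not fg(y, x):
                continue
            if not fg(y - 1, x):  # background above: crack on top side
                cracks.add(((x, y), (x + 1, y)))
            if not fg(y + 1, x):  # background below: crack on bottom side
                cracks.add(((x, y + 1), (x + 1, y + 1)))
            if not fg(y, x - 1):  # background to the left
                cracks.add(((x, y), (x, y + 1)))
            if not fg(y, x + 1):  # background to the right
                cracks.add(((x + 1, y), (x + 1, y + 1)))
    return cracks
```

With this definition, a shape containing a closed loop produces cracks around the hole as well as around the outside, so it has an inner boundary in addition to the outer one.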


Warm regards,
Dmitri Silaev
www.CustomOCR.com


On Saturday, June 23, 2012 2:47:04 PM UTC+4, shahin youssefi wrote:
Dmitri, you are correct, this function only sets the bounding box of them, not exactly the CCs.
If the character has a closed curve in it, the inner area is returned as an outline, for example [this].
I've shown the result of the "extract_edges" in green lines.