"Line separator regions" capabilities?

Brad

Apr 27, 2022, 1:10:24 PM

to tesseract-ocr

For V5.10.0 of Tesseract, one of the changes is:

> Handle image and line separator regions in ALTO, hOCR and text output formats.

I'm curious about what this means. Can Tesseract be used to identify rectangles and such on an image that might surround a text region, and if so, is this what this is referring to? Are there any examples showing how this works?

Thanks,

Brad

Merlijn B.W. Wajer

Apr 27, 2022, 1:18:39 PM

to tesser...@googlegroups.com

Hi,

On 27/04/2022 19:07, Brad wrote:
> For V5.10.0 of Tesseract, one of the changes is:

(correction: version 5.1.0)

>> Handle image and line separator regions in ALTO, hOCR and text output
>> formats.
>
> I'm curious about what this means. Can Tesseract be used to identify
> rectangles and such on an image that might surround a text region, and if
> so, is this what this is referring to? Are there any examples showing how
> this works?

Here is the commit in question:
https://github.com/tesseract-ocr/tesseract/commit/424b17f997363670d187f42c43408c472fe55053
(for some background see
https://github.com/tesseract-ocr/tesseract/pull/3710)

The commit adds new elements to the output formats; in hOCR, for
example, these are "ocr_photo" and "ocr_separator". You can see how the
results are iterated over in the source if you would like to use that
yourself.

My/our immediate use case is detecting photos on pages of books and
articles, which will be emitted as ocr_photo when outputting hOCR.
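
As a rough sketch of how such regions could be picked out of hOCR output
afterwards, the Python below collects the bounding boxes of "ocr_photo"
and "ocr_separator" elements. The sample markup, the `RegionCollector`
class and the `find_regions` helper are assumptions made up for
illustration; real Tesseract output may differ in detail, so check it
against an actual .hocr file before relying on this.

```python
import re
from html.parser import HTMLParser

# Hypothetical hOCR fragment; real Tesseract output may differ in detail.
SAMPLE_HOCR = """
<div class='ocr_page' id='page_1' title='bbox 0 0 1000 1400'>
  <div class='ocr_photo' id='block_1_1' title="bbox 100 120 500 480"></div>
  <div class='ocr_separator' id='block_1_2' title="bbox 90 500 910 504"></div>
  <div class='ocr_carea' id='block_1_3' title="bbox 100 520 900 1300"></div>
</div>
"""

# hOCR stores the bounding box in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class RegionCollector(HTMLParser):
    """Collect (class, bbox) pairs for the element classes of interest."""

    def __init__(self, wanted=("ocr_photo", "ocr_separator")):
        super().__init__()
        self.wanted = set(wanted)
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        match = BBOX_RE.search(attr.get("title") or "")
        if match is None:
            return
        bbox = tuple(int(v) for v in match.groups())
        for cls in classes:
            if cls in self.wanted:
                self.regions.append((cls, bbox))


def find_regions(hocr_text):
    """Return a list of (class_name, (x0, y0, x1, y1)) tuples."""
    collector = RegionCollector()
    collector.feed(hocr_text)
    return collector.regions


regions = find_regions(SAMPLE_HOCR)
for kind, bbox in regions:
    print(kind, bbox)
```

To try it on real output, read a .hocr file produced by Tesseract and
pass its contents to `find_regions` in place of the sample string.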

I don't know if this can help in your specific use case, but if you're
interested in finding images, it will help for sure. I cannot really
comment on the ocr_separator parts so much.

Regards,
Merlijn

Brad

Apr 27, 2022, 1:53:04 PM

to tesseract-ocr

Thanks for the information, Merlijn. Will take a look at some of the links you posted.
