Hi,
On 27/04/2022 19:07, Brad wrote:
> For V5.10.0 of Tesseract, one of the changes is:
(correction: version 5.1.0)
>> Handle image and line separator regions in ALTO, hOCR and text output
> formats.
>
> I'm curious about what this means. Can Tesseract be used to identify
> rectangles and such on an image that might surround a text region, and if
> so, is this what this is referring to? Are there any examples showing how
> this works?
Here is the commit in question:
https://github.com/tesseract-ocr/tesseract/commit/424b17f997363670d187f42c43408c472fe55053
(for some background see
https://github.com/tesseract-ocr/tesseract/pull/3710)
The elements added to, say, the hOCR output have the classes "ocr_photo"
and "ocr_separator". If you would like to use this yourself, you can see
in the source how the results are iterated over.
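If you only need the regions from an hOCR file that has already been
written, a minimal Python sketch along these lines might be enough. It
assumes that each region is an element with class "ocr_photo" or
"ocr_separator" and carries the usual hOCR title="bbox x0 y0 x1 y1"
attribute. The sample markup, the function name find_regions and the
helper class are made up for illustration and are not taken from
Tesseract itself.

```python
from html.parser import HTMLParser
import re

# The two class names come from the message above; everything else in
# this sketch (names, sample markup) is an assumption for illustration.
REGION_CLASSES = {"ocr_photo", "ocr_separator"}
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class RegionFinder(HTMLParser):
    """Collect (class, bbox) pairs for photo and separator elements."""

    def __init__(self):
        super().__init__()
        self.regions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        match = BBOX_RE.search(attrs.get("title") or "")
        # bbox is None when the title attribute has no bbox property.
        bbox = tuple(int(v) for v in match.groups()) if match else None
        for cls in sorted(classes & REGION_CLASSES):
            self.regions.append((cls, bbox))


def find_regions(hocr_text):
    """Return a list of (class_name, (x0, y0, x1, y1)) tuples."""
    finder = RegionFinder()
    finder.feed(hocr_text)
    return finder.regions


# Hand-written hOCR fragment, not real Tesseract output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
  <div class='ocr_photo' id='block_1_1' title="bbox 100 200 500 600"></div>
  <div class='ocr_separator' id='block_1_2' title="bbox 50 700 950 705"></div>
  <div class='ocr_carea' id='block_1_3' title="bbox 50 720 950 1400"></div>
</div>
"""

if __name__ == "__main__":
    for cls, bbox in find_regions(SAMPLE):
        print(cls, bbox)
```

To try it on real output, you could generate an hOCR file with something
like "tesseract page.png out hocr" (page.png being a placeholder) and
pass the contents of out.hocr to find_regions.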
My/our immediate use case is detecting photos on pages of books and
articles, which will be emitted as ocr_photo when outputting hOCR.
I don't know if this can help in your specific use case, but if you're
interested in finding images, it will help for sure. I cannot really
say much about the ocr_separator part.
Regards,
Merlijn