Page layout analysis module

Prodoc

Jun 19, 2011, 4:07:03 PM
to tesseract-ocr
Hi,

In version 3 of tesseract-ocr there's a new page layout analysis
module. I'm interested to learn in what way it is used and how it can
be used.

Does it provide additional user functionality or is it only used
internally? For example, can I query it somehow to output all recognized
text areas (position and dimensions) without their actual text content?
Does it have any influence on the mark-up of the text output, e.g.
additional line breaks between text when a new paragraph starts?
I've played with the different pagesegmode values (0-3), but I get
the exact same output for each of them. Do these settings have
anything to do with the layout analysis?

If recognizing text areas is what it does but you can't output just
the position and dimensions of them, it would be great to see this as
a new feature. In a program like gImageReader you have to do this
manually, OCRFeeder tries to do it automatically. If tesseract-ocr's
analysis is more accurate, one could use that as an input for
OCRFeeder again.

Yours,

Age Bosma

patrickq

Jun 20, 2011, 5:56:33 AM
to tesseract-ocr
You can definitely get just layout analysis before text recognition -
look at the FindLinesCreateBlockList() API and the BLOCK_LIST data
structure. You can then iterate through that structure to look at
blocks and rows within these blocks. Keep in mind that a sentence in
the image could be broken out into separate boxes altogether if you
have anything more complex than a simple page, so you'll have to
stitch together rows from entirely different boxes yourself, based on
their coordinates. There are even cases where you might get
"Patrick" returned as one row containing "Ptrik" and one row containing
"ic" - rare, but it happens too, especially when the text line has a
slope (even a very moderate one).

Patrick
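
The stitching step described above can be sketched roughly as follows. This is a minimal illustration and not part of Tesseract: the `Row` class, the `stitch_rows` helper and the 0.5 overlap threshold are all assumptions made for this example. It takes row fragments that already carry bounding boxes, groups fragments that overlap vertically, and joins each group from left to right.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    """A recognized row fragment with its bounding box (hypothetical structure)."""
    text: str
    left: int
    top: int
    width: int
    height: int

    @property
    def bottom(self) -> int:
        return self.top + self.height


def vertical_overlap(a: Row, b: Row) -> float:
    """Fraction of the shorter row's height that both rows share vertically."""
    shared = min(a.bottom, b.bottom) - max(a.top, b.top)
    shortest = min(a.height, b.height)
    return max(shared, 0) / shortest if shortest > 0 else 0.0


def stitch_rows(rows: List[Row], min_overlap: float = 0.5) -> List[str]:
    """Group row fragments into text lines and join each group left to right."""
    lines: List[List[Row]] = []
    for row in sorted(rows, key=lambda r: r.top):
        for line in lines:
            # Compare with the fragment added last, so that a gently
            # sloping line can still be followed across fragments.
            if vertical_overlap(line[-1], row) >= min_overlap:
                line.append(row)
                break
        else:
            # No existing line overlaps enough: start a new one.
            lines.append([row])
    return [" ".join(r.text for r in sorted(line, key=lambda r: r.left))
            for line in lines]


# Made-up fragments: the first two sit on the same text line but would
# come from different boxes of the layout analysis.
fragments = [
    Row("some other text", left=350, top=22, width=250, height=20),
    Row("Some text", left=10, top=20, width=250, height=20),
    Row("A second line", left=10, top=60, width=300, height=20),
]
print(stitch_rows(fragments))
```

The overlap threshold is a judgment call; strongly sloped lines or multi-column pages would probably need something smarter, such as restricting the grouping to fragments that are also horizontally close.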

Age Bosma

Jun 20, 2011, 8:19:03 AM
to tesser...@googlegroups.com
Thank you for your reply.

Nice to learn that it is possible programmatically. I should, however,
have been clearer that I was referring to command-line functionality.

Would it be an idea to extend the tesseract command-line tool so that
it can output the dimensions of the containing blocks?

So one option to output just the text (current behaviour):
--------------------------------
Some text
And yet again some other text
--------------------------------

A second option to output the text marked with its block dimensions:
--------------------------------
[block:10,20,250,20]
Some text
[block:350,400,600,410]
And yet again some other text
--------------------------------

A third option to output just all blocks:
--------------------------------
[block:10,20,250,20]
[block:350,400,600,410]
--------------------------------

Yours,

Age
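
To make the proposal above concrete, here is a rough sketch of how a consumer such as OCRFeeder could read the suggested marker format. Note that this format is only the suggestion from this message and not something tesseract emits; `parse_blocks` is a hypothetical helper, and the four numbers are kept as plain integers because the proposal does not say whether they are x, y, width, height or two corner points.

```python
import re
from typing import List, Tuple

# Matches the proposed marker line, e.g. "[block:10,20,250,20]".
BLOCK_MARKER = re.compile(r"^\[block:(\d+),(\d+),(\d+),(\d+)\]$")


def parse_blocks(output: str) -> List[Tuple[Tuple[int, ...], str]]:
    """Return (four_numbers, text) pairs from the proposed marked-up output."""
    blocks: list = []
    for line in output.splitlines():
        match = BLOCK_MARKER.match(line.strip())
        if match:
            # A marker line opens a new block with no text yet.
            blocks.append((tuple(int(v) for v in match.groups()), []))
        elif blocks and line.strip():
            # Any other non-empty line belongs to the most recent block.
            blocks[-1][1].append(line)
    return [(box, "\n".join(text)) for box, text in blocks]


sample = """[block:10,20,250,20]
Some text
[block:350,400,600,410]
And yet again some other text"""

for box, text in parse_blocks(sample):
    print(box, repr(text))
```

With the third option (markers only), the same function would simply return each block with an empty text string.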

Teng Long

Mar 7, 2016, 3:56:40 AM
to tesseract-ocr, ageb...@gmail.com

Hi Age, I'm a newbie in OCR.
You mentioned 3 options for using tesseract.
Could you please tell me how to use these 3 options?

Any command is appreciated.
Like:
       tesseract sample2.jpg ouput -l eng -psm 3

Thank you !

Age Bosma

Mar 8, 2016, 10:21:00 AM
to tesseract-ocr, ageb...@gmail.com
Hi Teng,

The options I mention aren't available in tesseract. I listed them as suggestions for extending tesseract. They haven't been implemented as far as I know.

Best regards,

Age

zdenko podobny

Mar 8, 2016, 10:56:36 AM
to tesser...@googlegroups.com, ageb...@gmail.com
IMO it is implemented - in the hocr (xml) output or in the tsv output (in the master branch, a.k.a. 3.05)

Zdenko
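
As an illustration of the tsv route, the sketch below pulls only the block positions and dimensions out of tsv text. As far as I know, a tsv file can be produced by passing the `tsv` config name on the command line (something like `tesseract sample2.jpg output tsv`), and the columns are `level`, `page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `left`, `top`, `width`, `height`, `conf` and `text`, with level 2 marking a block; treat both as assumptions to check against your own version. The `block_boxes` helper and the sample rows are made up for this example.

```python
import csv
import io
from typing import Dict, List

# Assumed meaning of the "level" column: 1 page, 2 block, 3 paragraph,
# 4 line, 5 word.
BLOCK_LEVEL = 2


def block_boxes(tsv_text: str) -> List[Dict[str, int]]:
    """Return position and dimensions of every block-level record."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    boxes = []
    for record in reader:
        if int(record["level"]) == BLOCK_LEVEL:
            boxes.append({key: int(record[key])
                          for key in ("block_num", "left", "top",
                                      "width", "height")})
    return boxes


# Made-up sample data in the assumed column layout (not real tesseract output).
sample_rows = [
    ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
     "left", "top", "width", "height", "conf", "text"],
    ["1", "1", "0", "0", "0", "0", "0", "0", "1000", "1000", "-1", ""],
    ["2", "1", "1", "0", "0", "0", "10", "20", "250", "20", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "10", "20", "90", "20", "96", "Some"],
    ["2", "1", "2", "0", "0", "0", "350", "400", "250", "30", "-1", ""],
]
sample_tsv = "\n".join("\t".join(row) for row in sample_rows)

for box in block_boxes(sample_tsv):
    print(box)
```

This corresponds to the "just all blocks" option asked for earlier in the thread; keeping the level 5 records as well would give the words together with their boxes.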

Age Bosma

Mar 8, 2016, 11:19:30 AM
to tesseract-ocr, ageb...@gmail.com
Hi Zdenko,

Man, would I have liked getting that hint 5 years ago... :-/

Best regards,

Age Bosma