Tesseract 3 and paragraph separation

Demian Katz

Mar 22, 2012, 3:59:54 PM
to tesseract-ocr
Hello,

I'm using Tesseract 3 as a simple command-line tool to generate OCR.
It's doing a fairly good job, but I have one unmet need -- I need to
be able to separate paragraphs with blank lines. It would be great if
Tesseract could do this for me, but I'd even be happy if it could
include indentation whitespace in the text so I could perform the
splitting using my own software.

Is there any way to achieve this effect? On a somewhat related note,
is there any way to control Tesseract's command line behavior at all?
I see that it accepts a config file as a command-line option, but I'm
having no luck finding documentation on what options are available or
what they mean -- the provided examples don't actually seem to work,
and even searching the code hasn't given me anything resembling a list
of valid options.

Any help or pointers in the right direction would be greatly
appreciated!

thanks,
Demian

TP

Mar 23, 2012, 4:19:38 AM
to tesser...@googlegroups.com
On Thu, Mar 22, 2012 at 12:59 PM, Demian Katz <demia...@villanova.edu> wrote:
> Hello,
>
> I'm using Tesseract 3 as a simple command-line tool to generate OCR.
> It's doing a fairly good job, but I have one unmet need -- I need to
> be able to separate paragraphs with blank lines. It would be great if
> Tesseract could do this for me, but I'd even be happy if it could
> include indentation whitespace in the text so I could perform the
> splitting using my own software.
>
> Is there any way to achieve this effect?

One option is to output hOCR instead of plain UTF-8 text:

tesseract test.tif test hocr

where hocr is the name of the built-in config file that is in
tessdata/configs. This will generate test.html instead of test.txt.
See [1] for a bit more info on hOCR.
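
If you go the hOCR route, pulling the paragraphs back out of test.html is fairly easy. Here is a minimal Python sketch, not anything that ships with Tesseract. It assumes each paragraph is an element carrying the class ocr_par (as the hOCR format defines) and that paragraph elements are not nested; the names HocrParagraphs and hocr_to_text are made up for this example.

```python
from html.parser import HTMLParser


class HocrParagraphs(HTMLParser):
    """Collect the text of each hOCR paragraph (elements with class ocr_par)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs, one string each
        self._par_tag = None   # tag name of the paragraph currently open
        self._chunks = None    # text pieces of the paragraph currently open

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Assumption: paragraphs are marked class="ocr_par" and never nested.
        if "ocr_par" in classes and self._par_tag is None:
            self._par_tag = tag
            self._chunks = []

    def handle_endtag(self, tag):
        if self._par_tag is not None and tag == self._par_tag:
            # Collapse line breaks and indentation into single spaces.
            text = " ".join(" ".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._par_tag = None
            self._chunks = None

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)


def hocr_to_text(hocr):
    """Return the hOCR text with a blank line between paragraphs."""
    parser = HocrParagraphs()
    parser.feed(hocr)
    parser.close()
    return "\n\n".join(parser.paragraphs)


# Hand-written sample in the shape of hOCR output (not real Tesseract output).
sample = """
<div class='ocr_page'>
 <p class='ocr_par'>
  <span class='ocr_line'><span class='ocrx_word'>First</span> <span class='ocrx_word'>paragraph,</span></span>
  <span class='ocr_line'><span class='ocrx_word'>second</span> <span class='ocrx_word'>line.</span></span>
 </p>
 <p class='ocr_par'>
  <span class='ocr_line'><span class='ocrx_word'>Next</span> <span class='ocrx_word'>paragraph.</span></span>
 </p>
</div>
"""

print(hocr_to_text(sample))
```

To run it on real output, read test.html into a string and pass that to hocr_to_text. Note that this joins the lines of a paragraph with spaces and does not try to rejoin words hyphenated across lines.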

If you aren't afraid of doing some programming, look at the code for
TessBaseAPI::GetHOCRText [2]. It uses
res_it->IsAtBeginningOf(RIL_PARA) to figure out where each paragraph
begins.

> On a somewhat related note,
> is there any way to control Tesseract's command line behavior at all?
> I see that it accepts a config file as a command-line option, but I'm
> having no luck finding documentation on what options are available or
> what they mean -- the provided examples don't actually seem to work,
> and even searching the code hasn't given me anything resembling a list
> of valid options.
>
> Any help or pointers in the right direction would be greatly
> appreciated!
>
> thanks,
> Demian

AFAIK there aren't any good docs on config files yet (I'm working on
that). But look in tessdata/configs & tessdata/tessconfigs for example
config files. To get a list of possible config file parameters, see
this thread [3], in particular this message by me [4].

[1] http://en.wikipedia.org/wiki/HOCR

[2] http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.cpp#932

[3] http://groups.google.com/group/tesseract-ocr/browse_thread/thread/2eda8cda1d5557c1/

[4] http://groups.google.com/group/tesseract-ocr/msg/73565d039201f2e6

TP

Mar 23, 2012, 5:05:19 AM
to tesser...@googlegroups.com
On Fri, Mar 23, 2012 at 1:19 AM, TP <win...@gmail.com> wrote:
> If you aren't afraid of doing some programming, look at the code for
> TessBaseAPI::GetHOCRText. It uses
> res_it->IsAtBeginningOf(RIL_PARA) to figure out where each paragraph
> begins.

I took a look at TessBaseAPI::GetUTF8Text() [1], and that's an even
better place to start. You just add a linefeed after each paragraph's
text.

[1] http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.cpp#901

TP

Mar 23, 2012, 5:21:17 AM
to tesser...@googlegroups.com
On Thu, Mar 22, 2012 at 12:59 PM, Demian Katz <demia...@villanova.edu> wrote:
> I'm using Tesseract 3 as a simple command-line tool to generate OCR.
> It's doing a fairly good job, but I have one unmet need -- I need to
> be able to separate paragraphs with blank lines.

Hmmm, I just tried this on a sample image (something I should have
done first), and the latest trunk version (3.02) of tesseract already
puts blank lines between paragraphs.
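
With blank lines in the text output, doing the splitting in your own software gets simple. A rough Python sketch, assuming the output looks like what I saw (paragraphs separated by one or more blank lines); split_paragraphs is just a name I picked for the example:

```python
import re


def split_paragraphs(text):
    """Split OCR text into paragraphs at runs of one or more blank lines."""
    # A "blank" line may still contain spaces or a stray carriage return.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Join the lines of each paragraph into a single line of text.
    return [" ".join(block.split()) for block in blocks if block.strip()]


# Hand-written sample standing in for the contents of test.txt.
ocr_text = "First paragraph,\nsecond line.\n\nNext paragraph.\n"

for paragraph in split_paragraphs(ocr_text):
    print(paragraph)
```

This does not rejoin words that were hyphenated across a line break; that would need extra handling.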

Demian Katz

Mar 23, 2012, 7:23:17 AM
to tesser...@googlegroups.com
Thanks for all of the detailed information -- this is very helpful. I've been working from the 3.00 release (which I see isn't even the latest published version now -- I'm further behind the times than I realized!) and will try updating to the latest trunk next week.

Regarding the configuration files, I did try some of the samples included with 3.00, but I got error messages about invalid parameters. Perhaps this has all been fixed by 3.02; I'll follow up if I'm still having trouble after the upgrade.

Thanks again for your help!

- Demian