One option is to output hOCR instead of plain UTF-8 text. Run:
tesseract test.tif test hocr
where hocr is the name of the built-in config file that is in
tessdata/configs. This will generate test.html instead of test.txt.
See [1] for a bit more info on hOCR.
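If you go the hOCR route, the paragraphs can be pulled back out of test.html with a few lines of script. The following is only a minimal Python sketch, assuming the generated markup tags each paragraph with an element whose class is ocr_par (the class the hOCR format uses for paragraphs); check your own output first. The names HocrParagraphs and hocr_paragraphs are made up for illustration and are not part of Tesseract.

```python
from html.parser import HTMLParser


class HocrParagraphs(HTMLParser):
    """Collect the text of each element whose class list includes ocr_par."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraphs, in document order
        self._tag = None      # tag name of the paragraph element we are inside
        self._depth = 0       # nesting depth of elements with that tag name
        self._chunks = []     # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._tag is None:
            # Assumption: paragraphs carry class="ocr_par".
            if "ocr_par" in classes:
                self._tag, self._depth, self._chunks = tag, 1, []
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse all whitespace runs.
                text = " ".join(" ".join(self._chunks).split())
                self.paragraphs.append(text)
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)


def hocr_paragraphs(hocr_html):
    """Return a list with one string per ocr_par element."""
    parser = HocrParagraphs()
    parser.feed(hocr_html)
    parser.close()
    return parser.paragraphs
```

Typical use would be something like hocr_paragraphs(open("test.html", encoding="utf-8").read()), which returns one string per paragraph.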
If you aren't afraid of doing some programming, look at the code for
TessBaseAPI::GetHOCRText [2]. It uses
res_it->IsAtBeginningOf(RIL_PARA) to figure out where each paragraph
begins.
> On a somewhat related note,
> is there any way to control Tesseract's command line behavior at all?
> I see that it accepts a config file as a command-line option, but I'm
> having no luck finding documentation on what options are available or
> what they mean -- the provided examples don't actually seem to work,
> and even searching the code hasn't given me anything resembling a list
> of valid options.
>
> Any help or pointers in the right direction would be greatly
> appreciated!
>
> thanks,
> Demian
AFAIK there aren't any good docs on config files yet (I'm working on
that). But look in tessdata/configs & tessdata/tessconfigs for example
config files. To get a list of possible config file parameters, see
this thread [3], in particular this message by me [4].
[1] http://en.wikipedia.org/wiki/HOCR
[2] http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.cpp#932
[3] http://groups.google.com/group/tesseract-ocr/browse_thread/thread/2eda8cda1d5557c1/
[4] http://groups.google.com/group/tesseract-ocr/msg/73565d039201f2e6
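For what it's worth, the sample config files mentioned above are plain text with one "name value" pair per line. The Python sketch below renders such a file and builds the matching command line. It is an illustration under assumptions, not documentation: the helper names render_config and tesseract_command and the config name my_hocr are hypothetical, tessedit_create_hocr is assumed to be the parameter the built-in hocr config sets (compare with your own tessdata/configs/hocr), and the resulting file is assumed to be saved in tessdata/configs next to the built-in ones.

```python
def render_config(params):
    """Render parameters as config file text, one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in params.items())


def tesseract_command(image, output_base, *configs):
    """Build the argument list: tesseract <image> <output base> [config ...]."""
    return ["tesseract", str(image), str(output_base), *map(str, configs)]


# Assumed parameter name; verify it against tessdata/configs/hocr.
config_text = render_config({"tessedit_create_hocr": 1})

# "my_hocr" is a hypothetical config name used only for this example.
command = tesseract_command("test.tif", "test", "my_hocr")
```

Here config_text is what would be saved as the config file, and command is a list that can be passed to subprocess.run once that file is in place.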
I took a look at TessBaseAPI::GetUTF8Text() [1], and that's an even
better place to start. You just add a linefeed after each paragraph's
text.
[1] http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.cpp#901
Hmmm, I just tried this on a sample image (something I should have
done first), and the latest trunk version (3.02) of tesseract already
puts blank lines between paragraphs.
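With blank lines already in the text output, recovering the paragraphs is a matter of splitting on them. A minimal Python sketch, assuming the output file separates paragraphs with one or more blank lines as observed above; split_paragraphs is a made-up helper name.

```python
import re


def split_paragraphs(text):
    """Split OCR text on blank lines and unwrap the lines inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse line breaks and extra spaces within each paragraph.
    return [" ".join(block.split()) for block in blocks if block.strip()]
```

For example, split_paragraphs(open("test.txt", encoding="utf-8").read()) would give one string per paragraph.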
Regarding the configuration files, I did try some of the samples included with 3.00, but I got error messages about invalid parameters. Perhaps this has all been fixed by 3.02; I'll follow up if I'm still having trouble after the upgrade.
Thanks again for your help!
- Demian
________________________________________
From: tesser...@googlegroups.com [tesser...@googlegroups.com] On Behalf Of TP [win...@gmail.com]
Sent: Friday, March 23, 2012 5:21 AM
To: tesser...@googlegroups.com
Subject: Re: Tesseract 3 and paragraph separation