OCR Per Page Basis

Paul

Mar 7, 2012, 9:39:46 AM
to tesseract-ocr
Hi,

Is there any way to instruct tesseract via the command line to OCR
only specific pages of a multipage document? I know I could split the
file, but I don't really want the extra overhead of doing so.

Dmitri Silaev

Mar 7, 2012, 12:33:28 PM
to tesser...@googlegroups.com
No, at this time this is not possible via the command line. However,
it can easily be achieved programmatically.

Warm regards,
Dmitri Silaev
www.CustomOCR.com

> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-oc...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en

Paul

Mar 7, 2012, 12:42:17 PM
to tesseract-ocr
Thanks for the info. Can I assume, then, that it would be a fairly
trivial task for a developer to take the source code and add some
extra command line switches to make it possible to specify a page or
page range?

Dmitri Silaev

Mar 8, 2012, 1:15:44 AM
to tesser...@googlegroups.com
Sure, you can just post a feature request in the Issues section of the
project's web page.

Warm regards,
Dmitri Silaev
www.CustomOCR.com

TP

Mar 8, 2012, 3:28:47 AM
to tesser...@googlegroups.com
On Wed, Mar 7, 2012 at 9:33 AM, Dmitri Silaev <daemo...@gmail.com> wrote:
> No, at this time it is not possible to do via command line.

As a matter of fact with the SVN version of tesseract at least (and
probably earlier versions), it is possible to tell tesseract to OCR a
particular page in a multipage tiff file via the command line. For
example, run:

tesseract.exe example_multipage.tif page4 config-page.txt

where the config file, config-page.txt, only has the following in it:

tessedit_page_number 3

You'll see:

Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 4 of 5

and page4.txt will then contain the OCRed text of the fourth "page" in
example_multipage.tif.

So just dynamically create "config-page.txt" with the page # you want to OCR.

Dmitri Silaev

Mar 8, 2012, 3:47:22 AM
to tesser...@googlegroups.com
My bad, I had missed that feature. "tessedit_page_number" indeed
lets you specify a TIFF page. I can only add a bit of clarification:
the page number is zero-based. The value of -1 (default) instructs
Tesseract to process all TIFF pages.
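
For anyone who wants to script this, here is a minimal Python sketch of generating the config file on the fly. The helper names write_page_config and build_command are made up for illustration, and the executable is assumed to be on the PATH as "tesseract". The only Tesseract-specific details it relies on are the ones given in this thread: the tessedit_page_number parameter, its zero-based numbering, and passing the config file as the last command line argument.

```python
import os
import tempfile


def write_page_config(config_path, page):
    """Write a config file selecting one page; `page` is 1-based (hypothetical helper)."""
    if page < 1:
        raise ValueError("page must be 1 or greater")
    # tessedit_page_number is zero-based, so the fourth page is written as 3.
    # newline="\n" keeps LF-only line endings, even on Windows.
    with open(config_path, "w", newline="\n") as f:
        f.write("tessedit_page_number %d\n" % (page - 1))
    return config_path


def build_command(image, output_base, config_path, exe="tesseract"):
    """Return the argument list: <exe> <image> <outputbase> <configfile>."""
    return [exe, image, output_base, config_path]


if __name__ == "__main__":
    config = write_page_config(
        os.path.join(tempfile.gettempdir(), "config-page.txt"), 4)
    cmd = build_command("example_multipage.tif", "page4", config)
    print(" ".join(cmd))
    # To actually run the OCR step: subprocess.run(cmd, check=True)
```

The command is only printed here, so the sketch can be tried without Tesseract installed; handing the list to subprocess.run would perform the OCR.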

Warm regards,
Dmitri Silaev
www.CustomOCR.com

Paul

Mar 8, 2012, 11:32:55 AM
to tesseract-ocr
Thank you, gents, that will work for me; I will give it a try. Is there
somewhere I can find some documentation on things like config-page.txt,
etc.? I have Googled it but am not finding a whole lot of info.

Best Regards

Paul

Dmitri Silaev

Mar 8, 2012, 2:11:24 PM
to tesser...@googlegroups.com
TP used "config-page.txt" for the name of the config file, but you can
name it any way you like. A config file is a file of control
parameters used for tweaking Tesseract. You can find some, e.g., in the
"tessdata/configs" directory, but you can also create your own.

As for the existence and effects of specific parameters, currently I don't
know of any way to find them out other than digging into Tesseract's code.
There is also some ancient documentation at
http://tesseract-ocr.repairfaq.org/tess_variables_all.html, but you
need to check whether a given parameter is still valid, and the
descriptions are often obscure.

Warm regards,
Dmitri Silaev
www.CustomOCR.com

TP

Mar 8, 2012, 3:06:30 PM
to tesser...@googlegroups.com
On Thu, Mar 8, 2012 at 11:11 AM, Dmitri Silaev <daemo...@gmail.com> wrote:
> As for the existence and effects of specific parameters, currently I don't
> know of any way to find them out other than digging into Tesseract's code.

If you are on Windows, I wrote this section on TCC/LE [1] that talks
about how you can use its "ffind" command to display all (most?
some?) configuration parameters defined in the tesseract-ocr source
files (which is not the same thing as those parameters actually being
*used* to do anything). It also mentions how you can do something
similar with Visual Studio 2008, or the bash shell on Linux.

You can also put the following in a config file called, for example,
config-write-params.txt:

tessedit_write_params_to_file currentparams.txt
tessdata_manager_debug_level 1

(NOTE: this file *MUST* use Unix-style line endings, that is, only a
Linefeed character, *NOT* the Windows convention: Carriage Return,
Linefeed).
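
If the config file was created in a Windows editor, one way to meet the line-ending requirement is to normalize the file afterwards. The following Python sketch is just one possible approach; the helper name to_unix_line_endings is hypothetical and not part of Tesseract.

```python
import os
import tempfile


def to_unix_line_endings(path):
    """Rewrite `path` in place with LF-only line endings (hypothetical helper).

    Returns True if the file was changed, False if it was already LF-only.
    """
    with open(path, "rb") as f:
        data = f.read()
    # Convert CRLF pairs first, then any stray lone CR.
    fixed = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    if fixed == data:
        return False
    with open(path, "wb") as f:
        f.write(fixed)
    return True


if __name__ == "__main__":
    # Demonstrate on a throwaway file that uses Windows line endings.
    demo = os.path.join(tempfile.mkdtemp(), "config-write-params.txt")
    with open(demo, "wb") as f:
        f.write(b"tessedit_write_params_to_file currentparams.txt\r\n"
                b"tessdata_manager_debug_level 1\r\n")
    print("changed:", to_unix_line_endings(demo))
```

Working on bytes rather than text avoids Python's own newline translation, so the result is the same on Windows and Linux.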

Then do:

tesseract.exe eurotext.tif eurotext config-write-params.txt

You'll see:

Wrote parameters to currentparams.txt
Loading Tesseract/Cube with tessedit_ocr_engine_mode 0
Loaded unicharset
Loaded ambigs
Loaded language 'eng' as main language


Tesseract Open Source OCR Engine v3.02 with Leptonica

And looking at the newly created currentparams.txt you'll see something like:

textord_debug_tabfind 0
textord_debug_bugs 0
textord_testregion_left -1
...
textord_noise_hfract 0.015625
textord_noise_rowratio 6
textord_blshift_maxshift 0
textord_blshift_xfraction 9.99

(over 660 lines in my case). This file unfortunately is missing the
Description string that is listed in the source files, but otherwise
it gives a pretty good idea of what can be set. Searching the source
for a particular param will then provide insight into what it does.
For example with TCC/LE, try searching the source for
"tessedit_write_params_to_file":

ffind /s/v/t"tessedit_write_params_to_file" *.cpp

which gives:

---- TesseractSVN\ccmain\tessedit.cpp
if (((STRING &)tessedit_write_params_to_file).length() > 0) {
FILE *params_file = fopen(tessedit_write_params_to_file.string(), "wb");
tessedit_write_params_to_file.string());
tessedit_write_params_to_file.string());

---- TesseractSVN\ccmain\tesseractclass.cpp
STRING_MEMBER(tessedit_write_params_to_file, "",

5 lines in 2 files
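
For those without TCC/LE, roughly the same search can be sketched in Python. This is not a reimplementation of ffind, only a plain substring search over a source tree; the function name find_param is made up, and "TesseractSVN" is simply the checkout directory name that appears in the output above.

```python
import os


def find_param(root, name, extensions=(".cpp", ".h")):
    """Yield (path, line_number, line) for each source line mentioning `name`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            # errors="replace" so an odd byte in a source file does not abort the scan.
            with open(path, encoding="utf-8", errors="replace") as f:
                for number, line in enumerate(f, start=1):
                    if name in line:
                        yield path, number, line.strip()


if __name__ == "__main__":
    hits = find_param("TesseractSVN", "tessedit_write_params_to_file")
    for path, number, line in hits:
        print("%s:%d: %s" % (path, number, line))
```

Unlike the ffind call above, this also looks in header files by default; pass extensions=(".cpp",) to restrict it to the same set of files.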

Opening ccmain\tessedit.cpp, we then see the following:

  if (((STRING &)tessedit_write_params_to_file).length() > 0) {
    FILE *params_file = fopen(tessedit_write_params_to_file.string(), "wb");
    if (params_file != NULL) {
      ParamUtils::PrintParams(params_file, this->params());
      fclose(params_file);
      if (tessdata_manager_debug_level > 0) {
        tprintf("Wrote parameters to %s\n",
                tessedit_write_params_to_file.string());
      }
    } else {
      tprintf("Failed to open %s for writing params.\n",
              tessedit_write_params_to_file.string());
    }
  }

and ccmain\tesseractclass.h shows:

STRING_VAR_H(tessedit_write_params_to_file, "",
"Write all parameters to the given file.");
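
Since currentparams.txt lacks the description strings, one could try pulling them out of declarations like the one above with a regular expression. The Python sketch below is a rough attempt under several assumptions: it only handles declarations shaped like the example shown (kind_VAR_H(name, default, "description");), the name parse_var_h is made up, and unusual formatting or macro definitions in the real headers may well need extra handling.

```python
import re

# Matches e.g.  STRING_VAR_H(name, "", "Description.");
# The description may be split into several adjacent string literals.
VAR_H = re.compile(
    r'\b(\w+)_VAR_H\(\s*(\w+)\s*,\s*([^;]*?)\s*,\s*'
    r'((?:"(?:[^"\\]|\\.)*"\s*)+)\)\s*;',
    re.DOTALL,
)
LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')


def parse_var_h(source_text):
    """Return a list of (kind, name, default, description) tuples."""
    results = []
    for kind, name, default, literals in VAR_H.findall(source_text):
        # Join adjacent string literals, as the C++ compiler would.
        description = "".join(LITERAL.findall(literals))
        results.append((kind, name, default, description))
    return results


if __name__ == "__main__":
    sample = ('STRING_VAR_H(tessedit_write_params_to_file, "",\n'
              '             "Write all parameters to the given file.");\n')
    for kind, name, default, description in parse_var_h(sample):
        print("%s (%s, default %s): %s" % (name, kind, default, description))
```

Running it over the text of each header file and merging the results with currentparams.txt would give a list of parameter names, current values, and descriptions.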

[1] http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/tools.html#id2

TP

Mar 8, 2012, 3:21:04 PM
to tesser...@googlegroups.com
On Thu, Mar 8, 2012 at 12:06 PM, TP <win...@gmail.com> wrote:
> you can use its "ffind" command to display all (most?
> some?) configuration parameters defined in the tesseract-ocr source
> files

Addendum: The following TCC/LE ffind command gives a more "complete"
listing of parameters than the one given in [1]:

ffind /s/v/c/t"_VAR_H" *.h | list/s

(about 600). I haven't closely looked at this to figure out what the
differences are. I know from looking at the preprocessor output for
string parameters that the _VAR_H macro creates the member but doesn't
initialize it (despite the presence of an initial value arg). It's the
corresponding _MEMBER macro that actually inits the param. Maybe
params whose initial state is zero don't need to be initialized any
further (although that seems like sloppy programming to me)?

[1] http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/tools.html#id2

Paul

Mar 9, 2012, 1:33:56 PM
to tesseract-ocr
Thank you TP and Dmitri for the additional info, very helpful.

One of the things I find most frustrating about some open source
projects is the lack of documentation that would help people like me
get a better understanding; ultimately it is a barrier to wider
adoption and interest. This is not really a criticism but an
observation. Just a little fundamental info about how to get started
with Tesseract and get the best out of it (and by that I mean for users,
not hard-core developers) could go a long way toward getting additional
people using it and contributing. I appreciate that guys like you make
the effort; thanks once again.

Paul
