Warm regards,
Dmitri Silaev
www.CustomOCR.com
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-oc...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
As a matter of fact with the SVN version of tesseract at least (and
probably earlier versions), it is possible to tell tesseract to OCR a
particular page in a multipage tiff file via the command line. For
example, run:
tesseract.exe example_multipage.tif page4 config-page.txt
where the config file, config-page.txt, contains only the following line
(the page number is zero-based, so 3 selects the fourth page):
tessedit_page_number 3
You'll see:
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 4 of 5
and page4.txt will then contain the OCRed text of the fourth "page" in
example_multipage.tif.
So just dynamically create "config-page.txt" with the page # you want to OCR.
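As a rough illustration of creating that config file on the fly, here is a minimal Python sketch. The helper names write_page_config and build_command are hypothetical, the executable name "tesseract" is an assumption about what is on your PATH, and the zero-based index is inferred from the example above (a value of 3 produced "Page 4 of 5").

```python
def write_page_config(config_path, page_index):
    """Write a one-line Tesseract config selecting a zero-based page index."""
    if page_index < 0:
        raise ValueError("page_index must be zero or greater")
    # newline="\n" forces a bare linefeed even when running on Windows.
    with open(config_path, "w", newline="\n") as handle:
        handle.write("tessedit_page_number %d\n" % page_index)
    return config_path


def build_command(image, output_base, config_path, exe="tesseract"):
    """Return the argument list for one run; nothing is executed here."""
    return [exe, image, output_base, config_path]
```

The list returned by build_command can be handed to subprocess.run(cmd, check=True) once per page you want to OCR.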
Warm regards,
Dmitri Silaev
www.CustomOCR.com
As for the existence and effects of specific parameters, currently I don't
know any other way to find out than digging into Tesseract's code. There's
also some ancient documentation at
http://tesseract-ocr.repairfaq.org/tess_variables_all.html, but one
needs to check whether a given parameter is still valid, and the
descriptions are often obscure.
Warm regards,
Dmitri Silaev
www.CustomOCR.com
If you are on Windows, I wrote this section on TCC/LE [1] that talks
about how you can use its "ffind" command to display all (most?
some?) configuration parameters defined in the tesseract-ocr source
files (which is not the same thing as those parameters actually being
*used* to do anything). It also mentions how you can do something
similar with Visual Studio 2008, or the bash shell on Linux.
You can also put the following in a config file called, for example,
config-write-params.txt:
tessedit_write_params_to_file currentparams.txt
tessdata_manager_debug_level 1
(NOTE: this file *MUST* use Unix-style line endings, that is, only a
Linefeed character, *NOT* the Windows convention: Carriage Return,
Linefeed).
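If your editor saved the config file with Windows line endings, a small script can convert it. The following is only a sketch: the function name to_unix_line_endings is made up for illustration, and it simply replaces every CR LF pair with a single LF.

```python
def to_unix_line_endings(path):
    """Rewrite the file in place with LF-only endings; return True if changed."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Replace each Carriage Return + Linefeed pair with a bare Linefeed.
    fixed = data.replace(b"\r\n", b"\n")
    if fixed != data:
        with open(path, "wb") as handle:
            handle.write(fixed)
        return True
    return False
```

Calling to_unix_line_endings("config-write-params.txt") before running tesseract leaves an already-correct file untouched.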
Then do:
tesseract.exe eurotext.tif eurotext config-write-params.txt
You'll see:
Wrote parameters to currentparams.txt
Loading Tesseract/Cube with tessedit_ocr_engine_mode 0
Loaded unicharset
Loaded ambigs
Loaded language 'eng' as main language
Tesseract Open Source OCR Engine v3.02 with Leptonica
And looking at the newly created currentparams.txt you'll see something like:
textord_debug_tabfind 0
textord_debug_bugs 0
textord_testregion_left -1
...
textord_noise_hfract 0.015625
textord_noise_rowratio 6
textord_blshift_maxshift 0
textord_blshift_xfraction 9.99
(over 660 lines in my case). This file unfortunately is missing the
Description string that is listed in the source files, but otherwise
it gives a pretty good idea of what can be set. Searching the source
for a particular param will then provide insight into what it does.
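To poke around a dump like this without scrolling through hundreds of lines, something like the following Python sketch may help. It assumes each line is a parameter name followed by whitespace and a value (as in the excerpt above), and that string parameters with an empty value show up as a name alone; parse_params is a hypothetical helper, not part of Tesseract.

```python
def parse_params(text):
    """Map each parameter name to its value string (empty if none given)."""
    params = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        params[name] = value
    return params
```

For example, parse_params(open("currentparams.txt").read()) gives a dictionary you can filter by prefix, such as every name starting with "textord_".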
For example with TCC/LE, try searching the source for
"tessedit_write_params_to_file":
ffind /s/v/t"tessedit_write_params_to_file" *.cpp
which gives:
---- TesseractSVN\ccmain\tessedit.cpp
if (((STRING &)tessedit_write_params_to_file).length() > 0) {
FILE *params_file = fopen(tessedit_write_params_to_file.string(), "wb");
tessedit_write_params_to_file.string());
tessedit_write_params_to_file.string());
---- TesseractSVN\ccmain\tesseractclass.cpp
STRING_MEMBER(tessedit_write_params_to_file, "",
5 lines in 2 files
Opening ccmain\tessedit.cpp, we then see the following:
  if (((STRING &)tessedit_write_params_to_file).length() > 0) {
    FILE *params_file = fopen(tessedit_write_params_to_file.string(), "wb");
    if (params_file != NULL) {
      ParamUtils::PrintParams(params_file, this->params());
      fclose(params_file);
      if (tessdata_manager_debug_level > 0) {
        tprintf("Wrote parameters to %s\n",
                tessedit_write_params_to_file.string());
      }
    } else {
      tprintf("Failed to open %s for writing params.\n",
              tessedit_write_params_to_file.string());
    }
  }
and ccmain\tesseractclass.h shows:
  STRING_VAR_H(tessedit_write_params_to_file, "",
               "Write all parameters to the given file.");
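If you don't have TCC/LE handy, the same kind of search can be approximated in a few lines of Python. This is just a sketch: find_in_sources is a hypothetical helper, and the choice to look at both .cpp and .h files is my assumption, not something ffind does by default.

```python
import os


def find_in_sources(root, needle, extensions=(".cpp", ".h")):
    """Yield (path, line_number, line) for each source line containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.lower().endswith(extensions):
                continue  # only look at C++ sources and headers
            path = os.path.join(dirpath, filename)
            # errors="replace" keeps odd bytes from aborting the search.
            with open(path, "r", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, number, line.rstrip("\n")
```

Looping over find_in_sources("TesseractSVN", "tessedit_write_params_to_file") and printing each tuple gives output roughly comparable to the ffind listing.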
[1] http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/tools.html#id2
Addendum: The following TCC/LE ffind command gives a more "complete"
listing of parameters than the one given in [1]:
ffind /s/v/c/t"_VAR_H" *.h | list/s
(about 600). I haven't looked closely at this to figure out what the
differences are. I know from looking at the preprocessor output for
string parameters that the _VAR_H macro creates the member but doesn't
initialize it (despite the presence of an initial value arg). It's the
corresponding _MEMBER macro that actually inits the param. Maybe
params whose initial state is zero don't need to be initialized any
further (although that seems like sloppy programming to me)?
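Since the _VAR_H declarations carry the description strings, one could also pull the name, initial value and description out of header text with a regular expression. The sketch below is a guess at a workable pattern, not a real parser: it assumes the description is a single quoted string literal (declarations that concatenate several literals will be missed), and extract_declarations is a hypothetical name.

```python
import re

# Matches e.g. STRING_VAR_H(name, "", "description") within one statement.
VAR_H_PATTERN = re.compile(
    r'(\w+)_VAR_H\(\s*(\w+)\s*,\s*([^;]*?)\s*,\s*"((?:[^"\\]|\\.)*)"\s*\)',
    re.DOTALL,
)


def extract_declarations(header_text):
    """Return a list of dicts describing each matched _VAR_H declaration."""
    return [
        {
            "kind": match.group(1),
            "name": match.group(2),
            "default": match.group(3),
            "description": match.group(4),
        }
        for match in VAR_H_PATTERN.finditer(header_text)
    ]
```

Running this over the contents of each .h file would give a list that pairs parameter names with their descriptions, which is the piece missing from the parameter dump.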