A vcproj file for building the traineddata files for 3.0

SteveP

Sep 2, 2009, 8:32:39 PM
to tesseract-ocr
I uploaded a vcproj file named combine.vcproj to the Files area, for
Windows users of tesseract 3.0. Ray Smith said the code to build the
traineddata files was already there; this project builds that code.
The vcproj goes in the top folder, where tesseract.vcproj is.

SteveP

Sep 3, 2009, 5:26:44 PM
to tesseract-ocr
I was asked for more details. I will give what I can for those
interested, but I think there may still be some other information we
need from Ray Smith.

First, copy the file combine.vcproj to the same folder that has
tesseract.sln in it.
(If you like, you may make a copy of tesseract.sln as a backup, since
the steps below update that file.)
To update that file, add combine.vcproj to the tesseract solution as
follows. With this solution open in Visual Studio 2008,
right-click on "Solution 'tesseract'" in the Solution Explorer pane,
select Add,
select Existing Project...

In the dialog box that comes up, navigate to the folder that has
combine.vcproj in it,
select this vcproj file and click on Open.
If this worked, you should see "combine" as a new project in the
solution.

You may right-click on this new project and select Build to build it
and produce combine.exe. To run, this exe needs its working directory
set to the folder that contains the tessdata folder. The easiest way
to do this is to copy the exe to that folder if it is not already
there, but if your exe file is in the bin.dbg folder, you can
alternatively follow Ray Smith's suggestion to copy the tessdata
folder and the dlls to bin.dbg.

To run combine.exe after preparing the necessary files (see below),
follow the Usage in the source code:
"Usage: %s language_data_path_prefix (e.g. tessdata/eng.)",
which means the following command for English:
combine tessdata/eng.
Here the final period is part of the command since the input files for
combine include that period. It is part of language_data_path_prefix.

So what are the necessary files, and which files are optional?
To quote the source code, the file paths are a concatenation (as in
strcat) of language_data_path_prefix and the suffixes shown below.
Note that some of the names are different from 2.04. Except for the
unicharset file, all of the files are optional as far as combine is
concerned. Thus for English, tessdata/eng.unicharset is a required
file, and files such as tessdata/eng.inttemp would also come from
training, just as they did before tesseract 3.0.

As of 9-3-2009, details or instructions still appear to be missing
for some of the new files, such as punc-dawg. (FYI, Ray Smith is the
expert; I am just summarizing what I see in the source code.)

// Suffixes of input files (most optional) used to build traineddata file.
static const char kLangConfigFileSuffix[] = "config";
static const char kUnicharsetFileSuffix[] = "unicharset";
static const char kAmbigsFileSuffix[] = "unicharambigs";
static const char kBuiltInTemplatesFileSuffix[] = "inttemp";
static const char kBuiltInCutoffsFileSuffix[] = "pffmtable";
static const char kNormProtoFileSuffix[] = "normproto";
static const char kPuncDawgFileSuffix[] = "punc-dawg";
static const char kSystemDawgFileSuffix[] = "word-dawg";
static const char kNumberDawgFileSuffix[] = "number-dawg";
static const char kFreqDawgFileSuffix[] = "freq-dawg";
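
As a rough illustration of the concatenation described above, here is a minimal sketch that lists the file paths combine would look for, given a prefix. The helper names ExpectedInputPaths and RequiredInputPath are hypothetical and are not part of the tesseract source; only the suffix strings are taken from the list above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Suffixes copied from the list quoted above.
static const char* const kInputSuffixes[] = {
    "config",    "unicharset", "unicharambigs", "inttemp",     "pffmtable",
    "normproto", "punc-dawg",  "word-dawg",     "number-dawg", "freq-dawg"};

// Hypothetical helper (not part of tesseract): returns the file paths formed
// by plain concatenation of the prefix and each suffix, as in strcat.
std::vector<std::string> ExpectedInputPaths(const std::string& prefix) {
  assert(!prefix.empty());
  std::vector<std::string> paths;
  for (const char* suffix : kInputSuffixes) {
    paths.push_back(prefix + suffix);
  }
  return paths;
}

// Hypothetical helper: the one file described above as required.
std::string RequiredInputPath(const std::string& prefix) {
  return prefix + "unicharset";
}
```

With the prefix "tessdata/eng." the required path comes out as tessdata/eng.unicharset. Leaving off the final period would give tessdata/engunicharset instead, which is why the period has to be part of the command-line argument.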

SteveP

Sep 3, 2009, 8:39:39 PM
to tesseract-ocr
This vcproj does not build in Release, so only try building in Debug.

74yrs old

Sep 4, 2009, 4:34:04 AM
to tesser...@googlegroups.com, SteveP
SteveP,
I appreciate the detailed instructions on how to generate combine.exe; thank you.
I followed your guidance:

> "right-click on "Solution 'tesseract'" in the Solution Explorer pane,
> select Add,
> select Existing Project...
>
> In the dialog box that comes up, navigate to the folder that has
> combine.vcproj in it,
> select this vcproj file and click on Open.
> If this worked, you should see "combine" as a new project in the
> solution."
After recompiling in VC++ 2008 (Batch Build, Select All, Clean, then Rebuild All), the result was:

cntraining - 0 error(s), 11 warning(s)
========== Rebuild All: 35 succeeded, 1 failed, 0 skipped ==========
Note: I do not understand "1 failed". Which one failed?

In bin.dbg, 7 exe files were generated, including combine.exe.
In the main folder, 6 exe files (release) were generated. I copied combine.exe from bin.dbg
into the main folder, so it now holds 7 exe files in total (6 release plus 1 debug).

Tested tesseract photest.tif phtest logfile: phtest.txt was reproduced correctly from the tif file.

Regarding combine.exe, as per your guidance:
">To run this exe, it needs to run with the working
> directory set to the folder that has the tessdata folder in it.  The
> easiest way to do this is to copy the exe to that folder"
Can the "combine.exe" found in bin.dbg be copied into the "tessdata" folder?
I ask because I do not know which DLL files are to be copied into bin.dbg.
I presume they are the six Leptonica DLLs (Jpeg62.dll, libimage.dll, librle3.dll, leptonlib.dll,
libpng13.dll, libtiff3.dll), and that these plus the "tessdata" folder have to be copied into bin.dbg.

Further, I presume that the command line to run combine.exe (example for the English data files)
 should be as follows:
 " combine  tessdata/eng.freq-dawg, tessdata/eng.user-words, tessdata/eng.word-dawg,
tessdata/eng.inttemp, tessdata/eng.normproto, tessdata/eng.pffmtable, tessdata/eng.unicharset,
tessdata/eng.DangAmbigs (output)eng.traineddata "

Kindly confirm my presumptions above.

With Regards,
-sriranga(76yrs old)


74yrs old

Sep 4, 2009, 11:37:16 AM
to Pohorsky, Steve, tesser...@googlegroups.com
Dear Steve,
I am extremely thankful to you for your valuable clarification. But I am still confused about:
"command line is just what is between quotes here: “combine  tessdata/eng"

I presume that all data files, without a prefix like eng., should be generated as usual, as done in tesseract 2.04, and that a single command line is then run for tesseract 3.0, as follows: " combine tessdata/eng ".

Kindly excuse me for troubling you, since I am neither a programmer nor a developer.

With Warmest Regards,
-sriranga(76yrsold)




On Fri, Sep 4, 2009 at 8:22 PM, Pohorsky, Steve <SPoh...@sjm.com> wrote:

>>see below.  Will also post parts to tesseract group.

 


After recompiling in VC++ 2008 (Batch Build, Select All, Clean, then Rebuild All), the result was:

cntraining - 0 error(s), 11 warning(s)
========== Rebuild All: 35 succeeded, 1 failed, 0 skipped ==========
Note: I do not understand "1 failed". Which one failed?

>> Click in the Output pane and do a Find for “error”.




Regarding combine.exe, as per your guidance:
">To run this exe, it needs to run with the working
> directory set to the folder that has the tessdata folder in it.  The
> easiest way to do this is to copy the exe to that folder"
Can the "combine.exe" found in bin.dbg be copied into the "tessdata" folder?

>> Not into tessdata, but into the folder above it, the one that contains the tessdata folder.


I ask because I do not know which DLL files are to be copied into bin.dbg.
I presume they are the six Leptonica DLLs (Jpeg62.dll, libimage.dll, librle3.dll, leptonlib.dll,
libpng13.dll, libtiff3.dll), and that these plus the "tessdata" folder have to be copied into bin.dbg.

>> I was referring to what Ray S wrote in the README on the wiki site: “all DLLs except tessdll”.


Further, I presume that the command line to run combine.exe (example for the English data files)
 should be as follows:
 " combine  tessdata/eng.freq-dawg, tessdata/eng.user-words, tessdata/eng.word-dawg,
tessdata/eng.inttemp, tessdata/eng.normproto, tessdata/eng.pffmtable, tessdata/eng.unicharset,
tessdata/eng.DangAmbigs (output)eng.traineddata "

>> No, the command line is just what is between quotes here: “combine  tessdata/eng.”

>> All of the suffixes are in the source code; that is why they are not specified on the command line.

>> Note that “DangAmbigs” is the old name. For 3.0, the tesseract source code for combine (which I did not write) uses “unicharambigs”.

 

SteveP

Sep 4, 2009, 6:52:15 PM
to tesseract-ocr
Here are a few more details from an email exchange for those that are
interested:

command line is just what is between quotes here: "combine tessdata/eng."

I forgot to mention that the source code for combine, which I have not
modified, expects each input file name to start with the "eng."
prefix. So people like you and me have to copy the files into
tessdata under the new names, for example:

copy normproto tessdata\eng.normproto
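
For anyone who wants to generate these copy commands for several files, here is a minimal sketch under stated assumptions. The helpers SuffixFor and CopyCommand are hypothetical and not part of tesseract. The only rename applied is DangAmbigs to unicharambigs, because that is the only one named in this thread; whether the contents of that file also need changes for 3.0 is not covered here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of tesseract): maps a training output file
// name to the suffix that combine expects. Only the DangAmbigs rename is
// mentioned in this thread; every other name is passed through unchanged.
std::string SuffixFor(const std::string& file_name) {
  if (file_name == "DangAmbigs") return "unicharambigs";
  return file_name;
}

// Hypothetical helper: builds the Windows copy command that places file_name
// in tessdata with the language prefix, e.g. tessdata\eng.<suffix>.
std::string CopyCommand(const std::string& lang,
                        const std::string& file_name) {
  assert(!lang.empty() && !file_name.empty());
  return "copy " + file_name + " tessdata\\" + lang + "." +
         SuffixFor(file_name);
}
```

Calling CopyCommand("eng", "normproto") reproduces the command shown above, and the same call with "DangAmbigs" targets tessdata\eng.unicharambigs.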

Note that the name "combine" might not be permanent. Ray Smith might
produce something else in the official distribution when he gets to
wrapping up the training implementation.


74yrs old

Sep 5, 2009, 6:17:43 AM
to tesser...@googlegroups.com
Steve,
Congratulations!!
I tested combine.exe in tesseract 3.0 and successfully generated <lang>.traineddata.
I am very thankful to you.
-sriranga(76yrsold)