Enhancing the tesseract 3.01 Visual Studio 2008 distribution


Tom Powers

Nov 5, 2011, 4:58:26 AM
to tesseract-dev
Zdenko Podobný said in http://groups.google.com/group/tesseract-dev/browse_thread/thread/d5fdbe158c897089
that the new tesseract 3.01 release is missing a "one library
solution" and a DLL on Windows. He asked for help from the community.

I am the "official" maintainer of the Leptonica Windows releases, so I
thought I might give this a shot.


*Caveats*

1) I have only really been using tesseract for a day or so (and I
don't actually plan on using it very much in the future).

2) I don't consider myself a Visual Studio 2008/2010 expert by any
means. I have reacquainted myself with C while handling the last 5 or
so Leptonica Win32 releases, but my C++ is pretty rusty.

3) My testing of tesseract is pretty rudimentary. I just do:

tesseract.exe eurotext.tif eurotext

and make sure that tesseract doesn't crash and I get "something" in
eurotext.txt.
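
A minimal sketch of that smoke test as a script, assuming tesseract.exe and eurotext.tif are reachable from the current directory. The function name smoke_test is hypothetical; it only checks "did not crash" and "wrote something", exactly as described above.

```python
import subprocess
from pathlib import Path


def smoke_test(command, output_file):
    """Return True if command exits with status 0 and leaves a non-empty output_file."""
    out = Path(output_file)
    if out.exists():
        out.unlink()  # do not let a stale file from an earlier run count
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError:
        return False  # executable missing or not runnable
    if result.returncode != 0:
        return False
    return out.is_file() and out.stat().st_size > 0
```

Usage would be something like smoke_test(["tesseract.exe", "eurotext.tif", "eurotext"], "eurotext.txt"). It says nothing about whether the recognized text is correct.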


*Research Phase*

By flailing around I was able to demonstrate that it is possible to
build a DLL version of tesseract3.01 using Visual Studio 2008 on
Windows 7. This involved having to manually edit various vcproj files
in order to get VS2008 to do what I wanted it to do. The resulting
VS2008 solution wasn't very pretty but the .exe did run correctly.

This post documents my observations, asks some questions, and
solicits opinions on my tentative plans.

First of all the organization of the existing VS2008 files is
somewhat difficult to grasp. There are 20 "projects" whose
interactions are non-obvious. Here's my understanding of what needs to
be done to create libtesseract.

libtesseract would consist of the following 12 existing "sub-
library" projects:

ccmain
ccstruct
ccutil
classify
cube
cutil
dict
image
neural_networks
textord
viewer
wordrec

plus the following files:

baseapi.h
pageiterator.h
resultiterator.h

baseapi.cpp
pageiterator.cpp
resultiterator.cpp

(I'm considering getting rid of all the sub-library projects. See
my questions at the end of this post).

The following libraries would also be created:

libtesseract_tessopt
libtesseract_training

The following executables would be created (with the required
libraries listed):

tesseract.exe
   libtesseract

cntraining.exe
   libtesseract
   libtesseract_tessopt
   libtesseract_training

combine_tessdata.exe
   libtesseract (just ccutil is used)

mftraining.exe
   libtesseract
   libtesseract_tessopt
   libtesseract_training

unicharset_extractor
   libtesseract (just ccutil is used)
   libtesseract_tessopt

wordlist2dawg
   libtesseract


*Some observations*

From the Configuration dropdown in the toolbar I only see:

Debug
Release.dynamic
Release.static

However if you bring up the Tesseract Property Pages only the
following show up in the "Configuration:" dropdown:

Debug (links with libleptd.dll)
Release (links with liblept-static.lib)
Release.dynamic (links with liblept.dll)

This seems to indicate to me that something weird is going on with
the Build Configuration settings in the various .vcproj files (or
perhaps the tesseract.sln file). Maybe this is the result of incorrect
manual editing? I've never seen a VS2008 solution that displayed this
behavior.

There is no configuration for building a Debug version that
statically links with Leptonica. (And as mentioned earlier there are
no configurations for building DLL versions of libtesseract).

---

When building libtesseract, baseapi.cpp needs to be compiled with
TESSDLL_EXPORTS defined to create the DLL version. In addition the
ccutil project needs to define CCUTIL_EXPORTS for the DLL version
(otherwise you get a bunch of undefined externals for the STRING
class).

When linking with the DLL, TESSDLL_IMPORTS and CCUTIL_IMPORTS need
to be defined when including baseapi.h.


*Suggested Build Configurations*

EVERYTHING will be built with the Multi Threaded DLL C runtime (/MD
or /MDd).

The static version of libtesseract301 will be linked against the
STATIC version of liblept168. This will make using the static version
of libtesseract more complicated, since you also have to explicitly
link to:

zlib125-static-mtdll.lib
libpng143-static-mtdll.lib
libjpeg8c-static-mtdll.lib
libtiff394-static-mtdll.lib
giflib416-static-mtdll.lib
liblept168-static-mtdll.lib

(I plan to use the leptonica_versionnumbers.vsprops file to make it
a bit easier to isolate tesseract.sln from the various version
numbers)

The DLL version of libtesseract needs to be linked to the DLL
version of Leptonica (to give people access to the underlying
Leptonica functionality). I will define "_USRDLL" so the library can
determine if it is being built as a DLL.

Note: this means that direct access to functions from the
underlying image libraries (not necessarily a good idea) would only be
possible with the static version of libtesseract.

libtesseract301 will be built as a static library and a DLL
(release and debug versions). The following Build Configurations will
be created (mirroring what I do for Leptonica):

DLL Debug
DLL Release
LIB Debug
LIB Release

The only valid missing configuration is a static libtesseract
linked against the Leptonica DLL. I'm assuming this is not worth
bothering with?

I will probably get rid of the bin.debug & bin.release output
directories and just use the standard $(OutDir)\$(ProjectName).dll
style (where OutDir is the same as ConfigurationName).

I will probably switch over to using the compiler settings I use
when building Leptonica
(http://tpgit.github.com/UnOfficialLeptDocs/vs2008/building-liblept.html),
except of course for the Leptonica-specific preprocessor definitions.

I am not planning on updating the existing VS2010 solution. The
same procedure discussed at http://tpgit.github.com/UnOfficialLeptDocs/vs2008/vs2010-notes.html
could be used to convert my tesseract VS2008 solution.

I'm strongly considering changing the build directory structure to
use include and lib directories in the same folder as tesseract-3.01.
This matches what I suggest when building Leptonica
(http://tpgit.github.com/UnOfficialLeptDocs/vs2008/directory-organization.html).
I would also create Post-Build Event commands to
automatically copy the libraries to this directory. That makes it
easier to build leptonica and presumably other apps that need to link
with libtesseract. The directory structure would be:

BuildFolder/
   include/
      leptonica/
      tesseract/
   lib/
   leptonica-1.68/
   tesseract-3.01/

I could supply a tesseract-ocr-3.01-win32-lib-include-dirs.zip
similar to my leptonica-1.68-win32-lib-include-dirs.zip that would
also contain the tesseract libraries and include files. (See my
question on the header files later on). However, I am NOT volunteering
to maintain future tesseract-ocr Win32 releases.


*Suggested library names*

static libraries:

libtesseract301-static.lib
libtesseract301-static-debug.lib

DLLs:

libtesseract301.lib (import library)
libtesseract301.dll
libtesseract301d.lib (import library)
libtesseract301d.dll
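
A small sketch of the naming scheme above, written as a lookup from build configuration to output file names. The function name library_files and its parameters are hypothetical; the file names themselves are the ones listed above.

```python
def library_files(kind, debug, version="301"):
    """Return the file names produced for one build configuration.

    kind is "LIB" (static) or "DLL"; debug selects the Debug flavor.
    """
    base = "libtesseract" + version
    if kind == "LIB":
        return [base + ("-static-debug" if debug else "-static") + ".lib"]
    if kind == "DLL":
        stem = base + ("d" if debug else "")
        return [stem + ".lib", stem + ".dll"]  # import library, then the DLL
    raise ValueError("kind must be 'LIB' or 'DLL'")
```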


*Questions*

Are the 12 sub-library static libraries still necessary? The
solution would be much less complicated (especially when converting to
VS2010) if you just have one project that builds the entire library
using all the source files and headers from the existing projects. I
tried to do this by copying all the files to a new project but got an
error that indicated that some file name collisions occurred. (I'll
have to write a short python script to figure out the duplicate
names).
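
A minimal sketch of such a duplicate-name check, assuming the sub-library directories sit directly under one source root. The function name find_duplicate_names is hypothetical. Names are compared lower-cased because Windows file systems are normally case-insensitive.

```python
import os
from collections import defaultdict


def find_duplicate_names(root, subdirs, extensions=(".h", ".cpp")):
    """Map each lower-cased file name to the sub-directories that contain it,
    keeping only names that occur in more than one sub-directory."""
    seen = defaultdict(list)
    for sub in subdirs:
        folder = os.path.join(root, sub)
        if not os.path.isdir(folder):
            continue  # tolerate a sub-library that is not present
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith(extensions):
                seen[name.lower()].append(sub)
    return {name: dirs for name, dirs in seen.items() if len(dirs) > 1}
```

Calling it with the 12 sub-library names listed earlier would report each colliding file name together with the directories it appears in.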

When I "Clean" the tesseract project, all the sub-library projects
also get Cleaned? I don't know if this is some strange result of
having tesseract be dependent on the libraries? I don't notice this
happening when I'm building the Leptonica example programs. Given the
other slightly strange behavior I'm seeing, I'm leaning towards just
creating a new tesseract.sln from scratch.

It appears that DLLSYM is no longer really needed? The only things
that should be exported to the DLL are now marked with TESSDLL_EXPORTS
instead? So ccstruct/hpddef.h can be removed?

For my trial DLL I removed the inclusion of ccutil/notdll.h from
tesseract/tesseractmain.h. However, maybe that isn't necessary since
it only sets DLLSYM?

Apps built against these new libraries will also, of course, need
access to the include files. This isn't such a big deal with Leptonica
since there are only a few. However, libtesseract seems to require
about 265 header files? I suppose I could write a short python script
to automatically copy all the headers from the 12 sub-library
directories to a "main" include/tesseract directory (similar to my
include/leptonica directory).
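
A rough sketch of that header-copying script, assuming a flat destination folder and that no two headers share a name. The function name collect_headers is hypothetical.

```python
import os
import shutil


def collect_headers(source_root, subdirs, dest_dir):
    """Copy every .h file from the given sub-directories into one flat folder.

    Returns the sorted, lower-cased names of the copied headers and raises
    ValueError if two sub-directories contain a header with the same name.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = {}  # lower-cased header name -> sub-directory it came from
    for sub in subdirs:
        folder = os.path.join(source_root, sub)
        for name in sorted(os.listdir(folder)):
            key = name.lower()
            if not key.endswith(".h"):
                continue
            if key in copied:
                raise ValueError("%s exists in both %s and %s" % (name, copied[key], sub))
            shutil.copy2(os.path.join(folder, name), os.path.join(dest_dir, name))
            copied[key] = sub
    return sorted(copied)
```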

Shouldn't baseapi.h (instead of baseapi.cpp) include
resultiterator.h? Apps using libtesseract should really only have to
include one file (or two if they also want to use Leptonica functions).

Zdenko Podobný

Nov 6, 2011, 7:45:13 AM
to tesser...@googlegroups.com
Hi Tom,

First of all, thanks for your help. I do not have answers to all your questions, but I would like to share what I know.

Honestly, I installed VC++ 2008 just to fix tesseract 3.00 (at that time it was not possible to compile tesseract in VC++). This may be why the project files and solution are in their current state... There are some relics of the old (2.04) tessdll.dll (TESSDLL_IMPORTS, CCUTIL_IMPORTS ??) that were not updated for the 3.0x versions...

Some information is below...

Zdenko
I think that as long as we keep the build system separate from the tesseract source (as it is now, with everything Windows-related in the vs2008/vs2010 directories), you can change the vs2008 directory structure.

> I could supply a tesseract-ocr-3.01-win32-lib-include-dirs.zip
> similar to my leptonica-1.68-win32-lib-include-dirs.zip that would
> also contain the tesseract libraries and include files. (See my
> question on the header files later on). However, I am NOT volunteering
> to maintain future tesseract-ocr Win32 releases.
>
> *Suggested library names*
>
>    static libraries:
>
>       libtesseract301-static.lib
>       libtesseract301-static-debug.lib
>
>    DLLs:
>
>       libtesseract301.lib  (import library)
>       libtesseract301.dll
>       libtesseract301d.lib (import library)
>       libtesseract301d.dll

ok.


*Questions*

> Are the 12 sub-library static libraries still necessary? The
> solution would be much less complicated (especially when converting to
> VS2010) if you just have one project that builds the entire library
> using all the source files and headers from the existing projects. I
> tried to do this by copying all the files to a new project but got an
> error that indicated that some file name collisions occurred. (I'll
> have to write a short python script to figure out the duplicate
> names).

These are called "convenience libraries" (http://www.gnu.org/s/hello/manual/automake/Libtool-Convenience-Libraries.html#Libtool-Convenience-Libraries). Based on my experience with 3.01, you can put everything (excluding the training directory, tesseractmain.h, and tesseractmain.cpp) into one project file. There is at least one project (http://maxmods.googlecode.com/svn/trunk/tesseract.mod/) that has already done this with tesseract.

From what I heard from Ray Smith (the developer of tesseract), the training part will be improved in the next version and will depend on other libraries. So it makes sense to me to keep training separate.


> When I "Clean" the tesseract project, all the sub-library projects
> also get Cleaned? I don't know if this is some strange result of
> having tesseract be dependent on the libraries? I don't notice this
> happening when I'm building the Leptonica example programs. Given the
> other slightly strange behavior I'm seeing, I'm leaning towards just
> creating a new tesseract.sln from scratch.
>
> It appears that DLLSYM is no longer really needed? The only things
> that should be exported to the DLL are now marked with TESSDLL_EXPORTS
> instead? So ccstruct/hpddef.h can be removed?
>
> For my trial DLL I removed the inclusion of ccutil/notdll.h from
> tesseract/tesseractmain.h. However, maybe that isn't necessary since
> it only sets DLLSYM?

I think this (DLLSYM) is a relic from tessdll... And I am not sure whether ALL things are marked with TESSDLL_EXPORTS.

> Apps built against these new libraries will also, of course, need
> access to the include files. This isn't such a big deal with Leptonica
> since there are only a few. However, libtesseract seems to require
> about 265 header files? I suppose I could write a short python script
> to automatically copy all the headers from the 12 sub-library
> directories to a "main" include/tesseract directory (similar to my
> include/leptonica directory).

Maybe Ray can shed more light on this. On Linux all 265 headers are installed ;-)

Tom Powers

Nov 6, 2011, 6:06:13 PM
to tesseract-dev
I've put tesseract-vs2008-3.01-2011-11-06.zip, my attempt at a Visual
Studio 2008 Solution that builds tesseract-3.01 (with both DLL and
static versions of libtesseract) at http://www.mediafire.com/?2qt5b1s1si6qb42
(I gather you can no longer add attachments to Google Groups
discussions while using their web interface?).

I wrote a README which contains a detailed description of the Solution
and how to use it. It's in vs2008\readme.txt. At the very least,
someone will need to fix "<<<PUT ACTUAL DOWNLOAD URL HERE>>>" when the
final URL is determined for the zip file.

Some notes:

1) I built this Solution from scratch (and by copying some things from
my Leptonica VS2008 solution). I used a short Python script to
create .vcproj compatible entries for all header and source files that
were used in the original 12 libtesseract sub-library Projects
mentioned in my first post in this thread. I then had to add a few
other files manually from other directories (like api and
vs2008\port).

2) NONE of the tesseract-3.01 files were changed in any way. I'm only
supplying an entirely different vs2008 directory (and a few files in a
parent include directory -- see my readme.txt for details).

3) I decided to merge all the libraries into one libtesseract. The
"training" libs were just single files and it didn't seem worth the
bother to separate them. (This also will make it a bit easier for
someone to use the .sln file in VS2010 since during that conversion
process each separate Project needs to be manually fixed). Instead of
20 projects there are now only 7 (and 5 of those are for the
"optional" training applications).

4) I turn off the following warnings (otherwise you get a flood):

/wd4005  'snprintf': macro redefinition
/wd4018  'expression': signed/unsigned mismatch
/wd4244  conversion from 'double' to 'float', possible loss of data
/wd4355  'this': used in base member initializer list
/wd4267  conversion from 'size_t' to 'type', possible loss of data
/wd4305  truncation from 'type1' to 'type2'
/wd4800  forcing value to bool 'true' or 'false' (performance warning)
/wd4996  'function': was declared deprecated

Someone might want to re-enable these warnings and take a look to
see if any of them point at real issues. You can see more information
on these warnings at "C/C++ Build Errors"
(http://msdn.microsoft.com/en-us/library/8x5x43k7(v=VS.90).aspx).

5) I use the "C7 Compatible" Debug Information (/Z7) compiler switch.
This puts the debugging information in the .obj files. That way I
don't have to worry about also sending .pdb files. See "/Z7, /Zi, /ZI
(Debug Information Format)"
(http://msdn.microsoft.com/en-us/library/958x11bc(VS.90).aspx) for more information.

6) I didn't test any of the training apps other than to make sure they
ran and printed out an opening message without crashing. Only minimal
testing of tesseract.exe has been done.

7) Someone needs to fix all the projects' version resources
(double-click on a project's .rc file, and then double-click on the
VS_VERSION_INFO resource to edit). These are used to show version
information on the Detail tab of Windows Explorer's Properties page
for .exe and .dll files. This lets people easily see a description and
version information for any exe or dll. Currently they all say the
same thing (and even then the description is just an example and is
probably wrong).

I didn't manage to figure out how to get the RC Resource Compiler
to use my tesseract_versionnumbers.vsprops Property Sheet, and have
all the .rc files automatically update their version numbers. This
probably requires some sort of voodoo to take
MYVERSION="$(LIBTESS_VERSION_R)" specified as a "Preprocessor Definitions"
Configuration Property to the RC command, and have that converted to
both 3,1,0,0 and "3,1,0,0" within the actual .rc files? I tried using
the "#" & "##" preprocessor operators but couldn't get it to work. Oh
well, there are only 7 resources that would have to be updated each
time a new version of tesseract is released.
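
As an alternative to the preprocessor approach, here is a sketch of a script that rewrites the version fields in the text of an .rc file. It assumes the usual VERSIONINFO layout (FILEVERSION and PRODUCTVERSION lines plus "FileVersion" and "ProductVersion" string values) and that the file has already been read and decoded into a string with "\n" line endings. The function name set_rc_version is hypothetical.

```python
import re


def set_rc_version(rc_text, version):
    """Return rc_text with its numeric and string version fields set to
    version, a 4-tuple such as (3, 1, 0, 0)."""
    numeric = ",".join(str(part) for part in version)
    # Numeric form: FILEVERSION 3,1,0,0 and PRODUCTVERSION 3,1,0,0
    rc_text = re.sub(r"(?m)^(\s*(?:FILE|PRODUCT)VERSION\s+)[\d, ]+$",
                     lambda m: m.group(1) + numeric, rc_text)
    # String form: VALUE "FileVersion", "3,1,0,0" and the ProductVersion twin
    rc_text = re.sub(r'(VALUE\s+"(?:File|Product)Version",\s*")[^"]*(")',
                     lambda m: m.group(1) + numeric + m.group(2), rc_text)
    return rc_text
```

Running this over the 7 .rc files at release time would replace the manual edit, at the cost of keeping the version number in a script instead of in the Property Sheet.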

---
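
Note 1 above mentions a short Python script that creates .vcproj-compatible entries. That script is not shown in this thread, so the following is only a sketch of the idea: it assumes each entry is a File element with a RelativePath attribute, expressed relative to the directory that holds the .vcproj file. The function name vcproj_file_entries is hypothetical.

```python
import os
from xml.sax.saxutils import quoteattr


def vcproj_file_entries(project_dir, source_root, subdirs, extension):
    """Return one <File> element per matching file, with paths given
    relative to project_dir and written with backslashes."""
    entries = []
    for sub in subdirs:
        folder = os.path.join(source_root, sub)
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith(extension):
                rel = os.path.relpath(os.path.join(folder, name), project_dir)
                rel = rel.replace("/", "\\")  # .vcproj paths use backslashes
                entries.append("<File RelativePath=%s></File>" % quoteattr(rel))
    return entries
```

The returned lines would still have to be pasted into the appropriate Filter section of the project file by hand.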

Once people take a look at this and approve it, I can supply an
initial version of the include and lib directories that are needed to
build applications that link with libtesseract for people who don't
want to build it themselves (similar to what I now provide for
Leptonica).

I'll also supply a python script that copies all header files
mentioned in libtesseract.vcproj to a specified directory (preserving
the directory structure).
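
A sketch of what such a script could look like, assuming the project file lists headers as File elements with a RelativePath attribute relative to the project file's own directory. The function names headers_in_vcproj and copy_headers are hypothetical.

```python
import os
import shutil
import xml.etree.ElementTree as ET


def headers_in_vcproj(vcproj_path):
    """Return the RelativePath of every .h <File> entry, with forward slashes."""
    paths = []
    for node in ET.parse(vcproj_path).iter("File"):
        rel = node.get("RelativePath", "")
        if rel.lower().endswith(".h"):
            paths.append(rel.replace("\\", "/"))
    return paths


def copy_headers(vcproj_path, source_root, dest_root):
    """Copy each listed header to dest_root, keeping its path relative to
    source_root. Returns the copied relative paths."""
    project_dir = os.path.dirname(os.path.abspath(vcproj_path))
    source_root = os.path.abspath(source_root)
    copied = []
    for rel in headers_in_vcproj(vcproj_path):
        src = os.path.normpath(os.path.join(project_dir, rel))
        inside = os.path.relpath(src, source_root)
        if inside.startswith(".."):
            continue  # header lives outside the source tree; skip it
        dst = os.path.join(dest_root, inside)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(inside.replace(os.sep, "/"))
    return copied
```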

(But for now I need to take a nap :P )

Ray Smith

Nov 8, 2011, 1:46:37 AM
to tesser...@googlegroups.com
Hi Tom,

Since the original email is so long, and has a lot of questions, I am going to answer those inline first, and then take a look at the newer messages:

It doesn't matter to me, since I no longer use VC++. It was convenient at the time because VC++ didn't have any other way of dealing with subdirectories, and it enabled like source files to be grouped. I think it might be able to do the grouping now anyway.
For the longer term, it might make life simpler to have one big library and the tesseract executable in one solution, and the training programs in another solution. This for the simple reason that the training programs eventually are going to have a long list of other dependencies that I wouldn't want every developer to have to deal with, especially as windows support for those training programs is likely to lag behind linux support.

> When I "Clean" the tesseract project, all the sub-library projects
> also get Cleaned? I don't know if this is some strange result of
> having tesseract be dependent on the libraries? I don't notice this
> happening when I'm building the Leptonica example programs. Given the
> other slightly strange behavior I'm seeing, I'm leaning towards just
> creating a new tesseract.sln from scratch.
>
> It appears that DLLSYM is no longer really needed? The only things
> that should be exported to the DLL are now marked with TESSDLL_EXPORTS
> instead? So ccstruct/hpddef.h can be removed?

DLLSYM is a hangover from pre open-source days and is obsolete.

> For my trial DLL I removed the inclusion of ccutil/notdll.h from
> tesseract/tesseractmain.h. However, maybe that isn't necessary since
> it only sets DLLSYM?

DLLSYM is obsolete and just needs defining to nothing or deleting from everywhere.

> Apps built against these new libraries will also, of course, need
> access to the include files. This isn't such a big deal with Leptonica
> since there are only a few. However, libtesseract seems to require
> about 265 header files? I suppose I could write a short python script
> to automatically copy all the headers from the 12 sub-library
> directories to a "main" include/tesseract directory (similar to my
> include/leptonica directory).

You should be able to get away with the headers in the api directory plus ccstruct/publictypes.h and a few includes listed in baseapi.h.
OK, there is a problem. genericvector.h should not be included in baseapi.h. You can convert it to a declaration.

> Shouldn't baseapi.h (instead of baseapi.cpp) include
> resultiterator.h? Apps using libtesseract should really only have to
> include one file (or two if they also want to use Leptonica functions).

Well yes that's a nice idea, but I would suggest a single include over in your windows directory that just includes the necessary includes (via a relative path) so the application doesn't have to have loads of -I directives to tell the compiler where to find them, and there is no need to copy the normal headers.

Naveen

Nov 8, 2011, 1:33:36 AM
to tesseract-dev
> I've put tesseract-vs2008-3.01-2011-11-06.zip, my attempt at a Visual
> Studio 2008 Solution that builds tesseract-3.01 (with both DLL and
> static versions of libtesseract)
> Once people take a look at this and approve it, I can supply an
> initial version of the include and lib directories that are needed to
> build applications that link with libtesseract for people who don't
> want to build it themselves (similar to what I now provide for
> Leptonica).
> (But for now I need to take a nap :P )

Thank you so much Tom. I learned more about tesseract from your 2
posts than in the hours I had spent on it before. I downloaded your zip
file and followed your instructions and I can confirm they work.
I put the resulting directory structure minus the binary .lib and .dll
files at github:
https://github.com/tinku99/tesseract-ocr
hope you had a nice, well deserved nap.

Ray Smith

Nov 8, 2011, 2:00:53 AM
to tesser...@googlegroups.com
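Last time I looked at the Windows warnings they were pretty much all pointless.

> I'll also supply a python script that copies all header files
> mentioned in libtesseract.vcproj to a specified directory (preserving
> the directory structure).

That seems a bit excessive. There really should only be very few headers needed, but you have to get rid of genericvector.h as that drags in a lot of unnecessary stuff.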

Tom Powers

Nov 8, 2011, 4:20:21 AM
to tesseract-dev
On Nov 7, 10:46 pm, Ray Smith <theraysm...@gmail.com> wrote:
> It doesn't matter to me, since I longer use VC++. It was convenient at the
> time because VC++ didn't have any other way of dealing with subdirectiores,
> and it enabled like source files to be grouped. I think it might be able to
> do the grouping now anyway.

That was the main negative result of going to the One Library
approach. All the header files are lumped together, and the .cpp files
are in another single large section. This makes it pretty difficult to
see any organization at all among the many files.

I was considering using Visual Studio "Solution Folders" as another
way to group all the files beside the Projects. I would have made a
Solution Folder for each former sub-library and added all the files
from its corresponding directory. I do something like this for the
programs in Leptonica's /prog directory. Unfortunately, Solution
Folders are only supported on VS2008, NOT the free VC++ 2008 Express
Edition. So in the end I decided to not bother.

While grouping the files would have been nice, the real answer to
helping people understand libtesseract's architecture is better
documentation :P For now the tesseract-3.01 directory hierarchy
continues to provide at least a little order to the chaos.

> For the longer term, it might make life simpler to have one big library and
> the tesseract executable in one solution, and the training programs in
> another solution. This for the simple reason that the training programs
> eventually are going to have a long list of other dependencies that I
> wouldn't want every developer to have to deal with, especially as windows
> support for those training programs is likely to lag behind linux support.

Okay. It will be trivial to split out the training pieces whenever
that becomes necessary.

> DLLSYM is obsolete and just needs defining to nothing or deleting from
> everywhere.

Probably should be added as a low priority issue. All the DLLSYMs do
clutter up the codebase.

> You should be able to get away with the headers in the api directory plus
> ccstruct/publictypes.h and a few includes listed in baseapi.h.
> OK, there is a problem. genericvector.h should not be included in
> baseapi.h. You can convert it to a declaration.

> >   Shouldn't baseapi.h (instead of baseapi.cpp) include
> > resultiterator.h? Apps using libtesseract should really only have to
> > include one file (or two if they also want to use Leptonica functions).

> Well yes that's a nice idea, but I would suggest a single include over in
> your windows directory that just includes the necessary includes (via a
> relative path) so the application doesn't have to have loads of -I
> directives to tell the compiler where to find them, and there is no need to
> copy the normal headers.

Okay, I manually looked at this and these are the header dependencies
I see:

#include "baseapi.h" (in api)
#include "apitypes.h" (in api)
#include "publictypes.h" (in ccstruct)
#include "genericvector.h" (in ccutil)
#include "tesscallback.h" (in ccutil)
#include "errcode.h" (in ccutil)
#include "host.h" // For NULL. (in ccutil)
#include "helpers.h" (in ccutil)
#include "ndminx.h" (in ccutil)
#include "thresholder.h" (in ccmain)
#include "unichar.h" (in ccutil)
#include "tesscallback.h" (in ccutil)
#include "host.h" // For NULL. (in ccutil)
#include "platform.h" (in ccutil)

#include "resultiterator.h" (in api)
#include "pageiterator.h" (in api)
#include "apitypes.h" (in api)

Unless I missed something that's quite manageable. What is the problem
with genericvector.h?
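
The manual walk above could be automated. The following is a minimal sketch that follows quoted #include lines recursively, assuming every header can be found by file name in one of a list of search directories. It ignores #if/#ifdef conditions, so it may report more headers than a real compile would pull in. The function name include_closure is hypothetical.

```python
import os
import re

INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)


def include_closure(header, search_dirs):
    """Return the set of quoted headers reachable from header, by file name.

    Headers that cannot be found in search_dirs are still listed, but
    they are not opened, so their own includes are not followed.
    """
    found = set()
    pending = [header]
    while pending:
        name = pending.pop()
        if name in found:
            continue  # already visited; this also breaks include cycles
        found.add(name)
        for folder in search_dirs:
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                with open(path, encoding="latin-1") as handle:
                    pending.extend(INCLUDE_RE.findall(handle.read()))
                break
    return found
```

For example, include_closure("baseapi.h", ["api", "ccstruct", "ccutil", "ccmain"]) run from the tesseract-3.01 directory would list the headers a "public" include directory needs for baseapi.h.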

Tom Powers

Nov 8, 2011, 4:33:51 AM
to tesseract-dev
On Nov 7, 10:33 pm, Naveen <naveen.g...@gmail.com> wrote:
> Thank you so much Tom.  I learned more about tesseract from your 2
> posts than the hours i had spent on it before.   I downloaded your zip
> file and followed your instructions and I can confirm they work.
> I put the resulting directory structure minus the binary .lib and .dll
> files at github:https://github.com/tinku99/tesseract-ocr
> hope you had a nice, well deserved nap.

Thanks for confirming that you were able to successfully build. I'm
not sure there needs to be yet another tesseract repository though :)
Eventually, if my solution is approved, I hope someone will just check
it into the "official" svn repository.

Then someone (actually it's probably best if it's NOT me since I'm not
an experienced tesseract user) would then package up the include and
lib directories in a zip file for those people who just want to link
to libtesseract.

Tom Powers

Nov 8, 2011, 4:43:39 AM
to tesseract-dev
On Nov 7, 11:00 pm, Ray Smith <theraysm...@gmail.com> wrote:
> On Sun, Nov 6, 2011 at 3:06 PM, Tom Powers <tomp2...@gmail.com> wrote:
> > 4) I turn off the following warnings (otherwise you get a flood):
>
> >   /wd4005: 'snprintf' : macro redefinition
>
> >   /wd4018 'expression' : signed/unsigned mismatch
>
> >   /wd4244 conversion from 'double' to 'float', possible loss of data
>
> >   /wd4355: 'this' : used in base member initializer list
>
> >   /wd4267 conversion from 'size_t' to 'type', possible loss of data
>
> >   /wd4305 truncation from 'type1' to 'type2'
>
> >   /wd4800: forcing value to bool 'true' or 'false' (performance
> > warning)
>
> >   /wd4996 'function': was declared deprecated
>
> Last time looked at the windows warnings they were pretty much all
> pointless.

In general I agree, but I've found a number of bugs in Leptonica by
looking at the Visual C++ warnings. Selectively disabling warnings lets
possible actual errors stand out. In fact, you should take a look at
issue 573 (http://code.google.com/p/tesseract-ocr/issues/detail?id=573).
That looks like a real error to me?

> > I'll also supply a python script that copies all header files
> > mentioned in libtesseract.vcprog to a specified directory (preserving
> > the directory structure).
>
> That seems a bit excessive.  There really should only be very few headers
> needed, but you have to get rid of genericvector.h as that drags in a lot f
> unnecessary stuff.

I took a look at genericvector.h and it seems to do:

#include "genericvector.h" (in ccutil)
#include "tesscallback.h" (in ccutil)
#include "host.h" // For NULL. (in ccutil)
#include "platform.h" (in ccutil)
#include "errcode.h" (in ccutil)
#include "host.h" // For NULL. (in ccutil)
#include "helpers.h" (in ccutil)
#include "ndminx.h" (in ccutil)

Not sure what the problem you see is?

Ray Smith

Nov 8, 2011, 11:50:21 AM
to tesser...@googlegroups.com
The problem with genericvector.h is that it includes host.h and errcode.h, and these #define some awkwardly common names like ABORT and LOG, which are best kept out of the public includes.
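
One way to see which names a header injects is to list its #define directives. This is only a sketch: it works on the raw text of a single header, does not follow includes or evaluate #if blocks, and the "short and all-uppercase" rule used to flag names is an arbitrary assumption. Both function names are hypothetical.

```python
import re

DEFINE_RE = re.compile(r"^[ \t]*#[ \t]*define[ \t]+([A-Za-z_]\w*)", re.MULTILINE)


def defined_macros(header_text):
    """Return the macro names a header defines, in order of first appearance."""
    names = []
    for name in DEFINE_RE.findall(header_text):
        if name not in names:
            names.append(name)
    return names


def risky_macros(header_text, max_length=6):
    """Flag short, all-uppercase macro names, the ones most likely to
    collide with identifiers in an application's own code."""
    return [name for name in defined_macros(header_text)
            if name.isupper() and len(name) <= max_length]
```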

Tom Powers

Nov 8, 2011, 6:21:18 PM
to tesseract-dev
In preparation for deciding which tesseract header files would need to
be copied to a "public" include directory, I took a look at what
baseapi.h actually does. Given that in baseapi.h there is:

class DENORM;
class ETEXT_DESC;
struct OSResults;
class ROW;
class STRING;
class Dawg;

here are some questions about using the api:

int tesseract::TessBaseAPI::Recognize (ETEXT_DESC *monitor)
int tesseract::TessBaseAPI::RecognizeForChopTest (ETEXT_DESC *monitor)

But how do you get an ETEXT_DESC *?

bool tesseract::TessBaseAPI::ProcessPages (const char *filename,
const char *retry_config, int timeout_millisec, STRING *text_out)
bool tesseract::TessBaseAPI::ProcessPage (Pix *pix, int page_index,
const char *filename, const char *retry_config, int timeout_millisec,
STRING *text_out)

But to get anything useful out of text_out you presumably have to
include "strngs.h" (as tesseractmain.cpp does)? This will then also
include:

#include "memry.h" (in ccutil)
#include "host.h" (in ccutil)
#include "serialis.h" (in ccutil)
#include "memry.h"
#include "errcode.h" (in ccutil)
#include "fileerr.h" (in ccutil)
#include "errcode.h"

at which point you may as well also include "genericvector.h" since
the two offending header files (Ray Smith mentions host.h and
errcode.h in an earlier post) are also required by strngs.h.

bool tesseract::TessBaseAPI::DetectOS (OSResults *)

How do you get an OSResults *?

void tesseract::TessBaseAPI::GetFeaturesForBlob (TBLOB *blob, const
DENORM &denorm, INT_FEATURE_ARRAY int_features, int *num_features, int
*FeatureOutlineIndex)

Seems to me you need normalis.h (in ccstruct), to call this function?
And also intproto.h (in classify), which includes a bunch of other
headers to get INT_FEATURE_ARRAY?

static ROW * tesseract::TessBaseAPI::FindRowForBox (BLOCK_LIST
*blocks, int left, int top, int right, int bottom)

But what can you do with a ROW * (other than call
tesseract::TessBaseAPI::NormalizeTBLOB())?

void tesseract::TessBaseAPI::RunAdaptiveClassifier (TBLOB *blob,
const DENORM &denorm, int num_max_matches, int *unichar_ids, float
*ratings, int *num_matches_returned)

Again, to get a DENORM you presumably need normalis.h?

const Dawg * tesseract::TessBaseAPI::GetDawg (int i) const

But what can you do with a Dawg *?

static ROW * tesseract::TessBaseAPI::MakeTessOCRRow (float
baseline, float xheight, float descender, float ascender)

Again, what can you do with a ROW *?

CubeRecoContext *tesseract::TessBaseAPI::GetCubeRecoContext ()
const

Need cube_reco_context.h which includes the following:

#include "neural_net.h"
#include "lang_model.h"
#include "classifier_base.h"
#include "feature_base.h"
#include "char_set.h"
#include "word_size_model.h"
#include "char_bigrams.h"
#include "word_unigrams.h"

Basically, it seems like parts of the api return "opaque" pointers to
things, but to really get any use out of those objects you really need
to include the actual related headers (which typically include lots of
other headers in turn).

Tom Powers

Nov 8, 2011, 10:00:22 PM
to tesseract-dev
Attached is tesseract-3.01-win32-include-dir-sampleapp-2011-11-08.zip,
the result of my investigations on the "public" header files required
to build applications that link with libtesseract. It also includes a
sample Win32 Console app that demonstrates linking with the "public"
include & lib directories.

My python program copies the union of the following files to the
"public" include folder:

baseIncludeSet = {
    r"api\baseapi.h",
    r"api\apitypes.h",
    r"ccstruct\publictypes.h",
    r"ccmain\thresholder.h",
    r"ccutil\unichar.h",
    r"ccutil\platform.h",
    r"api\resultiterator.h",
    r"api\pageiterator.h",
    r"api\apitypes.h",
}

strngIncludeSet = {
    r"ccutil\strngs.h",
    r"ccutil\memry.h",
    r"ccutil\host.h",
    r"ccutil\serialis.h",
    r"ccutil\errcode.h",
    r"ccutil\fileerr.h",
}

genericVectorIncludeSet = {
    r"ccutil\genericvector.h",
    r"ccutil\tesscallback.h",
    r"ccutil\errcode.h",
    r"ccutil\host.h",
    r"ccutil\helpers.h",
    r"ccutil\ndminx.h",
}

blobsIncludeSet = {
    r"ccstruct\blobs.h",
    r"ccstruct\rect.h",
    r"ccstruct\points.h",
    r"ccstruct\ipoints.h",
    r"ccutil\elst.h",
    r"ccutil\host.h",
    r"ccutil\serialis.h",
    r"ccutil\lsterr.h",
    r"ccutil\ndminx.h",
    r"ccutil\tprintf.h",
    r"ccutil\params.h",
    r"viewer\scrollview.h",
    r"ccstruct\vecfuncs.h",
}

Currently 28 files are in the include\tesseract directory.
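The count of 28 can be checked by taking the union of the four sets, since several headers (host.h, errcode.h, serialis.h, ndminx.h) appear in more than one of them. This is just a sketch of the arithmetic, not the actual script; the snake_case variable names are chosen for illustration.

```python
# The four sets listed above, retyped as Python set literals.
base_include_set = {
    r"api\baseapi.h", r"api\apitypes.h", r"ccstruct\publictypes.h",
    r"ccmain\thresholder.h", r"ccutil\unichar.h", r"ccutil\platform.h",
    r"api\resultiterator.h", r"api\pageiterator.h",
}
strng_include_set = {
    r"ccutil\strngs.h", r"ccutil\memry.h", r"ccutil\host.h",
    r"ccutil\serialis.h", r"ccutil\errcode.h", r"ccutil\fileerr.h",
}
generic_vector_include_set = {
    r"ccutil\genericvector.h", r"ccutil\tesscallback.h", r"ccutil\errcode.h",
    r"ccutil\host.h", r"ccutil\helpers.h", r"ccutil\ndminx.h",
}
blobs_include_set = {
    r"ccstruct\blobs.h", r"ccstruct\rect.h", r"ccstruct\points.h",
    r"ccstruct\ipoints.h", r"ccutil\elst.h", r"ccutil\host.h",
    r"ccutil\serialis.h", r"ccutil\lsterr.h", r"ccutil\ndminx.h",
    r"ccutil\tprintf.h", r"ccutil\params.h", r"viewer\scrollview.h",
    r"ccstruct\vecfuncs.h",
}

# Set union drops the headers that are shared between the groups.
public_headers = (base_include_set | strng_include_set
                  | generic_vector_include_set | blobs_include_set)

print(len(public_headers))  # 28
```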

I decided to include blobs.h since it seemed to me people might want
blob information (and tesseractmain.h includes it).

I'm not positive this is the minimal set of header files needed, but
I have successfully built and run a sample program that includes:

#include "allheaders.h"
#include "baseapi.h"
#include "resultiterator.h"
#include "strngs.h"
#include "blobs.h"

The supplied sample app was created using the VS2008 "Win32 Console
Application" Project Wizard and then just copying most of
tesseractmain.cpp. Its project settings correctly refer to the
"public" include & lib directories using relative paths. The
include\tesseract_versionnumbers.vsprops Property Sheet is used to
avoid explicit library version number dependencies. Precompiled
headers are used. LIB_Debug, LIB_Release, DLL_Debug, and DLL_Release
build configurations are supported.

I believe all the pieces are now available for someone to create an
official tesseract-3.01-win32-include-lib.zip. This would provide all
the files people need to link to libtesseract without having to
compile it themselves (or download the full source). OTOH, I'm not
sure how useful this will be? Most people who are seriously using
libtesseract & Leptonica should download & build them on their own.

I'm still tweaking my python script but I'll make that available in
the near future. I want to add the ability to compare the libtesseract
Project files with the actual files in the "sub-library" directories.
This should make it easier to keep the VS2008 solution in synch as the
underlying files are deleted/added.

tesseract-3.01-win32-include-dir-sampleapp-2011-11-08.zip

zdenko podobny

Nov 9, 2011, 10:02:12 AM11/9/11
to tesser...@googlegroups.com
Thanks for your files.
I just started my tests and found these issues:
  1. fatal error RC1015: cannot open include file 'afxres.h'
  2. project 'cntraining' includes 'cntraining.h' but there is no such file ;-)
As far as I could quickly check on the internet, 'afxres.h' is part of the 'Microsoft Platform SDK' [1]. Is it really needed? I am just trying to avoid another dependency...

Another issue concerns the svn commit ;-) I am fine with removing leptonica from the tesseract source and keeping it as an external library. But the tesseract-relevant files (\include\leptonica_versionnumbers.vsprops, \include\tesseract_versionnumbers.vsprops and \include\stdint.h) should be moved to the \tesseract-3.01\vs2008 directory structure...

Zdenko

Tom Powers

Nov 9, 2011, 1:13:55 PM11/9/11
to tesser...@googlegroups.com
On Wed, Nov 9, 2011 at 7:02 AM, zdenko podobny <zde...@gmail.com> wrote:
> Thanks for your files.
> I just started my tests and found these issues:
>
> fatal error RC1015: cannot open include file 'afxres.h'
> project 'cntraining' includes 'cntraining.h' but there is no such file ;-)
>

Oooops. Removed it.

> As far as I could quickly check on the internet, 'afxres.h' is part of the
> 'Microsoft Platform SDK' [1]. Is it really needed? I am just trying to avoid
> another dependency...
>

Hmmmm. I'm using VS2008 NOT the VC++ 2008 Express Edition. On my
system I see afxres.h in C:\Program Files\Microsoft Visual Studio
9.0\VC\atlmfc\include\. But I gather that the Express Edition doesn't
come with support for MFC and therefore probably doesn't have this
folder? [1]

You can just manually change:

#include "afxres.h"

to

#include "windows.h"

However, any time someone changes the version resource using the
VS2008 Resource Editor, that line will get changed back again. I also
gather that the Express Edition doesn't come with any sort of Resource
Editor? You can just edit the various *.rc files directly to fix the
VS_VERSION_INFO VERSIONINFO section. Alternatively, you could use
something like http://www.resedit.net but that also requires the
Windows Platform SDK.
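For anyone who would rather not hand-edit each resource file, the substitution could be scripted. This is a rough sketch only (Python 3); `patch_rc_text` and `patch_rc_file` are hypothetical helpers, and the latin-1 default is an assumption about how the .rc files are encoded.

```python
import re

# Matches:  #include "afxres.h"   (keeps the '#include ' prefix in group 1)
AFXRES_RE = re.compile(r'^([ \t]*#[ \t]*include[ \t]+)"afxres\.h"',
                       re.MULTILINE | re.IGNORECASE)


def patch_rc_text(text):
    """Return `text` with #include "afxres.h" replaced by "windows.h"."""
    return AFXRES_RE.sub(r'\1"windows.h"', text)


def patch_rc_file(path, encoding="latin-1"):
    """Patch one .rc file in place; return True if it was changed."""
    # newline="" preserves the file's existing CRLF line endings.
    with open(path, encoding=encoding, newline="") as f:
        original = f.read()
    patched = patch_rc_text(original)
    if patched != original:
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(patched)
    return patched != original
```

The patch would have to be re-run whenever the Resource Editor rewrites the include line.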

I'll add a short section to my readme.txt file on using the Express
Edition and that error message.

> Another issue concerns the svn commit ;-) I am fine with removing leptonica
> from the tesseract source and keeping it as an external library. But the
> tesseract-relevant files (\include\leptonica_versionnumbers.vsprops,
> \include\tesseract_versionnumbers.vsprops and \include\stdint.h) should be
> moved to the \tesseract-3.01\vs2008 directory structure...

I'll make a copy of these in a vs2008\include directory. However, the
Project files use a relative path like "..\..\..\include" to reference
the C:\BuildFolder\include directory. Therefore the source package for
Win32 needs to be structured like I have it (but with the additional
vs2008\include directory). That will make it easy to check the vs2008
directory into the SVN repository but still let people just extract
the archive and not have to worry about copying the vs2008\include
directory to C:\BuildFolder\include.
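To make the relative-path point concrete, here is a small sketch of how "..\..\..\include" resolves. It assumes the project file lives in tesseract-3.01\vs2008\libtesseract under C:\BuildFolder, and uses `ntpath` so Windows path rules apply on any platform.

```python
import ntpath

# Assumed location of libtesseract.vcproj's directory.
project_dir = r"C:\BuildFolder\tesseract-3.01\vs2008\libtesseract"

# Three levels up: libtesseract -> vs2008 -> tesseract-3.01 -> BuildFolder.
include_dir = ntpath.normpath(ntpath.join(project_dir, r"..\..\..\include"))

print(include_dir)  # C:\BuildFolder\include
```

This is why the archive layout matters: move the vs2008 directory to a different depth and the same relative path points somewhere else.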

[1] ""How useful is Visual C++ 2008 Express Edition for commercial
use?"" http://jayakrishnan.livejournal.com/3521.html

zdenko podobny

Nov 9, 2011, 3:26:01 PM11/9/11
to tesser...@googlegroups.com
On Wed, Nov 9, 2011 at 7:13 PM, Tom Powers <tomp...@gmail.com> wrote:
> On Wed, Nov 9, 2011 at 7:02 AM, zdenko podobny <zde...@gmail.com> wrote:
>> Thanks for your files.
>> I just started my tests and found these issues:
>>
>> fatal error RC1015: cannot open include file 'afxres.h'
>> project 'cntraining' includes 'cntraining.h' but there is no such file ;-)
>>
>
> Oooops. Removed it.
>
>> As far as I could quickly check on the internet, 'afxres.h' is part of the
>> 'Microsoft Platform SDK' [1]. Is it really needed? I am just trying to avoid
>> another dependency...
>>
>
> Hmmmm. I'm using VS2008 NOT the VC++ 2008 Express Edition. On my
> system I see afxres.h in C:\Program Files\Microsoft Visual Studio
> 9.0\VC\atlmfc\include\. But I gather that the Express Edition doesn't
> come with support for MFC and therefore probably doesn't have this
> folder? [1]

I did not find afxres.h in C:\Program Files\Microsoft Visual Studio
9.0

> You can just manually change:
>
> #include "afxres.h"
>
> to
>
> #include "windows.h"

Thanks, this works.

> However, any time someone changes the version resource using the
> VS2008 Resource Editor, that line will get changed back again. I also
> gather that the Express Edition doesn't come with any sort of Resource
> Editor?

When I click on a resource file, e.g. libtesserac301.rc (in VC++ 2008 Express), I see the message:
Resource Editing is not supported on the Visual C++ Express SKU.

> You can just edit the various *.rc files directly to fix the
> VS_VERSION_INFO VERSIONINFO section. Alternatively, you could use
> something like http://www.resedit.net but that also requires the
> Windows Platform SDK.
>
> I'll add a short section to my readme.txt file on using the Express
> Edition and that error message.
>
>> Another issue concerns the svn commit ;-) I am fine with removing leptonica
>> from the tesseract source and keeping it as an external library. But the
>> tesseract-relevant files (\include\leptonica_versionnumbers.vsprops,
>> \include\tesseract_versionnumbers.vsprops and \include\stdint.h) should be
>> moved to the \tesseract-3.01\vs2008 directory structure...
>
> I'll make a copy of these in a vs2008\include directory. However, the
> Project files use a relative path like "..\..\..\include" to reference
> the C:\BuildFolder\include directory. Therefore the source package for
> Win32 needs to be structured like I have it (but with the additional
> vs2008\include directory). That will make it easy to check the vs2008
> directory into the SVN repository but still let people just extract
> the archive and not have to worry about copying the vs2008\include
> directory to C:\BuildFolder\include.

Thanks for your care.

Tom Powers

Nov 10, 2011, 4:38:59 PM11/10/11
to tesseract-dev
I finished my tesshelper.py Python 2.7 script, whose operations can aid
libtesseract VS2008 Project maintainers. From the help message:

usage: tesshelper.py [-h] [--version] tessDir {compare,report,copy} ...

positional arguments:
  tessDir               tesseract installation directory

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Commands:
  {compare,report,copy}
    compare             compare libtesseract Project with tessDir
    report              report libtesseract summary stats
    copy                copy public libtesseract header files to includeDir

Examples:

Assume that tesshelper.py is in c:\buildfolder\tesseract-3.01\vs2008,
which is also the current directory. Then,

    python27 tesshelper .. compare

will compare c:\buildfolder\tesseract-3.01 "library" directories to the
libtesseract Project
(c:\buildfolder\tesseract-3.01\vs2008\libtesseract\libtesseract.vcproj).

    python27 tesshelper .. report

will display summary stats for c:\buildfolder\tesseract-3.01 "library"
directories and the libtesseract Project.

    python27 tesshelper .. copy ..\..\include

will copy all "public" libtesseract header files to
c:\buildfolder\include.
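A command-line interface of this shape can be put together with `argparse` sub-commands. The sketch below is a guess at the structure based only on the help text above, not the real tesshelper.py source; `build_parser` and the version string are placeholders.

```python
import argparse


def build_parser():
    """Build a parser with one positional argument and three sub-commands."""
    parser = argparse.ArgumentParser(prog="tesshelper.py")
    parser.add_argument("--version", action="version",
                        version="%(prog)s (sketch)")
    parser.add_argument("tessDir", help="tesseract installation directory")

    subparsers = parser.add_subparsers(title="Commands", dest="command")
    subparsers.add_parser(
        "compare", help="compare libtesseract Project with tessDir")
    subparsers.add_parser(
        "report", help="report libtesseract summary stats")
    copy = subparsers.add_parser(
        "copy", help="copy public libtesseract header files to includeDir")
    # Only the copy command takes a destination directory.
    copy.add_argument("includeDir", help="destination include directory")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command, args.tessDir)
```

Putting `tessDir` before the sub-command is what gives the `tesshelper .. compare` calling convention shown in the examples.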

When I first ran the compare command I saw this:

Comparing VS2008 Project "..\vs2008\libtesseract\libtesseract.vcproj" with
"C:\BuildFolder\tesseract-3.01"
10 Extra files (in C:\BuildFolder\tesseract-3.01 but not in Project)
    api\tesseractmain.cpp
    api\tesseractmain.h
    ccutil\scanutils.cpp
    training\cntraining.cpp
    training\combine_tessdata.cpp
    training\mergenf.cpp
    training\mergenf.h
    training\mftraining.cpp
    training\unicharset_extractor.cpp
    training\wordlist2dawg.cpp
7 Dead files (in Project but not in C:\BuildFolder\tesseract-3.01)
    ccutil\mainblk.h
    ccutil\ocrshell.h
    ccutil\tessclas.h
    ccutil\tessopt.h
    ccutil\tordvars.h
    textord\blobcmpl.h
    textord\tospace.h

9 of the files in the "Extra" list are as expected, since those api
and training files are used to make the separate executables and NOT
libtesseract (this script only bothers to check libtesseract.vcproj,
which is also why it didn't catch the extra cntraining.h file). I'm
not sure about scanutils.cpp? scanutils.h does nothing unless EMBEDDED
is defined.
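The "Extra" and "Dead" lists are two set differences between the files on disk and the `<File RelativePath="...">` entries in the .vcproj. A minimal sketch of that idea follows (Python 3); the function names and the sample XML are made up for illustration, and lowercasing paths is an assumption to mimic Windows' case-insensitive file names.

```python
import xml.etree.ElementTree as ET


def normalize(rel_path):
    """Drop leading '..' segments and lowercase, e.g. '..\\..\\ccutil\\Host.h'."""
    parts = [p for p in rel_path.replace("/", "\\").split("\\")
             if p not in ("", ".", "..")]
    return "\\".join(parts).lower()


def project_files(vcproj_xml):
    """Collect the RelativePath of every <File> element in a .vcproj."""
    root = ET.fromstring(vcproj_xml)
    return {normalize(el.get("RelativePath")) for el in root.iter("File")}


def compare(disk_files, proj_files):
    """Return (extra, dead): on disk only, and in the project only."""
    disk = {normalize(p) for p in disk_files}
    return sorted(disk - proj_files), sorted(proj_files - disk)


# Made-up, heavily trimmed project file used only to exercise the functions.
SAMPLE_VCPROJ = r"""<VisualStudioProject Name="libtesseract">
  <Files>
    <Filter Name="Header Files">
      <File RelativePath="..\..\ccutil\host.h"/>
      <File RelativePath="..\..\ccutil\mainblk.h"/>
    </Filter>
  </Files>
</VisualStudioProject>"""

extra, dead = compare({r"ccutil\host.h", r"ccutil\scanutils.cpp"},
                      project_files(SAMPLE_VCPROJ))
```

Here `extra` holds scanutils.cpp (on disk, not in the project) and `dead` holds mainblk.h (in the project, not on disk).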

I've removed the 7 "Dead" files from my libtesseract Project (which
was initially filled by copying the guts of tesseract-3.01-win_vs.zip,
http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-3.01-win_vs.zip).

The compare command now shows the following for my current Project
file:

Comparing VS2008 Project "..\vs2008\libtesseract\libtesseract.vcproj" with
"C:\BuildFolder\tesseract-3.01"
10 Extra files (in C:\BuildFolder\tesseract-3.01 but not in Project)
    api\tesseractmain.cpp
    api\tesseractmain.h
    ccutil\scanutils.cpp
    training\cntraining.cpp
    training\combine_tessdata.cpp
    training\mergenf.cpp
    training\mergenf.h
    training\mftraining.cpp
    training\unicharset_extractor.cpp
    training\wordlist2dawg.cpp
0 Dead files (in Project but not in C:\BuildFolder\tesseract-3.01)

The report command has this to say (it looks better with a fixed width
font):

Summary stats for "C:\BuildFolder\tesseract-3.01" library directories

total    h  cpp
-----  ---  ---
    9    5    4  api
   46   20   26  ccmain
   73   39   34  ccstruct
   65   41   24  ccutil
   61   31   30  classify
   70   39   31  cube
   28   15   13  cutil
   16    7    9  dict
   11    7    4  image
    7    3    4  neural_networks\runtime
   69   34   35  textord
   11    3    8  training
    7    3    4  viewer
    1    1    0  vs2008\libtesseract
    5    3    2  vs2008\port
   44   21   23  wordrec
-----  ---  ---
  523  272  251

Summary stats for VS2008 Project "..\vs2008\libtesseract\libtesseract.vcproj"
  270 Header files
  243 Source files
    1 Resource files
-----
  514
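Per-directory counts like these can be gathered by tallying file extensions in each "library" directory. The following is a sketch of one way to do it (Python 3), not the script's actual code; `count_sources` is a hypothetical name and it only looks at the top level of each directory.

```python
import os
from collections import Counter


def count_sources(tess_dir, subdirs):
    """Map each subdirectory to a (total, h, cpp) tuple of file counts."""
    stats = {}
    for sub in subdirs:
        counts = Counter()
        for name in os.listdir(os.path.join(tess_dir, sub)):
            ext = os.path.splitext(name)[1].lower()
            if ext in (".h", ".cpp"):
                counts[ext] += 1
        # Other file types (makefiles, docs, ...) are ignored.
        stats[sub] = (counts[".h"] + counts[".cpp"],
                      counts[".h"], counts[".cpp"])
    return stats


def print_report(stats):
    """Print the stats as fixed-width columns."""
    print("total    h  cpp")
    for sub in sorted(stats):
        total, h, cpp = stats[sub]
        print("%5d  %3d  %3d  %s" % (total, h, cpp, sub))
```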

The copy command does this:

Copying libtesseract "public" headers to ..\..\include\tesseract

Copied: api\apitypes.h
Copied: api\baseapi.h
Copied: api\pageiterator.h
Copied: api\resultiterator.h
Copied: ccmain\thresholder.h
Copied: ccstruct\blobs.h
Copied: ccstruct\ipoints.h
Copied: ccstruct\points.h
Copied: ccstruct\publictypes.h
Copied: ccstruct\rect.h
Copied: ccstruct\vecfuncs.h
Copied: ccutil\elst.h
Copied: ccutil\errcode.h
Copied: ccutil\fileerr.h
Copied: ccutil\genericvector.h
Copied: ccutil\helpers.h
Copied: ccutil\host.h
Copied: ccutil\lsterr.h
Copied: ccutil\memry.h
Copied: ccutil\ndminx.h
Copied: ccutil\params.h
Copied: ccutil\platform.h
Copied: ccutil\serialis.h
Copied: ccutil\strngs.h
Copied: ccutil\tesscallback.h
Copied: ccutil\tprintf.h
Copied: ccutil\unichar.h
Copied: viewer\scrollview.h
28 header files successfully copied to "..\..\include\tesseract"

Copying libtesseract "extra" headers to ..\..\include

Copied: vs2008\include\leptonica_versionnumbers.vsprops
Copied: vs2008\include\stdint.h
Copied: vs2008\include\tesseract_versionnumbers.vsprops
3 header files successfully copied to "..\..\include"
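The copy step amounts to flattening a list of relative header paths into one destination folder. A minimal sketch, assuming the headers land directly in an include\tesseract subfolder as the output above suggests; `copy_headers` is a hypothetical helper, not the script's own function.

```python
import os
import shutil


def copy_headers(tess_dir, rel_paths, include_dir):
    """Copy each header (given as a backslash-separated path relative to
    tess_dir) into include_dir/tesseract and return the paths copied."""
    dest = os.path.join(include_dir, "tesseract")
    os.makedirs(dest, exist_ok=True)
    copied = []
    for rel in sorted(rel_paths):
        # Split on backslashes so the same list works on any platform.
        src = os.path.join(tess_dir, *rel.split("\\"))
        shutil.copy2(src, dest)  # keeps the file's timestamp
        copied.append(rel)
        print("Copied: %s" % rel)
    return copied
```

Because the directory part is dropped on copy, this only works while no two public headers share a file name.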

I'll add tesshelper.py to my next proposed package of "Windows Project
files for Tesseract 3.01".

I'm currently cleaning up my readme.txt so it works with Sphinx. I'll
upload the next iteration (with a Sphinx-generated vs2008\doc
directory) as soon as I'm done.

I investigated having Sphinx convert the ReST to Google Code
style .wiki files but there don't seem to be any tools for doing
that yet? Maybe pandoc (http://johnmacfarlane.net/pandoc/) can convert
the Sphinx HTML to something compatible with Google Code's wiki pages?
Or maybe once HTML files are added to the Google Code repository, they
can be linked/viewed directly (I seem to remember reading about that
somewhere)? Then one of the existing wiki pages could just link to the
VS2008 HTML readme in the repository.

Naveen

Nov 8, 2011, 5:48:51 PM11/8/11
to tesseract-dev
A good example project using libtesseract would be swig.
I have been trying, but haven't succeeded yet.

Naveen

Nov 9, 2011, 10:00:21 AM11/9/11
to tesseract-dev
Thanks Tom,
I was able to build Python wrappers with swig using the 3 main
headers (baseapi.h, resultiterator.h and pageiterator.h, if I recall
correctly). However, I had to use a few vcproj files from the old
vs2008 dir, such as ccutil, image, and a few others; otherwise I was
getting "function not found" linker errors.
I don't think people need to build them on their own given convenient
Python or C wrappers.
I will publish my swig project once I clean it up and integrate it with
your latest sample.
Naveen

zdenko podobny

Nov 15, 2011, 3:05:44 AM11/15/11
to tesser...@googlegroups.com
Naveen,

I did not test it yet, but there is a Python wrapper [1] for tesseract 3.01. If I understood it correctly, it is built with swig. AFAIK it was not tested with VC++, but it should be a multiplatform solution. Maybe it can save you some time...


Zdenko

[1] http://code.google.com/p/python-tesseract 

Naveen

Nov 15, 2011, 12:14:44 PM11/15/11
to tesseract-dev
I used the wrapper you mention as a guide.

I had to modify it somewhat because that one uses fmemopen, which is not
available for Visual C++.
Instead, I used the ctypes wrapper for leptonica [1] to feed tesseract
a pix from memory.

I added a couple of wrapper functions in baseapi.cpp:

https://github.com/tinku99/tesseract-ocr/commit/05262e91df9e66f6a0656818d89844326fe9b7be#diff-2

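// Runs OCR on an image file and returns the text as a new[]-allocated
// C string; the caller is responsible for releasing it with delete[].
char* TessBaseAPI::ProcessPagesWrapper(const char* image) {
  //printf("ok->%s",text_out);
  STRING mstr;
  ProcessPages(image, NULL, 0, &mstr);
  const char *tmpStr = mstr.string();
  char *retStr = new char[strlen(tmpStr) + 1];
  strcpy(retStr, tmpStr);
  return retStr;
}

// Same, but takes the address of an in-memory Pix passed as an integer.
// Note: an unsigned int can only hold a pointer on 32-bit builds.
char* TessBaseAPI::ProcessPagesBuffer(unsigned int buffer) {
  Pix *pix;
  int page = 0;
  STRING mstr;

  pix = (Pix *) buffer;
  ProcessPage(pix, page, NULL, NULL, 0, &mstr);
  const char *tmpStr = mstr.string();
  char *retStr = new char[strlen(tmpStr) + 1];
  strcpy(retStr, tmpStr);
  //printf("ok->%s",retStr);

  return retStr;
}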

Then, I was able to call it from python as follows:

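import leptonica
from leptonica import *
import tesseract
import Image
import ctypes

# Set up the OCR engine for English with automatic page segmentation.
api = tesseract.TessBaseAPI()
api.SetOutputName("outputName")
api.Init(".", "eng", tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

# Load the image with PIL and convert it to a leptonica Pix.
im = Image.open("eurotext.tif")
pix = PILImageToPix(im)

# Hand the address of the Pix to the new wrapper function.
api.ProcessPagesBuffer(pix._address_.value)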


[1]: http://code.google.com/p/pylepthonica/wiki/Home