visibility

zdenko podobny

Feb 27, 2012, 2:43:53 PM
to tesser...@googlegroups.com
I am forwarding this exchange to the developer forum.
Contributions from anyone with experience in this area would be warmly welcomed.

Zdenko

---------- Forwarded message ----------
From: Tom Powers <tomp...@gmail.com>
Date: Mon, Feb 27, 2012 at 12:31 AM
Subject: Re: Replacing the tesseract 3.02 alpha vs2008 directory
To: zdenko podobny <zde...@gmail.com>


On Sun, Feb 26, 2012 at 1:55 PM, zdenko podobny <zde...@gmail.com> wrote:
> Do you plan to finish "a longish post that discusses visibility issue"[1]?
>  It looks like nobody is an expert on this[2]...
> So any research/help is welcome.
>
> Zdenko
>
> [1]  http://groups.google.com/group/tesseract-dev/msg/109a5ef2e7e88282?pli=1
> [2] http://code.google.com/p/tesseract-ocr/issues/detail?id=287
>

Yes. Basically, you have to change this part of api/baseapi.h [1] to
more closely emulate what is done in the Visibility article [2]. That
part is easy: it would probably take me a few minutes to turn what is
there into something we could use.

The *hard* part is that someone (and I'm *not* familiar enough with
the code to do this) has to explicitly mark everything that needs
public visibility with TESS_API. (Public classes can keep members
private by marking those members with TESS_LOCAL.)

But... maybe you need a two-tier approach? Mark only classes that are
needed to make basic use of TessBaseAPI with TESS_API (like the
tesseract::TessBaseAPI class and STRING class). Then use TESS_FULL_API
when generating shared libraries that need *all* public classes like
tesseract::PageIterator, tesseract::ResultIterator, tesseract::Dawg,
etc. (Or maybe it's not worth the bother? The main difference is that
you'd have to make a bunch more header files public, that is copy them
to the "public" include folder, for TESS_FULL_API builds).

Only once this laborious process is done can you turn on that
"-fvisibility=hidden" switch you tried in issue 287. The switch makes
everything not explicitly marked TESS_API (or TESS_FULL_API) "hidden"
as far as linking goes, and so breaks every use of a symbol that
hasn't been marked. Building tesseract was the easiest case since it
deliberately uses very few classes. Try the same thing with one of the
more exhaustive training apps and watch the entire link blow up in
your face.

I was going to finish my APIExamples solution (and the associated
documentation) first (maybe another week?). Should I drop that for now
and instead work on my Visibility post? If you want to implement
limited visibility then someone probably needs to start on that *RIGHT
NOW*, since it would involve making global (though minor) changes, and
take a bit of futzing to get right. People will also need time to give
feedback on whether all the needed classes had indeed been made
public.

In the meantime, everyone could read "How to Write Shared Libraries"
by Ulrich Drepper, Dec 10, 2011 [3]

BTW, this only applies to dynamic libraries. I'm not sure yet what the
implications are for static libraries.

[1] http://code.google.com/p/tesseract-ocr/source/browse/trunk/api/baseapi.h#66

[2] http://gcc.gnu.org/wiki/Visibility

[3] http://people.redhat.com/drepper/dsohowto.pdf

   -- Tom

James Le Cuirot

Feb 27, 2012, 3:31:09 PM
to tesser...@googlegroups.com
On Mon, 27 Feb 2012 20:43:53 +0100
zdenko podobny <zde...@gmail.com> wrote:

> I am forwarding this communication to developer forums.
> Contribution from somebody with experience in this area is to be
> warmly welcomed.

I'm not an expert in this area but I understand what it entails and the
benefits it brings. If there's one project that needs it badly, it's
this one so kudos for bringing the issue to the table. :)

One guy who is an expert is Diego "Flameeyes" Pettenò. He's extremely
busy with other projects though so I'd prefer that you didn't bother
him but his blog is an invaluable resource on many topics including
this one. Just search for "visibility".

http://blog.flameeyes.eu

Regards,
James

Tom Powers

Feb 27, 2012, 3:36:13 PM
to tesser...@googlegroups.com

Just to get an idea of the size of the visibility issue, I tried the
suggestion from the GCC Visibility article [1], and got:

$ nm -C -D libtesseract.so | wc -l
5846
$ nm -C -D liblept.so | wc -l
2218

where the libtesseract was built a week or so ago. Do you really
need/want 5800+ dynamic symbols exported? For liblept, that number is
a direct reflection of what's listed in leptprotos.h, so presumably
all those symbols really are relevant.

I've started reading "How to Write Shared Libraries" [2]. It's pretty
slow going, and since Ulrich apparently isn't a native English
speaker, some sentences need to be read a few times to understand what
he means.

Other useful references are:

+ The GCC documentation on the -fvisibility option is at "3.18 Options for
Code Generation Conventions" [3]

"Set the default ELF image symbol visibility to the specified
option—all symbols will be marked with this unless overridden within
the code. Using this feature can very substantially improve linking
and load times of shared object libraries, produce more optimized
code, provide near-perfect API export and prevent symbol clashes. It
is strongly recommended that you use this in any shared objects you
distribute."

+ The GCC documentation on the "visibility" attribute is at "6.30
Declaring Attributes of Functions" [4].

-- Tom

[1] http://gcc.gnu.org/wiki/Visibility

[2] http://people.redhat.com/drepper/dsohowto.pdf

[3] http://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html

[4] http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html

Tom Powers

Feb 28, 2012, 12:06:51 AM
to tesser...@googlegroups.com

By contrast, for Windows:

>dumpbin /exports libtesseract302.dll

Microsoft (R) COFF/PE Dumper Version 9.00.30729.01
Copyright (C) Microsoft Corporation. All rights reserved.

Dump of file libtesseract302.dll

File Type: DLL

Section contains the following exports for libtesseract302.dll

00000000 characteristics
4F45AA1B time date stamp Wed Feb 22 18:53:15 2012
0.00 version
1 ordinal base
128 number of functions
128 number of names

It's not surprising that only tesseract.exe can successfully link with
libtesseract302.dll, and that all the training apps can currently only
be linked with the static libraries.

-- Tom

Ray Smith

Feb 28, 2012, 12:18:00 AM
to tesser...@googlegroups.com
Not surprisingly, this looks remarkably like the process for making symbols visible in Windows DLLs. One thing that Windows has, though, is the ability to mark a whole class for export at the class definition. Is there an equivalent for Linux visibility? That would make the process a lot easier.


Tom Powers

Feb 28, 2012, 2:25:03 AM
to tesser...@googlegroups.com
On Mon, Feb 27, 2012 at 9:18 PM, Ray Smith <thera...@gmail.com> wrote:
> Not surprisingly, this looks remarkably like the process for making symbols
> visible in Windows DLLs. One thing that windows has though is the ability to
> mark a whole class for export at the class definition. Is there an
> equivalent for Linux visibility? That would make the process a lot easier.

Yes. That's why the changes needed are minor (except for the fact that
someone has to figure out which classes to mark). Remember that making
classes explicitly visible will of course change the list of "public"
header files that need to be copied to the "public" include directory
--- for those people (normally Windows developers) who don't have the
tesseract source or don't want to bother adding all the tesseract
sub-dirs to their list of include dirs.

The first example in [1] seems dangerous to me because it marks up
declarations even for static libraries (which I recall reading
somewhere is a bad idea, probably on the informative blog [2] James Le
Cuirot mentioned).

The second example in [1] seems clumsy to me since you have to #define
both FOX_DLL and FOX_DLL_EXPORTS. I like the current technique that
both tesseract and leptonica follow which is to say x_EXPORTS,
x_IMPORTS or neither to indicate static library creation/use.

So the following lines in api/baseapi.h:

#ifdef TESSDLL_EXPORTS
#define TESSDLL_API __declspec(dllexport)
#elif defined(TESSDLL_IMPORTS)
#define TESSDLL_API __declspec(dllimport)
#else
#define TESSDLL_API
#endif

need to be changed to something like (I'll test this tomorrow to be sure):

#if defined(_WIN32) || defined(__CYGWIN__)
  #if defined(TESS_EXPORTS)
    #define TESS_API __declspec(dllexport)
  #elif defined(TESS_IMPORTS)
    #define TESS_API __declspec(dllimport)
  #else
    #define TESS_API
  #endif
  #define TESS_LOCAL
#else
  #if __GNUC__ >= 4
    #if defined(TESS_EXPORTS) || defined(TESS_IMPORTS)
      #define TESS_API __attribute__ ((visibility ("default")))
      #define TESS_LOCAL __attribute__ ((visibility ("hidden")))
    #else
      #define TESS_API
      #define TESS_LOCAL
    #endif
  #else
    #define TESS_API
    #define TESS_LOCAL
  #endif
#endif

(Where I got rid of the DLL part of the macros because unix doesn't
use the term)

(I'm also not sure how this affects MinGW builds.)

We could commit the baseapi.h changes (including those below)
immediately to the repository because it doesn't really change
anything until you do the next steps.

In the makefiles you have to define TESS_EXPORTS when building a DLL
or shared library, TESS_IMPORTS when linking with a DLL or shared
library, and neither when building or linking with a static library.
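
To make the three cases concrete, here is a minimal C++ sketch that assumes the TESS_EXPORTS / TESS_IMPORTS names proposed above. TessLinkMode() is a hypothetical helper, not something in tesseract; all it does is report which case the compiler command line selected.

```cpp
#include <string>

// Hypothetical helper (not part of tesseract): reports which of the
// three build modes was selected on the compiler command line.
//   -DTESS_EXPORTS : building the DLL / shared library
//   -DTESS_IMPORTS : linking against the DLL / shared library
//   neither        : building or linking with a static library
inline std::string TessLinkMode() {
#if defined(TESS_EXPORTS)
  return "exports";
#elif defined(TESS_IMPORTS)
  return "imports";
#else
  return "static";
#endif
}
```

Compiled with no extra defines this reports the static case; adding -DTESS_EXPORTS or -DTESS_IMPORTS to the compile line switches it to the matching shared-library case.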

Additionally, on unix you then have to start using -fvisibility=hidden
and -fvisibility-inlines-hidden on shared library builds (still not
sure what effect, if any, those flags have on static libraries). Once
you do this then *only* objects marked with TESS_API will be visible
in shared libraries (on Windows this already happens automatically
which is what is causing all the portability problems).

Now, in the header files, add TESS_API to the *declarations* of
everything you want to make visible in shared libraries/DLLs (you
don't have to change the definitions at all).

So for example, in api/baseapi.h you just have to change:

class TESSDLL_API TessBaseAPI {

to

class TESS_API TessBaseAPI {

to make the entire TessBaseAPI Class visible in shared libraries. Use
the TESS_LOCAL macro with TessBaseAPI members you then *don't* want
exported.

and ccutil/strngs.h needs to go from:

class CCUTIL_API STRING

to

class TESS_API STRING

If you decide, for example, you want to make the entire PageIterator
Class visible, change ccmain/pageiterator.h from:

class PageIterator {

to:

class TESS_API PageIterator {
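
Putting the two macros together, a minimal sketch might look like the following. It assumes simplified stand-ins for the macros (GCC-style attributes only, without the TESS_EXPORTS/TESS_IMPORTS switch) and uses a hypothetical ExampleRecognizer class that is not part of tesseract.

```cpp
// Simplified stand-ins for the proposed macros: GCC-style attributes on
// non-Windows compilers, nothing elsewhere. The real definitions would
// live in one shared header and also honor TESS_EXPORTS / TESS_IMPORTS.
#if defined(__GNUC__) && __GNUC__ >= 4 && !defined(_WIN32) && !defined(__CYGWIN__)
#define TESS_API __attribute__((visibility("default")))
#define TESS_LOCAL __attribute__((visibility("hidden")))
#else
#define TESS_API
#define TESS_LOCAL
#endif

// Hypothetical class, for illustration only (not a tesseract class).
class TESS_API ExampleRecognizer {
 public:
  ExampleRecognizer() : pages_(3) {}

  // Exported along with the class: code outside the library can link to it.
  int PageCount() const;

 private:
  // Hidden even though the class is exported: internal helper only.
  TESS_LOCAL int CountInternal() const;

  int pages_;
};

// Definitions carry no visibility markup; only the declarations do.
int ExampleRecognizer::PageCount() const { return CountInternal(); }
int ExampleRecognizer::CountInternal() const { return pages_; }
```

Inside the library PageCount() can still call CountInternal(); the TESS_LOCAL marking only keeps CountInternal() out of the shared library's dynamic symbol table.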

-- Tom

[1] http://gcc.gnu.org/wiki/Visibility

[2] http://blog.flameeyes.eu/tag/visibility and http://blog.flameeyes.eu/tag/elf

Tom Powers

Feb 28, 2012, 3:12:26 AM
to tesser...@googlegroups.com

Actually api/baseapi.h is probably not the place to put these changes.
Maybe move it to api/apitypes.h instead? And then you also have to
include apitypes.h in any header that needs to use TESS_API (which is
why using baseapi wasn't a good idea).

I should reiterate that I'm no "expert" in any of this. Whoever does
this had better understand all the implications themselves. In
particular, I have no idea what impact the "Problems with C++
exceptions (please read!)" section of [1] will have.
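
For what it's worth, the gist of that section appears to be that an exception type thrown by the library and caught by its caller must keep default visibility, so that both sides see the same type information. A minimal sketch under that assumption follows; TessExampleError, ThrowExample and CatchesExample are hypothetical names, not tesseract code, and the macro is a simplified stand-in for the proposed TESS_API.

```cpp
#include <stdexcept>
#include <string>

// Simplified stand-in for the proposed TESS_API macro.
#if defined(__GNUC__) && __GNUC__ >= 4 && !defined(_WIN32) && !defined(__CYGWIN__)
#define TESS_API __attribute__((visibility("default")))
#else
#define TESS_API
#endif

// Hypothetical exception type. Marking the class TESS_API keeps its type
// information visible even when building with -fvisibility=hidden.
class TESS_API TessExampleError : public std::runtime_error {
 public:
  explicit TessExampleError(const std::string& what)
      : std::runtime_error(what) {}
};

// Stands in for a library function that reports failure by throwing.
inline void ThrowExample() { throw TessExampleError("example failure"); }

// Stands in for client code that catches by the library's exception type.
inline bool CatchesExample() {
  try {
    ThrowExample();
  } catch (const TessExampleError&) {
    return true;
  }
  return false;
}
```

Within a single program this always works; the visibility marking only starts to matter once the throw and the catch sit on opposite sides of a shared library boundary.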

[1] http://gcc.gnu.org/wiki/Visibility

          -- Tom

Tom Powers

Feb 29, 2012, 3:28:15 AM
to tesser...@googlegroups.com
I've successfully built and run tesseract r684 on Ubuntu 11.10 using
my proposed visibility macros. I ended up adding the macros for
TESS_API to ccutil/platform.h. I originally added them to apitypes.h
but found that I then had to add api and ccstruct as additional
include directories for most of the Makefiles.

Since I'm not familiar with all the details of
autoconf/automake/libtool, I had a hard time figuring out how to
change the various Makefile.am files. I attached a diff of all my
changes, but someone else should really figure out the correct changes
to make. I essentially added the following to the AM_CPPFLAGS of all
the "convenience library" Makefile.am files:

-DTESS_EXPORTS -fvisibility=hidden -fvisibility-inlines-hidden

But doing this seems to affect not only the shared library builds but
the static ones as well:

/bin/bash ../libtool --tag=CXX --mode=compile g++ -DHAVE_CONFIG_H
-I. -I.. -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct -I../image
-I../viewer -I../classify -I../dict -I../wordrec -I../cutil
-I../neural_networks/runtime -I../cube -I../textord
-DTESS_EXPORTS -fvisibility=hidden -fvisibility-inlines-hidden
-I/usr/local/include/leptonica
-g -O2 -MT tfacepp.lo -MD -MP -MF .deps/tfacepp.Tpo
-c -o tfacepp.lo tfacepp.cpp

libtool: compile: g++ -DHAVE_CONFIG_H
-I. -I.. -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct -I../image
-I../viewer -I../classify -I../dict -I../wordrec -I../cutil
-I../neural_networks/runtime -I../cube -I../textord
-DTESS_EXPORTS -fvisibility=hidden -fvisibility-inlines-hidden
-I/usr/local/include/leptonica
-g -O2 -MT tfacepp.lo -MD -MP -MF .deps/tfacepp.Tpo
-c tfacepp.cpp -fPIC -DPIC -o .libs/tfacepp.o

libtool: compile: g++ -DHAVE_CONFIG_H
-I. -I.. -DUSE_STD_NAMESPACE -I../ccutil -I../ccstruct -I../image
-I../viewer -I../classify -I../dict -I../wordrec -I../cutil
-I../neural_networks/runtime -I../cube -I../textord
-DTESS_EXPORTS -fvisibility=hidden -fvisibility-inlines-hidden
-I/usr/local/include/leptonica
-g -O2 -MT tfacepp.lo -MD -MP -MF .deps/tfacepp.Tpo
-c tfacepp.cpp -o tfacepp.o >/dev/null 2>&1

mv -f .deps/tfacepp.Tpo .deps/tfacepp.Plo

What's the correct way to change the Makefiles to give separate
CPPFLAGS for shared library builds versus static library builds?

I get two tesseracts, one in api and another in api/.libs. Both
versions of tesseract correctly OCR eurotext.tif.

tesseract-3.02apha/api$ ls -agGF
total 840
drwxrwxr-x 4 4096 2012-02-28 22:44 ./
drwxrwxr-x 26 4096 2012-02-28 23:01 ../
-rw-rw-r-- 1 1392 2012-02-28 21:28 apitypes.h
-rw-rw-r-- 1 72417 2012-02-28 11:48 baseapi.cpp
-rw-rw-r-- 1 30466 2012-02-28 12:04 baseapi.h
drwxrwxr-x 2 4096 2012-02-28 22:41 .deps/
drwxrwxr-x 2 4096 2012-02-28 22:45 .libs/
-rw-rw-r-- 1 909 2012-02-28 22:38 libtesseract_api.la
-rw-rw-r-- 1 351 2012-02-28 22:38 libtesseract_api_la-baseapi.lo
-rw-rw-r-- 1 591420 2012-02-28 22:38 libtesseract_api_la-baseapi.o
-rw-rw-r-- 1 1167 2012-02-28 22:38 libtesseract.la
-rw-rw-r-- 1 28705 2012-02-28 22:44 Makefile
-rwxrw-rw- 1 2775 2012-02-28 22:40 Makefile.am*
-rw-rw-r-- 1 30797 2012-02-28 22:44 Makefile.in
-rwxrwxr-x 1 7482 2012-02-28 22:41 tesseract*
-rw-rw-r-- 1 9928 2012-02-26 11:25 tesseractmain.cpp
-rw-rw-r-- 1 1754 2012-02-22 18:15 tesseractmain.h
-rw-rw-r-- 1 32140 2012-02-28 22:41 tesseract-tesseractmain.o

tesseract-3.02apha/api$ ldd tesseract
not a dynamic executable

tesseract-3.02apha$ api/tesseract eurotext.tif eurotext-static
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0

tesseract-3.02apha$ cat eurotext-static.txt
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspa...@website.com is spam.
Der ,,schnelle” braune Fuchs springt
fiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrén répido salta sobre el perro
perezoso. A raposa marrom rzipida
salta sobre o cfio preguicoso.

tesseract-3.02apha/api/.libs$ ls -agGF
total 68784
drwxrwxr-x 2 4096 2012-02-28 22:45 ./
drwxrwxr-x 4 4096 2012-02-28 22:44 ../
-rw-rw-r-- 1 46665172 2012-02-28 22:38 libtesseract.a
-rw-rw-r-- 1 611716 2012-02-28 22:38 libtesseract_api.a
lrwxrwxrwx 1 22 2012-02-28 22:38 libtesseract_api.la ->
../libtesseract_api.la
-rw-rw-r-- 1 603900 2012-02-28 22:38 libtesseract_api_la-baseapi.o
lrwxrwxrwx 1 18 2012-02-28 22:38 libtesseract.la -> ../libtesseract.la
-rw-rw-r-- 1 1168 2012-02-28 22:38 libtesseract.lai
lrwxrwxrwx 1 21 2012-02-28 22:38 libtesseract.so ->
libtesseract.so.3.0.2*
lrwxrwxrwx 1 21 2012-02-28 22:38 libtesseract.so.3 ->
libtesseract.so.3.0.2*
-rwxrwxr-x 1 22470351 2012-02-28 22:38 libtesseract.so.3.0.2*
-rwxrwxr-x 1 31539 2012-02-28 22:45 lt-tesseract*
-rwxrwxr-x 1 31539 2012-02-28 22:41 tesseract*

tesseract-3.02apha/api/.libs$ ldd tesseract
linux-gate.so.1 => (0x00e8f000)
libtesseract.so.3 => /usr/local/lib/libtesseract.so.3 (0x00110000)
liblept.so.2 => /usr/local/lib/liblept.so.2 (0x00757000)
libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0x004ef000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0x0044d000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0x005da000)
libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0x0046b000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0x00c09000)
libz.so.1 => /lib/i386-linux-gnu/libz.so.1 (0x00486000)
libpng12.so.0 => /lib/i386-linux-gnu/libpng12.so.0 (0x0049b000)
libjpeg.so.62 => /usr/lib/i386-linux-gnu/libjpeg.so.62 (0x0090c000)
libgif.so.4 => /usr/lib/libgif.so.4 (0x004c5000)
libtiff.so.4 => /usr/lib/i386-linux-gnu/libtiff.so.4 (0x00eae000)
/lib/ld-linux.so.2 (0x004cf000)

tesseract-3.02apha$ api/.libs/tesseract eurotext.tif eurotext-shared
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0

tesseract-3.02apha$ cat eurotext-shared.txt
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspa...@website.com is spam.
Der ,,schnelle” braune Fuchs springt
fiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrén répido salta sobre el perro
perezoso. A raposa marrom rzipida
salta sobre o cfio preguicoso.


I now get:

tesseract-3.02apha/api/.libs$ nm -C -D libtesseract.so.3.0.2 | wc -l
448

vs the 5846 dynamic symbols before. So the basic visibility technique
does work as expected.
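
To put a number on the change, here is a tiny sketch; SymbolReductionPercent is a hypothetical helper, not part of tesseract.

```cpp
// Percentage of dynamic symbols removed, given the nm counts from
// before and after enabling hidden visibility.
inline double SymbolReductionPercent(int before, int after) {
  return 100.0 * (before - after) / before;
}
```

With the counts above, SymbolReductionPercent(5846, 448) comes to roughly 92%.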

It seems a bit surprising that the statically linked tesseract is only
7482 bytes, while the version that links with the shared library is
31539 bytes?

For api/Makefile.am I tried setting AM_CPPFLAGS but discovered that it
seems to be ignored once I also set the per-target variables:

libtesseract_api_la_CPPFLAGS = -DTESS_EXPORTS
tesseract_CPPFLAGS = -DTESS_IMPORTS

So I changed those lines to:

libtesseract_api_la_CPPFLAGS = $(AM_CPPFLAGS) -DTESS_EXPORTS
tesseract_CPPFLAGS = $(AM_CPPFLAGS) -DTESS_IMPORTS

BTW,

include_HEADERS = \
apitypes.h baseapi.h tesseractmain.h

but tesseractmain.h isn't a public header so shouldn't it instead be
added to tesseract_SOURCES? Automake's "9.2 Header files" section [1]
says:

"Usually, only header files that accompany installed libraries need
to be installed. Headers used by programs or convenience libraries
are not installed. The noinst_HEADERS variable can be used for such
headers. However when the header actually belongs to a single
convenience library or program, we recommend listing it in the
program's or library's _SOURCES variable (see Program Sources)
instead of in noinst_HEADERS."

I had to remove the training directory from the main Makefile.am,
since as predicted, those applications fail to build when the
-fvisibility=hidden flag is used. As a temporary measure, I suppose
you could somehow force apps in that particular directory to only link
with the static library.

-- Tom

[1] http://www.gnu.org/software/automake/manual/html_node/Headers.html#index-g_t_005fHEADERS-646

Tom Powers

Feb 29, 2012, 3:29:50 AM
to tesser...@googlegroups.com
Ooops forgot to attach my diffs in my previous post.

-- Tom

trial_visibilty_for_r684.patch

Ray Smith

Feb 29, 2012, 11:45:31 PM
to tesser...@googlegroups.com
Hi Tom,

This sounds like a useful change in the right direction, but I am not an autotools expert either.
Unless anyone in this small group objects to this change, I suggest I just add you to the list of developers so that you can check it in yourself.

I am not convinced that you have run two distinct tesseracts here, because of the way libtool generates a script that points to the executable hidden in .libs. Are you sure it didn't just execute the same dynamically linked version?

> It seems a bit surprising that the statically linked tesseract is only
> 7482 bytes, while the version that links with the shared library is
> 31539 bytes?

That's because libtool hides the executable away for some reason I don't understand, and the "statically linked tesseract" is in fact a 7k shell script that executes the binary.
I would love to learn why this is the case, and why the shell script is so huge. Jimmy?

> For api/Makefile.am I tried setting AM_CPPFLAGS but discovered that
> seems to be ignored when I also have to do:
>
>   libtesseract_api_la_CPPFLAGS = -DTESS_EXPORTS
>   tesseract_CPPFLAGS = -DTESS_IMPORTS
>
> So I changed those lines to:
>
>   libtesseract_api_la_CPPFLAGS = $(AM_CPPFLAGS) -DTESS_EXPORTS
>   tesseract_CPPFLAGS = $(AM_CPPFLAGS) -DTESS_IMPORTS
>
> BTW,
>
>  include_HEADERS = \
>      apitypes.h baseapi.h tesseractmain.h
>
> but tesseractmain.h isn't a public header so shouldn't it instead be
> added to tesseract_SOURCES? Automake's "9.2 Header files" section [1]
> says:
>
>   "Usually, only header files that accompany installed libraries need
>   to be installed. Headers used by programs or convenience libraries
>   are not installed. The noinst_HEADERS variable can be used for such
>   headers. However when the header actually belongs to a single
>   convenience library or program, we recommend listing it in the
>   program's or library's _SOURCES variable (see Program Sources)
>   instead of in noinst_HEADERS."

Aha! Great information!

Tom Powers

Mar 1, 2012, 12:41:53 AM
to tesser...@googlegroups.com
On Wed, Feb 29, 2012 at 8:45 PM, Ray Smith <thera...@gmail.com> wrote:
> Hi Tom,
>
> This sounds like a useful change in the right direction, but I am not an
> autotools expert either.
> Unless anyone in this small group objects to this change, I suggest I just
> add you to the list of developers and you can then check them in yourself.

Okay, I think the non-makefile changes are harmless to commit (since
they don't do anything on unix until the appropriate
macros are turned on via the makefiles). And checking those changes in
would make the VS2008 Solution maintenance easier since there I *do*
need to know what macros to define (and where it's really easy to set
them separately for each build configuration).

host.h has two lines in it defining DLLEXPORT & DLLIMPORT that should
be removed since they are redundant.

What should be done with DLLSYM? While obsolete maybe it's still a
hint to which classes/structs need to be visible? If not I can
globally remove it easily enough.

Ah! <blush> yes indeed the api/tesseract is a script that lets you
test tesseract before it is installed to /usr/local/bin. So the next
things I'd like to know are: is the statically linked executable ever
automatically made, what is it called, and where does it go?

Right now all the convenience libraries are putting *all* their header
files in include_HEADERS. At some point we need to figure out which of
these really need to be installed. Eventually the list of
include_HEADERS and my list of files that need to be copied to a
"public" include folder on Windows (currently 13 files) should become
the same thing.

Zdenko Podobný

Mar 3, 2012, 4:51:30 PM
to tesser...@googlegroups.com
I applied Tom's patch to svn with a small improvement:
to use "visibility" you need to run ./configure with the --enable-visibility option. This feature is experimental, so it is not enabled by default.
With '--enable-visibility' the training programs are statically linked, so they are huge (80M static build vs 2.2M shared build). If size matters, the programs can be stripped ('strip' [1] or 'make install-strip').

[1] http://linux.about.com/library/cmd/blcmdl1_strip.htm
(I think) he is correct. I have the "standard" 3.02 version installed in /usr/local/ and the "visibility" build installed in /opt:
level2:/opt/lib64 # nm -DC /opt/lib64/libtesseract.so.3.0.2 | wc
    447    2080   32114
level2:/usr/local/lib64 # nm -DC /usr/local/lib64/libtesseract.so.3.0.2 | wc
   5846   27550  443290

ldd and ls -l report different information for each, so these should be two different tesseract libraries.

>> It seems a bit surprising that the statically linked tesseract is only
>> 7482 bytes, while the version that links with the shared library is
>> 31539 bytes?
>
> That's because libtool hides the executable away for some reason I don't
> understand, and the "statically linked tesseract" is in fact a 7k shell
> script to execute the binary.
> I would love to learn why this is the case, and why the shell script is so
> huge. Jimmy?

>> For api/Makefile.am I tried setting AM_CPPFLAGS but discovered that
>> seems to be ignored when I also have to do:
>>
>>   libtesseract_api_la_CPPFLAGS = -DTESS_EXPORTS
>>   tesseract_CPPFLAGS = -DTESS_IMPORTS
>>
>> So I changed those lines to:
>>
>>   libtesseract_api_la_CPPFLAGS = $(AM_CPPFLAGS) -DTESS_EXPORTS
>>   tesseract_CPPFLAGS = $(AM_CPPFLAGS) -DTESS_IMPORTS
>>
>> BTW,
>>
>>  include_HEADERS = \
>>      apitypes.h baseapi.h tesseractmain.h
>>
>> but tesseractmain.h isn't a public header so shouldn't it instead be
>> added to tesseract_SOURCES? Automake's "9.2 Header files" section [1]
>> says:
>>
>>   "Usually, only header files that accompany installed libraries need
>>   to be installed. Headers used by programs or convenience libraries
>>   are not installed. The noinst_HEADERS variable can be used for such
>>   headers. However when the header actually belongs to a single
>>   convenience library or program, we recommend listing it in the
>>   program's or library's _SOURCES variable (see Program Sources)
>>   instead of in noinst_HEADERS."
>
> Aha! Great information!

Applied in r692: only the header files identified by Tom's Python script [2] are installed. If more files need to be installed, you can adapt the Makefile.am in the relevant directory, or just let me know what should be included in the installation.

[2] http://code.google.com/p/tesseract-ocr/source/browse/trunk/vs2008/tesshelper.py

zdenko podobny

Mar 3, 2012, 5:08:29 PM
to tesser...@googlegroups.com
...libtool does several things for you: it links with the shared archive rather than the static archive. [1]

...This will choose either the static or shared archive from the `libshell.la' Libtool library depending on the target host and any Libtool mode switches mentioned in the`Makefile.am', or passed to configure. [2]

My experience: I get a statically linked executable only if I request one. By default I get programs linked against the shared library. Some Linux distributions have even stopped shipping static libs.

Tom Powers

Mar 3, 2012, 7:09:53 PM
to tesseract-dev
2012/3/3 Zdenko Podobný <zde...@gmail.com>:

> applied in 692 - installed are only those header files that were identified
> by Tom python script [2]. If more files need to be installed you can adapt
> Makefile.am in particular directory, or just let me know what should be
> included in installation.

Looking at zdenko's latest r693, I was surprised that tesseractmain.h
still does:

#include "params.h"
#include "blobs.h"
#include "notdll.h"

because I know that in my APITest VS2008 Solution, I explicitly did
*not* include those headers since they are not required to build
tesseract and not in the "public" 13:

api\apitypes.h
api\baseapi.h
ccmain\thresholder.h
ccstruct\publictypes.h
ccutil\errcode.h
ccutil\fileerr.h
ccutil\host.h
ccutil\memry.h
ccutil\platform.h
ccutil\serialis.h
ccutil\strngs.h
ccutil\tesscallback.h
ccutil\unichar.h

I also wondered how he was able to correctly build, when he now uses
tprintf() in tesseract. The answer is blobs.h eventually includes
tprintf.h and api/Makefile.am is, IMO, incorrectly letting the gcc
compiler poke around in a bunch of tesseract-ocr subdirs looking for
headers.

If we are really going to be "eating what we cook", then we should be
building tesseract (*and* the training apps) in the same kind of
environment as any other project using libtesseract. We have to assume
that the only headers we can see are the "public" headers. This is
exactly analogous to only being able to see visible symbols in the
libtesseract shared library (instead of everything).

I'm admittedly not sure of the best way to do this. Do we make a new
include subdir, add it to the list of directories to search when
building libtesseract, and specify *only* that directory when building
apps that link with libtesseract?

In r693, zdenko added TESS_API visibility to tprintf() in
ccutil/tprintf.h. This is a good example of the impact of such a
change.

1) First of all, he should include platform.h (which is where
TESS_API is defined) in tprintf.h.

2) He *also* has to make sure tprintf.h is a public
header. Unfortunately, tprintf.h includes params.h, params.h includes
genericvector.h (and so on). This is where things get a bit hairy.
Hopefully he really doesn't need to include params.h and can somehow
get around this by refactoring -- I haven't looked at tprintf.cpp
very closely.

3) He should explicitly include tprintf.h in either tesseractmain.cpp
(my preference) or tesseractmain.h.

4) He has to update my tesshelper.py program to add tprintf.h to the list of
public headers.

And, of course, this still avoids the issue that the TessBaseAPI
class currently refers to objects that the caller can do nothing
useful with. To give just two examples from api/baseapi.h:

class PageIterator;
PageIterator* AnalyseLayout();

class ResultIterator;
ResultIterator* GetIterator();

I haven't finished my APIExamples Solution yet, so I don't know if
there are other ways to get the same information from other methods.
Either we make PageIterator and ResultIterator visible, or we should
remove them from TessBaseAPI. This problem has already come up in the
tesseract-ocr newsgroup.

No one said adding visibility support was going to be painless :)

-- Tom

Tom Powers

Mar 3, 2012, 7:50:07 PM
to tesser...@googlegroups.com
On Sat, Mar 3, 2012 at 2:08 PM, zdenko podobny <zde...@gmail.com> wrote:
> My experience: I can have static linked executable only if I request it.
> Automatically I got shared linked programs... Some linux distribution even
> stopped to ship static libs....

Solving Issue 287 [1] and its concern with the number of exported
symbols was one of the motivating factors for addressing the
visibility problem (along with fixing undefined external errors when
building the Windows DLL).

However, just addressing libtesseract's external symbols probably
won't be enough for the original poster. The fact remains that liblept
also currently exports 2200+ symbols.

One option that comes to mind, is to support building a shared library
libtesseract that *statically* links with liblept. This removes the
public dependence on liblept.so and doesn't increase libtesseract's
visible symbol count.

Of course, then users of this library will not be able to use any
liblept functions directly unless they also statically link with
liblept. I'm pretty sure this is safe, since there isn't any
initialization involved with using liblept functions. And from my own
experience, I know that programs that statically link with liblept and
its dependent image libraries can be surprisingly tiny.

The result would be, I believe, the smallest "working set" for the OP?

Given that, should we really contemplate not providing a static
library version of libtesseract? Maybe some other program would like
to link statically with it, in the same way that linking statically
with liblept is sometimes helpful?

I'm not sure what happens if a project links with a shared library
that statically links with libtesseract, at the same time it also
statically links with libtesseract directly?

-- Tom

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=287

Tom Powers

Mar 3, 2012, 7:56:37 PM
to tesser...@googlegroups.com
2012/3/3 Zdenko Podobný <zde...@gmail.com>:

> With '--enable-visibility' the training programs are statically linked, so
> their size is huge (80M static build vs 2.2M shared build). If size matters,
> the programs could be stripped ('strip'[1] or 'make install-strip')

As far as final size goes, perhaps the first answer to "C/C++ gcc & ld
- remove unused symbols" [1] will help?

-- Tom

[1] http://stackoverflow.com/questions/6687630/c-c-gcc-ld-remove-unused-symbols

Ray Smith

Mar 3, 2012, 9:34:49 PM
to tesser...@googlegroups.com
PageIterator and ResultIterator should *definitely* be added to the list of exported/visible classes.
These classes provide the main API to get at detailed recognition results.
Conversely, forget TesseractExtractResult and everything else below the OCROpus cut-line.
Also ETEXT_DESC is not a supported way of getting at the results.

I would think it silly to statically link leptonica to tesseract, as the preferred way to give tesseract an image is as a Pix, and you can't do that if leptonica is hidden. It would only make sense if you could provide at least a basic set of leptonica functionality to the outside world.

Tom Powers

Mar 4, 2012, 2:22:39 AM
to tesser...@googlegroups.com
Attached is my patch to r693 with the following proposed changes:

+ fixes issue [1] where a boolean was being compared to a float
+ removes extra includes from tesseractmain.h
+ removes extra DLLEXPORT & DLLIMPORT from host.h
+ removes CCUTIL_IMPORTS & CCUTIL_EXPORTS from the vs2008 *.vcproj files
+ tesseract prints full version info when the -v arg is used:

vs2008\DLL_Release>tesseract-dll.exe -v
tesseract 3.02
leptonica-1.68 (Feb 21 2012, 05:25:30) [MSC v.1500 DLL Release 32 bit]
libgif 4.1.6 : libjpeg 8c : libpng 1.4.3 : libtiff 3.9.4 : zlib 1.2.5

or:

vs2008\LIB_Release>tesseract.exe -v
tesseract 3.02
leptonica-1.68 (Feb 21 2012, 05:29:12) [MSC v.1500 LIB Release 32 bit]
libgif 4.1.6 : libjpeg 8c : libpng 1.4.3 : libtiff 3.9.4 : zlib 1.2.5

This is very helpful when answering support questions. IMO, all the
training apps should also do this.

More info on static linking: here are the sizes (in bytes) of the
Windows LIB_Release executables, which are statically linked with
libtesseract & liblept:

1,092,096 ambiguous_words.exe
1,311,232 classifier_tester.exe
  616,448 cntraining.exe
  580,608 combine_tessdata.exe
  593,408 dawg2wordlist.exe
  952,832 mftraining.exe
  878,080 shapeclustering.exe
2,349,568 tesseract.exe
  585,216 unicharset_extractor.exe
  677,376 wordlist2dawg.exe

And the sizes of the tesseract DLL version and the Release libraries:

    12,288 tesseract-dll.exe

 1,554,432 libtesseract302.dll
 1,672,192 liblept168.dll

14,674,982 libtesseract302-static.lib

 2,519,302 liblept168-static-mtdll.lib
    70,450 giflib416-static-mtdll.lib
   363,212 libjpeg8c-static-mtdll.lib
   331,028 libpng143-static-mtdll.lib
 1,777,404 libtiff394-static-mtdll.lib
   199,940 zlib125-static-mtdll.lib

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=573

2012-03-03.patch

James Le Cuirot

Mar 4, 2012, 4:34:17 AM
to tesser...@googlegroups.com
On Sat, 3 Mar 2012 16:50:07 -0800
Tom Powers <tomp...@gmail.com> wrote:

> However, just addressing libtesseract's external symbols probably
> won't be enough for the original poster. The fact remains that liblept
> also currently exports 2200+ symbols.

It would be worth bringing Dan Bloomberg into the conversation.
Although he wanted to keep his static makefiles, he was very grateful
for my autotools improvements and we have continued to discuss various
issues since.

> I'm not sure what happens if a project links with a shared library
> that statically links with libtesseract, at the same time it also
> statically links with libtesseract directly?

I'm not sure whether you get an error or whether the direct link takes
precedence but even if it's the latter, this is very bad. If the
library versions are different or are built with different options then
things will almost certainly break. I once tried to force a proprietary
program to use my system Qt libraries. It eventually crashed and burned
because my Qt was built against libpng 1.5 while its Qt was built
against 1.4, which it was also using directly.

Regards,
James

Tom Powers

Mar 4, 2012, 9:20:17 PM
to tesser...@googlegroups.com
On Sat, Mar 3, 2012 at 6:34 PM, Ray Smith <thera...@gmail.com> wrote:
> PageIterator and ResultIterator should *definitely* be added to the list of
> exported/visible classes.
> These classes provide the main API to get at detailed recognition results.

A helpful tip for determining header file dependencies (see [1] for details):

IDIRS="-I../api -I../ccmain -I../ccstruct -I../ccutil -I../classify
-I../cube -I../cutil -I../dict -I../image -I../neural_networks/runtime
-I../textord -I../viewer -I../wordrec"

tesseract-3.02apha/ccmain$ gcc $IDIRS -MM pageiterator.h
pageiterator.o: pageiterator.h ../ccstruct/publictypes.h

tesseract-3.02apha/ccmain$ gcc $IDIRS -MM resultiterator.h
resultiterator.o: resultiterator.h ltrresultiterator.h pageiterator.h \
../ccstruct/publictypes.h ../ccutil/unicharset.h ../ccutil/strngs.h \
../ccutil/platform.h ../ccutil/memry.h ../ccutil/host.h \
../ccutil/serialis.h ../ccutil/errcode.h ../ccutil/fileerr.h \
../ccutil/unichar.h ../ccutil/unicharmap.h ../ccutil/params.h \
../ccutil/genericvector.h ../ccutil/tesscallback.h ../ccutil/helpers.h \
../ccutil/ndminx.h ../ccutil/genericvector.h

tesseract-3.02apha/api$ gcc $IDIRS -MM baseapi.h
baseapi.o: baseapi.h apitypes.h ../ccstruct/publictypes.h \
../ccmain/thresholder.h ../ccutil/unichar.h ../ccutil/tesscallback.h \
../ccutil/host.h ../ccutil/platform.h

tesseract-3.02apha/ccutil$ gcc $IDIRS -MM strngs.h
strngs.o: strngs.h platform.h memry.h host.h serialis.h errcode.h \
fileerr.h

So, to add initial visibility for the PageIterator & ResultIterator
classes, to the original 13 public headers:

api\apitypes.h
api\baseapi.h
ccmain\thresholder.h
ccstruct\publictypes.h
ccutil\errcode.h
ccutil\fileerr.h
ccutil\host.h
ccutil\memry.h
ccutil\platform.h
ccutil\serialis.h
ccutil\strngs.h
ccutil\tesscallback.h
ccutil\unichar.h

we need to only add the following 6 headers:

ccutil/genericvector.h
ccutil/helpers.h
ccutil/ndminx.h
ccutil/params.h
ccutil/unicharmap.h
ccutil/unicharset.h

Of course, the PageIterator and ResultIterator header files forward
declare other classes that may also need to be made visible.

[1] http://gcc.gnu.org/onlinedocs/cpp/Invocation.html#Invocation

          -- Tom

Tom Powers

Mar 4, 2012, 9:34:12 PM
to tesser...@googlegroups.com
On Sun, Mar 4, 2012 at 6:20 PM, Tom Powers <tomp...@gmail.com> wrote:
> we need to only add the following 6 headers:
>
>     ccutil/genericvector.h
>     ccutil/helpers.h
>     ccutil/ndminx.h
>     ccutil/params.h
>     ccutil/unicharmap.h
>     ccutil/unicharset.h

Ooops, should be:

we need to only add the following 8 headers:

ccmain/pageiterator.h
ccmain/resultiterator.h


ccutil/genericvector.h
ccutil/helpers.h
ccutil/ndminx.h
ccutil/params.h
ccutil/unicharmap.h
ccutil/unicharset.h

          -- Tom

Tom Powers

Mar 4, 2012, 9:39:38 PM
to tesser...@googlegroups.com
On Sun, Mar 4, 2012 at 6:20 PM, Tom Powers <tomp...@gmail.com> wrote:
> we need to only add the following 6 headers:
>
>     ccutil/genericvector.h
>     ccutil/helpers.h
>     ccutil/ndminx.h
>     ccutil/params.h
>     ccutil/unicharmap.h
>     ccutil/unicharset.h

Arrrgh! One last time, it should be

we need to only add the following 9 headers:

ccmain/ltrresultiterator.h
ccmain/pageiterator.h
ccmain/resultiterator.h


ccutil/genericvector.h
ccutil/helpers.h
ccutil/ndminx.h
ccutil/params.h
ccutil/unicharmap.h
ccutil/unicharset.h
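
The count of nine can be cross-checked mechanically against the `gcc -MM` output quoted earlier in the thread. What follows is only a rough sketch: the helper names (BaseName, HeadersFromDeps, MissingHeaders, OriginalPublicHeaders) are made up for this illustration, and headers are compared by basename only, which assumes that no two directories contain a header with the same name.

```cpp
#include <set>
#include <sstream>
#include <string>

// Strip any directory part: "../ccutil/host.h" -> "host.h".
std::string BaseName(const std::string& path) {
  const std::string::size_type slash = path.find_last_of("/\\");
  return slash == std::string::npos ? path : path.substr(slash + 1);
}

// Collect header basenames from `gcc -MM` output. The "target.o:" token
// and the backslash line continuations are skipped; duplicates collapse
// because the result is a set.
std::set<std::string> HeadersFromDeps(const std::string& deps) {
  std::set<std::string> headers;
  std::istringstream in(deps);
  std::string token;
  while (in >> token) {
    if (token == "\\" || token.back() == ':') continue;
    headers.insert(BaseName(token));
  }
  return headers;
}

// Headers that `deps` requires but that are not in `already_public`.
std::set<std::string> MissingHeaders(
    const std::string& deps, const std::set<std::string>& already_public) {
  std::set<std::string> missing;
  for (const std::string& header : HeadersFromDeps(deps)) {
    if (already_public.count(header) == 0) missing.insert(header);
  }
  return missing;
}

// `gcc -MM resultiterator.h` output as quoted earlier in the thread.
const char kResultIteratorDeps[] =
    "resultiterator.o: resultiterator.h ltrresultiterator.h pageiterator.h \\\n"
    "../ccstruct/publictypes.h ../ccutil/unicharset.h ../ccutil/strngs.h \\\n"
    "../ccutil/platform.h ../ccutil/memry.h ../ccutil/host.h \\\n"
    "../ccutil/serialis.h ../ccutil/errcode.h ../ccutil/fileerr.h \\\n"
    "../ccutil/unichar.h ../ccutil/unicharmap.h ../ccutil/params.h \\\n"
    "../ccutil/genericvector.h ../ccutil/tesscallback.h ../ccutil/helpers.h \\\n"
    "../ccutil/ndminx.h ../ccutil/genericvector.h\n";

// Basenames of the original 13 public headers.
std::set<std::string> OriginalPublicHeaders() {
  return {"apitypes.h", "baseapi.h",  "thresholder.h", "publictypes.h",
          "errcode.h",  "fileerr.h",  "host.h",        "memry.h",
          "platform.h", "serialis.h", "strngs.h",      "tesscallback.h",
          "unichar.h"};
}
```

Calling MissingHeaders(kResultIteratorDeps, OriginalPublicHeaders()) returns nine names: the three iterator headers plus the six ccutil headers, matching the list above.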

          -- Tom

Zdenko Podobný

Mar 6, 2012, 5:40:01 PM
to tesser...@googlegroups.com
committed in r700

Tom Powers

Mar 7, 2012, 12:13:47 PM
to tesser...@googlegroups.com
Attached is my patch with the following changes:

+ Remove visibility from protected members of tesseract::TessBaseAPI
class by applying TESS_LOCAL macro.

+ Make PageIterator & ResultIterator classes visible by applying TESS_API macro.

+ Fix api/Makefile.am & training/Makefile.am since the build dir is not
the same as the source dir when building from an "external" dir.

Tested on Ubuntu 11.10 via:

cd ~/Builds/tesseract-3.02apha/
./autogen.sh
cd ../Output/tesseract-3.02/
../../tesseract-3.02apha/configure --enable-visibility
make
api/tesseract ../../tesseract-3.02apha/eurotext.tif eurotext

(the training apps fail since they still have undefined references)

After the protected members were removed from the TessBaseAPI class:

~/Builds/Output/tesseract-3.02$ nm -C -D --defined-only api/.libs/libtesseract.so.3.0.2 | wc -l
173

After the PageIterator & ResultIterator classes were made visible:

~/Builds/Output/tesseract-3.02$ nm -C -D --defined-only api/.libs/libtesseract.so.3.0.2 | wc -l
230
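
For anyone following along, the kind of marking that produces these counts can be sketched roughly as below. This is an illustration in the style of the GCC Visibility article, not the contents of the patch: the macro definitions, the TESS_EXPORTS / TESS_IMPORTS switches and the demo::Recognizer class are all assumptions made for the example. With GCC the attributes only matter when the library is built with -fvisibility=hidden, so that everything not marked TESS_API stays out of the dynamic symbol table.

```cpp
#include <string>

// Assumed macro layout (hypothetical; the real header may differ).
#if defined(_WIN32) || defined(__CYGWIN__)
  #if defined(TESS_EXPORTS)
    #define TESS_API __declspec(dllexport)   // building the DLL
  #elif defined(TESS_IMPORTS)
    #define TESS_API __declspec(dllimport)   // using the DLL
  #else
    #define TESS_API                         // static library build
  #endif
  #define TESS_LOCAL
#elif defined(__GNUC__) && __GNUC__ >= 4
  #define TESS_API   __attribute__((visibility("default")))
  #define TESS_LOCAL __attribute__((visibility("hidden")))
#else
  #define TESS_API
  #define TESS_LOCAL
#endif

namespace demo {

// Hypothetical class standing in for a public API class.
class TESS_API Recognizer {
 public:
  // Part of the exported interface.
  std::string Describe() const { return "public:" + Detail(); }

 protected:
  // Usable inside the library, but hidden from the shared library's
  // exported symbols even though the class itself is visible.
  TESS_LOCAL std::string Detail() const { return "hidden"; }
};

}  // namespace demo
```

Marking a member TESS_LOCAL changes nothing for code inside the library; it only keeps that member out of the exported symbols, which is why the count drops once the protected members are hidden.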

-- Tom

2012-03-07MakePageIterator-ResultIteratorVisible.patch

Tom Powers

Mar 8, 2012, 12:30:22 AM
to tesser...@googlegroups.com
Attached is my patch with the following changes:

+ Fix the VS2008 warning about "non dll-interface class
tesseract::LTRResultIterator used as base for dll-interface class
tesseract::ResultIterator" by making LTRResultIterator visible as well.

+ Change the project preprocessor definition of WINDLLNAME, because the
stringizing operator doesn't seem to work when initializing
tessedit_module_name in ccutil/ccutil.cpp (which was omitted in
previous fixes).

+ Update vs2008/tesshelper.py for new public header files.
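
The thread doesn't say exactly how the stringizing failed, so the following is only a guess at one common cause, not a description of ccutil/ccutil.cpp. The `#` operator stringizes its argument exactly as written, without expanding it first, so a name defined on the command line needs a second macro level before it turns into the intended string. DEMO_DLLNAME, the two DEMO_STRINGIZE macros and both functions are hypothetical names used only for this sketch.

```cpp
#include <string>

// Hypothetical stand-in for a name passed on the command line without
// quotes, e.g. -DDEMO_DLLNAME=libdemo302 (or /D in a .vcproj).
#ifndef DEMO_DLLNAME
#define DEMO_DLLNAME libdemo302
#endif

// One level: '#' stringizes the argument exactly as written, before any
// macro expansion takes place.
#define DEMO_STRINGIZE_RAW(x) #x

// Two levels: the outer macro expands its argument first, and the inner
// one then stringizes the expanded text.
#define DEMO_STRINGIZE(x) DEMO_STRINGIZE_RAW(x)

// Yields the macro's own name, "DEMO_DLLNAME".
std::string RawName() { return DEMO_STRINGIZE_RAW(DEMO_DLLNAME); }

// Yields the macro's value, "libdemo302".
std::string ExpandedName() { return DEMO_STRINGIZE(DEMO_DLLNAME); }
```

Defining the name with the quotes already in place in the project settings avoids the stringizing step altogether.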

          -- Tom

2012-03-07WINDOWDLLfix.patch

Ray Smith

Mar 8, 2012, 11:53:20 PM
to tesser...@googlegroups.com
Hi Tom,

I added you as a committer, so you can just check in your updates directly to the svn repository instead of having to prepare patches.
Thanks!
Ray.

Tom Powers

Mar 9, 2012, 6:44:15 AM
to tesser...@googlegroups.com
On Thu, Mar 8, 2012 at 8:53 PM, Ray Smith <thera...@gmail.com> wrote:
> Hi Tom,
>
> I added you as a committer, so you can just check in your updates directly
> to the svn repository instead of having to prepare patches.
> Thanks!
> Ray.

Thanks. Actually, I sort of liked having Zdenko look over my patches
before they went live :) I'll start small initially to make sure I
don't mess the repository up.

OTOH, I have a number of changes to add for the vs2008/doc directory,
and being able to update directly will make that process easier.

-- Tom
