Windows packages (Was: 64-Bit Windows DLL)


zdenko podobny

Sep 28, 2012, 5:09:01 AM
to tesser...@googlegroups.com
On Mon, Sep 24, 2012 at 9:57 AM, troplin <tro...@gmail.com> wrote:
Hello,
 
since the installer for Windows will contain the libtesseract DLL in the next version (v 3.02),

I realised that there has been no discussion about this yet, and I thought that the installer would include only a statically linked tesseract... My understanding is that end users (those interested in the installer) do not need the DLL (maybe I am wrong).

On the other hand, (Windows) programmers will need more than the DLL. For them there should be packages like tesseract-ocr-3.02-vs2008.zip (solution files), tesseract-3.02-win32-lib-include-dirs.zip (library with include files), and maybe a package with an example of how to use the API.

What is your opinion?

--
Zdenko

Tom Powers

Sep 28, 2012, 6:23:14 AM
to tesser...@googlegroups.com
On Fri, Sep 28, 2012 at 2:09 AM, zdenko podobny <zde...@gmail.com> wrote:
> On Mon, Sep 24, 2012 at 9:57 AM, troplin <tro...@gmail.com> wrote:
>>
>> Hello,
>>
>> since the installer for Windows will contain the libtesseract DLL in the
>> next version (v 3.02),
>
>
> I realised that there has been no discussion about this yet, and I thought
> that the installer would include only a statically linked tesseract... My
> understanding is that end users (those interested in the installer) do not
> need the DLL (maybe I am wrong).

I haven't tried to build tesseract in months, but isn't it still
impossible to build any of the training apps using the DLL version of
libtesseract [1]? As such, I see little point in releasing a DLL
version of tesseract.exe.

> On the other hand, (Windows) programmers will need more than the DLL. For
> them there should be packages like tesseract-ocr-3.02-vs2008.zip (solution
> files), tesseract-3.02-win32-lib-include-dirs.zip (library with include
> files), and maybe a package with an example of how to use the API.

I agree. And this brings up just one of the issues I mentioned months
ago [2]. There is a distinction between public & private headers.
Public headers need to be released in their own separate include
directory. I suggested that they go in BuildFolder\include\tesseract
and provided a Python program called tesshelper.py to automatically
copy the relevant headers.
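
For illustration only, here is a minimal sketch of what such a header-copying helper could look like. This is not the actual tesshelper.py: the function name `copy_public_headers`, the way the public headers are listed, and the directory layout are all assumptions made for the example.

```python
import shutil
from pathlib import Path


def copy_public_headers(source_root, dest_dir, public_names):
    """Copy every header whose file name is in public_names into dest_dir.

    Returns the sorted list of names that were copied. A name that is
    listed but not found under source_root is absent from the result,
    so the caller can detect it. (Hypothetical helper, not tesshelper.py.)
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    wanted = set(public_names)
    copied = []
    for header in sorted(Path(source_root).resolve().rglob("*.h")):
        if dest in header.parents:
            continue  # skip headers copied by an earlier run
        if header.name in wanted and header.name not in copied:
            shutil.copy2(header, dest / header.name)
            copied.append(header.name)
    return sorted(copied)
```

Comparing the returned list with the requested names shows which public headers were not found in the source tree.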

However, the current source tree spreads these headers out over
numerous directories and no distinction is made between headers which
need to be public. This leads to errors such as discussed in "Issue
362 - unresolved external symbol" [3].

Shouldn't we change things so that all public headers instead go into
a single directory? I am admittedly ignorant on how all this impacts
linux developers who plan to use libtesseract (shared or otherwise).

Now that we are closer to actually releasing v3.02 and people seem to
be actually addressing the outstanding issues (I was pleasantly
surprised at the recent flurry of source checkins), I would suggest
that the many questions I raised in [2] still need to be answered
(regardless of the separate work on the C-API which I think was a
bit premature).

Sheepishly, I gave up on my overly ambitious plans mentioned in that
thread when it resulted in zero responses (and I got a little
overwhelmed at the complexity and frankly strange behavior of some of
the more esoteric baseapi functions). However, given the recent uptick
in activity, I guess I'll fire up TortoiseSVN, VS2008, and Ubuntu
again (have people been testing the --enable-visibility flag?) this
weekend and see if I can make more progress.

Slightly offtopic. Has anyone read:

"API Design for C++" by Martin Reddy
http://www.amazon.com/API-Design-C-Martin-Reddy/dp/0123850037/
Paperback: 472 pages
Publisher: Morgan Kaufmann; 1 edition (February 18, 2011)
http://APIBook.com/

Reading it is another thing that's been on my todo list for months :)
Certainly the TOC looks interesting [4].

[1] http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/building.html#building-the-training-applications

[2] "Visibility" support summary and future work
https://groups.google.com/forum/?fromgroups=#!topic/tesseract-dev/kcBEJY0s9H8

[3] https://groups.google.com/forum/?fromgroups=#!topic/tesseract-dev/S8w7cfzr4kE

[4] http://www.apibook.com/blog/contents

-- Tom

troplin

Sep 28, 2012, 8:19:04 AM
to tesser...@googlegroups.com
Yes, as a programmer I need the tesseract-ocr-3.02-vs2008.zip.
But once I have finished programming, I want to distribute the product to the customer.
 
At the moment, we have our own installer for Tesseract, containing only the DLL and the tessdata directory. Our application just tries to load the DLL, and if that succeeds it can use it.
Bundling tesseract directly with our product is not a feasible solution, since it's just too much data and most of our customers don't use it anyway.
 
My hope was that in version 3.02 we could just tell our customers to use the official installer to install the DLL and data files.
The tesseract executable can of course still be statically linked, if this is better for performance or stability reasons.
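
The "try to load the DLL, use it if that works" pattern described above can be sketched roughly as follows. This is only an illustration in Python using `ctypes`, not the poster's application code. The DLL name `libtesseract302.dll` is taken from this thread; the helper names are made up, and it is an assumption that the installed DLL exports the C-API function `TessVersion` under that undecorated name.

```python
import ctypes


def load_optional_library(candidates):
    """Return the first library in candidates that loads, else None."""
    for name in candidates:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue  # not installed or not on the search path
    return None


def has_symbol(library, symbol):
    """True if the loaded library exports the given symbol."""
    return library is not None and hasattr(library, symbol)


# Whether this DLL is found depends on how (and whether) Tesseract
# was installed; the application simply disables OCR if it is not.
tesseract = load_optional_library(["libtesseract302.dll"])
ocr_available = has_symbol(tesseract, "TessVersion")
```

With this approach the application has no hard dependency on Tesseract: when the library is missing, `ocr_available` is simply `False`.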
 
Tobi

troplin

Sep 28, 2012, 8:39:10 AM
to tesser...@googlegroups.com

On Friday, September 28, 2012 at 12:23:36 UTC+2, Tom Powers wrote:

I haven't tried to build tesseract in months, but isn't it still
impossible to build any of the training apps using the DLL version of
libtesseract [1]? As such, I see little point in releasing a DLL
version of tesseract.exe.
 
Actually, I don't care if the executable is statically linked or dynamically. I just hoped that the DLL would be included.
 
> On the other hand, (Windows) programmers will need more than the DLL. For
> them there should be packages like tesseract-ocr-3.02-vs2008.zip (solution
> files), tesseract-3.02-win32-lib-include-dirs.zip (library with include
> files), and maybe a package with an example of how to use the API.

I agree. And this brings up just one of the issues I mentioned months
ago [2]. There is a distinction between public & private headers.
Public headers need to be released in their own separate include
directory. I suggested that they go in BuildFolder\include\tesseract
and provided a Python program called tesshelper.py to automatically
copy the relevant headers.

However, the current source tree spreads these headers out over
numerous directories and no distinction is made between headers which
need to be public. This leads to errors such as discussed in "Issue
362 - unresolved external symbol" [3].
 
This error was not caused by the headers being in different directories, but by the class not being exported from the DLL.
But I agree about making the distinction between public and private headers. This makes it much easier to export the right classes and functions.
 
Shouldn't we change things so that all public headers instead go into
a single directory?
 
Agreed
 

I am admittedly ignorant on how all this impacts
linux developers who plan to use libtesseract (shared or otherwise).

Isn't that how it is usually done on linux anyway? 
 
Now that we are closer to actually releasing v3.02 and people seem to
be actually addressing the outstanding issues (I was pleasantly
surprised at the recent flurry of source checkins), I would suggest
that the many questions I raised in [2] still need to be answered
(regardless of the separate work on the C-API which I think was a
bit premature).
 
Why do you think the work on the C-API was premature? I had to do it anyway, so why not release it to the public? Granted, creating the full API was more than I really needed...
Actually, the C-API is quite clean and self-contained. It has only one header dependency (on platform.h), and it is really clear what is private and what is public.
The C-API is essential for me and a precondition for contributing to tesseract.
 
Sheepishly, I gave up on my overly ambitious plans mentioned in that
thread when it resulted in zero responses (and I got a little
overwhelmed at the complexity and frankly strange behavior of some of
the more esoteric baseapi functions). However, given the recent uptick
in activity, I guess I'll fire up TortoiseSVN, VS2008, and Ubuntu
again (have people been testing the --enable-visibility flag?) this
weekend and see if I can make more progress.
 
I think this is an incremental process and it is going in the right direction now.
BTW I like your summary in that thread.
 
Tobi

Ray Smith

Sep 28, 2012, 12:59:52 PM
to tesser...@googlegroups.com
I would have thought that windows developers would want a tesseract DLL for integrating into some run-time application, but they have absolutely no real need to be able to build the training tools using the DLL, as they are used only for training.

Putting all the includes necessary for export to a DLL into a separate directory breaks the dependency hierarchy and creates circular dependencies.
Recognita is a prime example of this. Because all the low-level code uses the (Recognita) API, which depends on everything, everything depends on everything.
It took a lot of work to clean up the dependencies in Tesseract to a clean hierarchy.
A far better solution would be to have a separate header file in the api directory that includes all the headers needed by anything that uses the DLL. It would cut down on the number of includes that need to be made, but it probably would not remove the need to specify a long list of directories in the build tools for apps that use the DLL. For that there is copy-paste. Even this header should not be included in baseapi.h though, because of the namespace pollution problem.
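
To make the circular-dependency concern concrete, here is a small sketch that looks for a cycle in an include graph. The directory names `api`, `wordrec` and `ccutil` exist in the Tesseract source tree, but the edges in both example graphs are invented for illustration and do not describe the real dependencies.

```python
def find_cycle(includes):
    """Return one include cycle as a list of names, or None if there is none.

    includes maps each module to the modules it includes directly.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {name: WHITE for name in includes}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for dep in includes.get(node, ()):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: that is a cycle
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for name in list(includes):
        if colour[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None


# Hypothetical graphs: a clean hierarchy, and one where low-level code
# includes the high-level API header.
layered = {"api": ["wordrec"], "wordrec": ["ccutil"], "ccutil": []}
tangled = {"api": ["wordrec"], "wordrec": ["ccutil"], "ccutil": ["api"]}
```

In the second graph every module ends up depending on every other one, which is the "everything depends on everything" situation described above.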

BaseAPI includes a lot of crap that isn't needed by real apps that use the DLL. The idea behind the ResultIterator and friends is that BaseAPI users shouldn't need a huge number of includes. DLL-based apps shouldn't be touching PAGE_RES for instance, even though access to it is exposed by some of the API functions.

zdenko podobny

Sep 28, 2012, 4:18:51 PM
to tesser...@googlegroups.com
On Fri, Sep 28, 2012 at 12:23 PM, Tom Powers <tomp...@gmail.com> wrote:
On Fri, Sep 28, 2012 at 2:09 AM, zdenko podobny <zde...@gmail.com> wrote:
> On Mon, Sep 24, 2012 at 9:57 AM, troplin <tro...@gmail.com> wrote:
>>
>> Hello,
>>
>> since the installer for Windows will contain the libtesseract DLL in the
>> next version (v 3.02),
>
>
> I realised that there has been no discussion about this yet, and I thought
> that the installer would include only a statically linked tesseract... My
> understanding is that end users (those interested in the installer) do not
> need the DLL (maybe I am wrong).

I haven't tried to build tesseract in months, but isn't it still
impossible to build any of the training apps using the DLL version of
libtesseract [1]? As such, I see little point in releasing a DLL
version of tesseract.exe.

Training apps can only be built with static linking.
 
> On the other hand, (Windows) programmers will need more than the DLL. For
> them there should be packages like tesseract-ocr-3.02-vs2008.zip (solution
> files), tesseract-3.02-win32-lib-include-dirs.zip (library with include
> files), and maybe a package with an example of how to use the API.

I agree. And this brings up just one of the issues I mentioned months
ago [2]. There is a distinction between public & private headers.
Public headers need to be released in their own separate include
directory. I suggested that they go in BuildFolder\include\tesseract
and provided a Python program called tesshelper.py to automatically
copy the relevant headers.

I followed your investigation in the autotools build: the same headers will be installed by autotools.

However, the current source tree spreads these headers out over
numerous directories and no distinction is made between headers which
need to be public. This leads to errors such as discussed in "Issue
362 - unresolved external symbol" [3].

Shouldn't we change things so that all public headers instead go into
a single directory? I am admittedly ignorant on how all this impacts
linux developers who plan to use libtesseract (shared or otherwise).

On Linux (or with the autotools build ;-)) all headers go (went) to a single directory, $PREFIX/include/tesseract. I think it is great that the VC++ build will do the same. It should make life easier for multi-platform code.
 
Now that we are closer to actually releasing v3.02 and people seem to
be actually addressing the outstanding issues (I was pleasantly
surprised at the recent flurry of source checkins), I would suggest
that the many questions I raised in [2] still need to be answered
(regardless of the separate work on the C-API which I think was a
bit premature).

I do not think the C-API work is premature. An official C-API will make life easier for those who use C, Python (Java...). It is just another option offered to developers. And of course any ideas on how to improve it are welcome.
 
Sheepishly, I gave up on my overly ambitious plans mentioned in that
thread when it resulted in zero responses (and I got a little
overwhelmed at the complexity and frankly strange behavior of some of
the more esoteric baseapi functions). However, given the recent uptick
in activity, I guess I'll fire up TortoiseSVN, VS2008, and Ubuntu
again (have people been testing the --enable-visibility flag?) this
weekend and see if I can make more progress.

I tested/used it shortly after it was implemented and I did not see any problem with it (well, if we leave out the training apps, which would require almost everything to be visible ;-) ). Maybe that is because of the scope of my interest in tesseract.
 

--
Zdenko

troplin

Sep 30, 2012, 5:24:47 AM
to tesser...@googlegroups.com
On Friday, September 28, 2012 at 18:59:53 UTC+2, Ray wrote:
I would have thought that windows developers would want a tesseract DLL for integrating into some run-time application, but they have absolutely no real need to be able to build the training tools using the DLL, as they are used only for training.

That was my intent.
What's the plan now?
 
Putting all the includes necessary for export to a DLL into a separate directory breaks the dependency hierarchy and creates circular dependencies.

How does the directory structure change dependencies?
 
Recognita is a prime example of this. Because all the low-level code uses the (Recognita) API, which depends on everything, everything depends on everything.

If low-level code depends on the high-level API, there's something wrong with it.
Maybe then it would be helpful to create some kind of base library, containing all really basic stuff like vectors, arrays, lists etc. Everything that is needed by the high-level API as well as the low-level code, but is not itself high-level.
 
It took a lot of work to clean up the dependencies in Tesseract to a clean hierarchy.
A far better solution would be to have a separate header file in the api directory that includes all the headers needed by anything that uses the DLL. It would cut down on the number of includes that need to be made, but it probably would not remove the need to specify a long list of directories in the build tools for apps that use the DLL. For that there is copy-paste. Even this header should not be included in baseapi.h though, because of the namespace pollution problem.

I still don't get what you mean by those public headers that are not really public and should not be included in baseapi.h... Either they are needed for baseapi.h, or they are not needed at all.
Are those additional classes/methods needed by the training applications? If yes, you should probably also design a public training API in addition to the existing public recognition API.
 
BaseAPI includes a lot of crap that isn't needed by real apps that use the DLL. The idea behind the ResultIterator and friends is that BaseAPI users shouldn't need a huge number of includes. DLL-based apps shouldn't be touching PAGE_RES for instance, even though access to it is exposed by some of the API functions.

Tesseract v3 API is already much better than v2 was. Especially due to those Iterators.

Tobi

troplin

Dec 27, 2012, 6:02:51 AM
to tesser...@googlegroups.com


On Friday, September 28, 2012 at 11:09:32 UTC+2, Zdenko Podobný wrote:
Sorry to be annoying but is there any chance that we could include the Windows DLL in the installer? Maybe in the next update? Or a minor patch to 3.02?
I would do it myself, but I have absolutely no idea how the windows installer is built. Maybe it would be good to add the installer templates to the repository, so that they are versioned too and can be modified by everyone.

Just to restate my reasons:
We (optionally) use the Tesseract DLL in our application. Ideally we could just check if the official tesseract DLL is installed and use that one.
Currently we have our own tesseract installer for windows that includes the DLL, but this is a worthless duplication of efforts. It's also a bit confusing, some of our customers already have installed the official tesseract package and expect it to work out of the box. It's a bit strange that they have to install it a second time.

I also don't think that this is an exotic setup. With the API getting more stable and useful, other applications may start using it instead of launching the command line tool.

Tobi

zdenko podobny

Dec 27, 2012, 2:56:35 PM
to tesser...@googlegroups.com
On Thu, Dec 27, 2012 at 12:02 PM, troplin <tro...@gmail.com> wrote:


Sorry to be annoying but is there any chance that we could include the Windows DLL in the installer? Maybe in the next update? Or a minor patch to 3.02?
 
Can you please be more specific about what you mean by "include the Windows DLL in the installer"? Is there any problem with installing the tesseract libraries with the installer?
 
I would do it myself, but I have absolutely no idea how the windows installer is built. Maybe it would be good to add the installer templates to the repository, so that they are versioned too and can be modified by everyone.

Committed as r815.
[Attachment: tesseract-ocr-setup-3.02.02.png]

troplin

Jan 3, 2013, 3:35:20 AM
to tesser...@googlegroups.com
On Thursday, December 27, 2012 at 20:56:35 UTC+1, Zdenko Podobný wrote:
On Thu, Dec 27, 2012 at 12:02 PM, troplin <tro...@gmail.com> wrote:
Sorry to be annoying but is there any chance that we could include the Windows DLL in the installer? Maybe in the next update? Or a minor patch to 3.02?
 
Can you please be more specific about what you mean by "include the Windows DLL in the installer"? Is there any problem with installing the tesseract libraries with the installer?
 
Oh, I didn't notice that.
Previously, the answer seemed to be 'no', so I didn't even try.
But actually it doesn't work, I'm getting an error 404 when it tries to download the libtesseract library. liblept works however.
 
Also, when installing all language data, I'm getting some additional errors (some 404 and sometimes it tries to overwrite existing files).
 
Also I think the installer is mainly designed for developers instead of users. I would group the contents in the following way:
 
(x) Install by default
(-) Mixed content
( ) Don't install by default
  • (-) Components required for normal use
    • (x) Executable
    • (x) Release-DLLs without headers. (libtesseract302.dll, liblept168.dll) Only one bullet point, since libtesseract does not work without liblept
    • (-) Language data
      • (x) English
      • (x) OSD
      • ( ) Other languages
    • (x) Basic documentation for users
  • ( ) Components required for advanced users (Everything required for customization of recognition)
    • ( ) Training tools
    • ( ) Tools for manipulating language data
    • ( ) Documentation for those tools
    • ...
  • ( ) Components required for developers that use tesseract in their products
    • ( ) Public tesseract and leptonica headers.
    • ( ) Debug DLLs
    • ( ) Tesseract and leptonica stub libraries (libtesseract302.lib, liblept168.lib). Those used for linking to the DLLs.
    • ( ) API documentation / doxygen
    • ...
  • ( ) Components required for tesseract developers
    • ( ) Tesseract source
    • ( ) static libraries
    • ( ) VS 2008 tools
    • ...
What do you think? Did I forget something?
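
As a rough sketch of the (x) / (-) / ( ) notation used in the list above, the state of a group can be derived from its children. Only part of the proposed tree is reproduced here, and the data layout and the function name `group_state` are assumptions made for the example; this is not how the NSIS installer stores its sections.

```python
def group_state(node):
    """Return 'x', ' ' or '-' for a component or a group of components.

    A leaf is a bool (installed by default or not). A group is a dict of
    children: 'x' if all children are 'x', ' ' if all children are ' ',
    and '-' (mixed content) otherwise.
    """
    if isinstance(node, bool):
        return "x" if node else " "
    states = {group_state(child) for child in node.values()}
    if states == {"x"}:
        return "x"
    if states == {" "}:
        return " "
    return "-"


# Part of the grouping proposed above, with the default selections.
components = {
    "normal use": {
        "executable": True,
        "release DLLs": True,
        "language data": {"English": True, "OSD": True, "other languages": False},
        "basic documentation": True,
    },
    "advanced users": {
        "training tools": False,
        "language data tools": False,
        "tool documentation": False,
    },
}
```

Here `group_state(components["normal use"])` gives `'-'` because the language data group is itself mixed, while the advanced users group evaluates to `' '`.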
 
If I find some time, I will get my hands dirty and try to modify the template.
 
I would do it myself, but I have absolutely no idea how the windows installer is built. Maybe it would be good to add the installer templates to the repository, so that they are versioned too and can be modified by everyone.
 
Committed as r815.
 
Great! 
 
Some really minor nitpicks concerning the installer:
  • I find the behavior of the feature checkboxes in the installer really counterintuitive. Normally you get the description by clicking on a feature, not by hovering over it. But here clicking on the text toggles the selection.
  • The installer succeeds even if there are errors, and you really don't know what actually succeeded and what did not.
  • No repair feature, just uninstall and install.
  • Executable instead of MSI
All of those points are probably due to the use of NSIS, which I'm not so fond of.
However, I must acknowledge that you probably don't want to invest much time learning WiX. And I'm not a WiX expert either, so NSIS is probably the best solution currently.
 
The reason I mention these points in the first place is that, with tesseract getting better and easier to use, it will attract professional users (professional in the sense of money, not competence). A standard MSI just looks more professional than a custom installer. And you can use the scripting functionality of msiexec.
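
As an example of the msiexec scripting mentioned above, an unattended installation could be driven roughly like this. The options `/i`, `/qn`, `/norestart` and `/l*v` are standard msiexec switches; the package name in the usage note is hypothetical, since no MSI package for Tesseract exists in this thread, and the helper names are made up.

```python
import subprocess
import sys


def msiexec_install_command(msi_path, log_path=None):
    """Build the argument list for a silent msiexec installation."""
    command = ["msiexec", "/i", msi_path, "/qn", "/norestart"]
    if log_path:
        command += ["/l*v", log_path]  # verbose log for troubleshooting
    return command


def install_silently(msi_path, log_path=None):
    """Run the installation and return msiexec's exit code (Windows only)."""
    if sys.platform != "win32":
        raise RuntimeError("msiexec is only available on Windows")
    return subprocess.run(msiexec_install_command(msi_path, log_path)).returncode
```

For instance, `install_silently("tesseract-ocr.msi", "install.log")` would install a package of that (made-up) name without any dialogs and leave a verbose log behind.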
 
Tobi

zdenko podobny

Jan 6, 2013, 1:42:28 PM
to tesser...@googlegroups.com
On Thu, Jan 3, 2013 at 9:35 AM, troplin <tro...@gmail.com> wrote:
On Thursday, December 27, 2012 at 20:56:35 UTC+1, Zdenko Podobný wrote:
On Thu, Dec 27, 2012 at 12:02 PM, troplin <tro...@gmail.com> wrote:
Sorry to be annoying but is there any chance that we could include the Windows DLL in the installer? Maybe in the next update? Or a minor patch to 3.02?
 
Can you please be more specific about what you mean by "include the Windows DLL in the installer"? Is there any problem with installing the tesseract libraries with the installer?
 
Oh, I didn't notice that.
Previously, the answer seemed to be 'no', so I didn't even try.
But actually it doesn't work, I'm getting an error 404 when it tries to download the libtesseract library. liblept works however.
 
Also, when installing all language data, I'm getting some additional errors (some 404 and sometimes it tries to overwrite existing files).
 
Also I think the installer is mainly designed for developers instead of users. I would group the contents in the following way:
 
(x) Install by default
(-) Mixed content
( ) Don't install by default
  • (-) Components required for normal use
    • (x) Executable
    • (x) Release-DLLs without headers. (libtesseract302.dll, liblept168.dll) Only one bullet point, since libtesseract does not work without liblept
I am not sure this is needed by default: tesseract is linked statically (because of the training programs; search the archive for the reason), so a common user does not need it.
    • (-) Language data
      • (x) English
      • (x) OSD
      • ( ) Other languages
    • (x) Basic documentation for users
I forgot to include the manual pages (HTML files in the doc directory). Maybe downloading the PDF documentation files (from the svn repository) could be another option...
  • ( ) Components required for advanced users (Everything required for customization of recognition)
    • ( ) Training tools
    • ( ) Tools for manipulating language data
    • ( ) Documentation for those tools
    • ...
  • ( ) Components required for developers that use tesseract in their products
    • ( ) Public tesseract and leptonica headers.
    • ( ) Debug DLLs
    • ( ) Tesseract and leptonica stub libraries (libtesseract302.lib, liblept168.lib). Those used for linking to the DLLs.
    • ( ) API documentation / doxygen
    • ...
  • ( ) Components required for tesseract developers
    • ( ) Tesseract source
    • ( ) static libraries
    • ( ) VS 2008 tools
    • ...
What do you think? Did I forget something?

It looks like too many splits to me ;-) But my needs are different from yours ;-)
 
 
If I find some time, I will get my hands dirty and try to modify the template.
 
I would do it myself, but I have absolutely no idea how the windows installer is built. Maybe it would be good to add the installer templates to the repository, so that they are versioned too and can be modified by everyone.
 
Committed as r815.
 
Great! 
 
Some really minor nitpicks concerning the installer:
  • I find the behavior of the feature checkboxes in the installer really counterintuitive. Normally you get the description by clicking on a feature, not by hovering over it. But here clicking on the text toggles the selection.
  • The installer succeeds even if there are errors, and you really don't know what actually succeeded and what did not.
  • No repair feature, just uninstall and install.
  • Executable instead of MSI
All of those points are probably due to the use of NSIS, which I'm not so fond of.
However, I must acknowledge that you probably don't want to invest much time learning WiX. And I'm not a WiX expert either, so NSIS is probably the best solution currently.
 
The reason I mention these points in the first place is that, with tesseract getting better and easier to use, it will attract professional users (professional in the sense of money, not competence). A standard MSI just looks more professional than a custom installer. And you can use the scripting functionality of msiexec.
 
Tobi

Feel free to modify it (or bring something better). I just did it because there was nobody else ;-). I do not like MSI because software installed with MSI (usually?) requires the MSI file for uninstallation (maybe this is a problem of the packager, but I hate this behavior because I fight for free disk space).

Here are some comments/explanations that may shed some light on the current behavior:
  • I wanted to use something other than NSIS, but I came back to NSIS for the 3.02 release ;-). My criteria for the installer tool:
    1. it is free software, so anybody can check/improve my work
    2. it should be able to download packages (e.g. so that the installer contains only the needed parts) through a proxy server with authorization
    3. it should be able to uninstall the software without needing the installer
    4. it should be able to use gzip and zip archives or run an external program
    5. it should be able to compress the installer efficiently (I love the lzma compression in NSIS)
  • I wanted to use official packages: I did not want to split the leptonica library and upload it to the tesseract-ocr project.
  • I wanted to include only "must have" files (from my point of view); other files should be available for download.
  • I didn't want to create a lot of packages (for downloading).
I can only test it on Windows XP installations with power-user rights, so other combinations could cause unexpected behavior.

Zdenko

troplin

Jan 7, 2013, 3:54:04 AM
to tesser...@googlegroups.com

On Sunday, January 6, 2013 at 19:42:28 UTC+1, Zdenko Podobný wrote:
On Thu, Jan 3, 2013 at 9:35 AM, troplin <tro...@gmail.com> wrote:
On Thursday, December 27, 2012 at 20:56:35 UTC+1, Zdenko Podobný wrote:
On Thu, Dec 27, 2012 at 12:02 PM, troplin <tro...@gmail.com> wrote:
Sorry to be annoying but is there any chance that we could include the Windows DLL in the installer? Maybe in the next update? Or a minor patch to 3.02?
 
Can you please be more specific about what you mean by "include the Windows DLL in the installer"? Is there any problem with installing the tesseract libraries with the installer?
 
Oh, I didn't notice that.
Previously, the answer seemed to be 'no', so I didn't even try.
But actually it doesn't work, I'm getting an error 404 when it tries to download the libtesseract library. liblept works however.
 
Also, when installing all language data, I'm getting some additional errors (some 404 and sometimes it tries to overwrite existing files).
 
Also I think the installer is mainly designed for developers instead of users. I would group the contents in the following way:
 
(x) Install by default
(-) Mixed content
( ) Don't install by default
  • (-) Components required for normal use
    • (x) Executable
    • (x) Release-DLLs without headers. (libtesseract302.dll, liblept168.dll) Only one bullet point, since libtesseract does not work without liblept
I am not sure this is needed by default: tesseract is linked statically (because of the training programs; search the archive for the reason), so a common user does not need it.
 
In my opinion it is even more important than including the tesseract.exe executable.
An OCR engine by itself is certainly not bad if you don't use it very frequently. But it only gets really useful if it is integrated into a bigger context.
For example a nice GUI, an automatic fax inbox process, a batch scanning process etc. Most of the time, using the command-line tool for these applications is not the first choice, and the API is much more flexible. You can use the 'native' OCR format of the whole process and don't have to parse and convert the hOCR format. You don't pay the initialization cost for every document, you can batch process multiple pages etc.
 
Of course, as the developer of such a tool, I install the developer tools, but as a user, it feels a bit strange if I have to install developer components just to run the program.
Completely including tesseract in the application is also not a solution, because the installation (possibly) includes so many files (language files). This is also the point where OCR engines are a bit different from "conventional" libraries. The installation is big and is ideally shared between tools.
 
My intent is to make it as easy as possible for users to install tesseract, regardless of whether they use it directly or via another tool.
 
Just as a sidenote, we also provide integration for the ABBYY engine (probably the market leader in OCR solutions) and they also install the DLLs by default.
    • (-) Language data
      • (x) English
      • (x) OSD
      • ( ) Other languages
    • (x) Basic documentation for users
I forgot to include the manual pages (HTML files in the doc directory). Maybe downloading the PDF documentation files (from the svn repository) could be another option...
 
Personally, I would include the most important parts (everything that is selected by default) directly into the installer. Documentation is probably not that big anyway.
In any case, I would not load anything from svn directly. Better to have a bigger but more reliable installer that doesn't need a network connection for the basic stuff.
  • ( ) Components required for advanced users (Everything required for customization of recognition)
    • ( ) Training tools
    • ( ) Tools for manipulating language data
    • ( ) Documentation for those tools
    • ...
  • ( ) Components required for developers that use tesseract in their products
    • ( ) Public tesseract and leptonica headers.
    • ( ) Debug DLLs
    • ( ) Tesseract and leptonica stub libraries (libtesseract302.lib, liblept168.lib). Those are used for linking to the DLLs.
    • ( ) API documentation / doxygen
    • ...
  • ( ) Components required for tesseract developers
    • ( ) Tesseract source
    • ( ) static libraries
    • ( ) VS 2008 tools
    • ...
What do you think? Did I forget something?
 
It looks like too many splits for me ;-) But my needs are different from yours ;-)
 
Well, I'm just trying to make the installer useful for the (in my opinion) most frequent use cases.
Important points for me are:
  • It has to look trustworthy enough that we can recommend it to our customers.
  • It has to include the most important components by default (most important for the users, not developers).
  • It has to be reliable and reproducible
    • The most important parts have to be included, not downloaded.
    • And if something is downloaded, the source has to be "constant", not changing like SVN.
    • No 404 errors
I expect that a developer can handle some inconveniences without problems. But users are different. Everything that does not work with a next->next->next->next->finish install will generate support cases for us. 
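
One way to approach the "constant source" point for anything that does get downloaded is to pin each package to a checksum recorded when the installer is built, and to refuse the file if it does not match. The following is only a minimal Python sketch of that idea, not how the current NSIS installer works; the function names are made up for illustration.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a downloaded payload."""
    return hashlib.sha256(data).hexdigest()


def verify_package(data: bytes, expected_sha256: str) -> bool:
    """Accept the payload only if it matches the digest pinned at build time.

    `expected_sha256` is assumed to be stored inside the installer itself,
    so a changed or replaced file on the server is detected instead of
    being installed silently.
    """
    return sha256_of(data) == expected_sha256.strip().lower()


# Illustration with an in-memory payload instead of a real download.
payload = b"pretend this is a language data archive"
pinned = sha256_of(payload)             # recorded when the installer is built
print(verify_package(payload, pinned))  # the unchanged payload is accepted
```

A mismatch then becomes a clear error message for the user instead of a half-working installation.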
 
If I find some time, I will get my hands dirty and try to modify the template.
 
I would do it myself, but I have absolutely no idea how the windows installer is built. Maybe it would be good to add the installer templates to the repository, so that they are versioned too and can be modified by everyone.
 
Committed as r815.
 
Great! 
 
Some really minor nitpicks concerning the installer:
  • I find the behavior of the feature checkboxes in the installer really counterintuitive. Normally you get the description by clicking on a feature, not by hovering over it. But here clicking on the text toggles the selection.
  • The installer succeeds even if there are errors, and you really don't know what actually succeeded and what did not.
  • No repair feature, just uninstall and install.
  • Executable instead of MSI
All of those points are probably due to the use of NSIS, which I'm not so fond of.
However, I must acknowledge that you probably don't want to invest much time learning WiX. And I'm not a WiX expert either, so NSIS is probably the best solution currently.
 
The reason I mention these points at all is that, as tesseract gets better and easier to use, it will attract professional users (professional in the sense of money, not competence). A standard MSI just looks more professional than a custom installer. And you can use the scripting functionality of msiexec.
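
To illustrate the scripting point, here is a minimal Python sketch that assembles an unattended msiexec call. The package name `tesseract-ocr.msi` and the log file name are hypothetical (no such MSI exists at this point); `/i`, `/qn`, `/norestart` and `/l*v` are standard msiexec options for install, no UI, no reboot and verbose logging.

```python
def msiexec_install_command(msi_path, log_path=None):
    """Build the argument list for a silent, unattended MSI installation."""
    # /i = install, /qn = no user interface, /norestart = never reboot
    args = ["msiexec", "/i", msi_path, "/qn", "/norestart"]
    if log_path is not None:
        # /l*v = write a verbose log of everything the installer does
        args += ["/l*v", log_path]
    return args


# Hypothetical package name, for illustration only.
command = msiexec_install_command("tesseract-ocr.msi", "install.log")
print(" ".join(command))
```

On a Windows machine the resulting list could be handed to `subprocess.run`, which is what makes rollouts by an administrator or a deployment tool straightforward.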
 
Tobi

Feel free to modify it (or bring something better). I just did it because there was nobody else ;-). I do not like MSI because (usually?) software installed with MSI requires the MSI file for uninstall (maybe this is a problem of the packager, but I hate this behavior because I fight for free space).
 
Don't get me wrong, I really appreciate your efforts!
 
Our own MSIs don't have to be present for uninstalling our software. However, we use InstallShield for creating the package, and that's not an option for tesseract because it's not free. I don't know how InstallShield makes it work.
 
I think the recommended way from Microsoft is WiX. (By the way, WiX is the first open source project ever from Microsoft.) However, I heard that WiX is a bit more complex to understand.
 
Some of my coworkers use WiX for their installers. I will probably look into that at some point, and if it's easy I'll give it a try.
 
Here are some comments/explanations that may shed some light on the current behavior:
  • I wanted to use something other than NSIS, but I came back to NSIS for the 3.02 release ;-). My criteria for the installer:
    1. it should be free software, so anybody can check/improve my work
    2. it should be able to download packages (e.g. the installer contains only the needed parts) through a proxy server with authorization
Personally, I'm not so fond of that. I would prefer a bigger but more reliable installer.
    3. it should be able to uninstall the software without needing the installer
    4. it should be able to use gzip and zip archives or run an external program
    5. it should be able to compress the installer efficiently (I love lzma compression in NSIS)
  • I wanted to use official packages: I did not want to split the leptonica library and upload it to the tesseract-ocr project.
  • I wanted to include only "must have files" (from my point of view); other files should be available for download
  • I did not want to create a lot of packages (for downloading).
I agree on most points, and I also think the decision for NSIS is currently the right one.
 
I can test it only on Windows XP installations with power-user rights, so other combinations could cause unexpected behavior.
 
I have some more platforms where I can test the installer, so that should not be a problem.
 
Tobi

TP

Jan 7, 2013, 6:33:13 AM
to tesser...@googlegroups.com
On Mon, Jan 7, 2013 at 12:54 AM, troplin <tro...@gmail.com> wrote:
> Just as a sidenote, we also provide integration for the ABBYY engine
> (probably the market leader in OCR solutions) and they also install the DLLs
> by default.

Just curious. Are you saying it's possible to use the ABBYY FineReader
engine just by having your customers install FineReader? I thought a
developer had to buy the FineReader OCR SDK [1] (presumably very
expensive) to do something like that.

[1] http://www.abbyy.com/ocr_sdk/

troplin

Jan 8, 2013, 2:38:13 AM
to tesser...@googlegroups.com
That's exactly the point. We buy the expensive SDK (Software Development Kit) and integrate ABBYY in our software. The SDK is just for the development.
Our customers however don't have to develop anything directly with the ABBYY software, they just have to buy a normal runtime license.
 
That's why I defined 3 target groups:
  1. Developers (of the OCR engine) need the source code.
  2. Integrators (like me) need the SDK.
  3. Users (like our customers) need the RTK (runtime kit).
They all have different needs, and I consider (3) the most important group when talking about the installer. The users are also the least experienced group, so the installer has to be easy to use.
 
PS: Actually, now that you mention it, I'm not so sure anymore exactly which installer we provide to our customers. It may be a special installer just for that use case, but this doesn't change the general idea.
 
Tobi