Hosting pre-compiled Tesseract libraries for different platforms


Max Pole

Jul 24, 2013, 11:40:41 AM
to tesser...@googlegroups.com
Hello crews,

our open-source optical music recognition software "Audiveris" (https://audiveris.kenai.com/) utilizes Tesseract OCR for processing textual glyphs. Because Audiveris itself is written in Java I created a JNI wrapper that makes calls to the Tesseract dll.

We're currently trying to package Audiveris for Windows, Linux and Mac OS X. This requires us to provide several binary packages of Tesseract covering different OSes and architectures. I'm about to publish the source code of my JNI wrapper, which requires pre-compiled Tesseract DLLs. Without appropriate binary packages, this JNI wrapper is of little use.

I've recently noticed that the Tesseract project already provides a binary package for 32-bit Windows (tesseract-3.02.02-win32-lib-include-dirs.zip).

Is it conceivable to host binary packages for some other popular OSes and architectures as well? In particular, it's rather difficult to compile Tesseract on Windows and Mac OS X yourself because:

1) one needs to obtain non-free development tools

2) compilation and installation on these OSes is much more complicated than doing "configure && make" on Unix-based systems


Two examples:

1a) Building Tesseract for Windows x64 requires the FULL version of M$ Visual Studio because the Express edition cannot build 64-bit binaries. Moreover, compiling libtiff requires access to the 64-bit compiler from the command line, which isn't available in the free version by default.

1b) Currently, the only way to get Tesseract shared libraries on OS X is to compile them against the Unix dependencies provided by Homebrew, MacPorts etc. Deploying binary libraries created that way requires installing a host of additional non-native components, some of which duplicate Mac OS X's own functionality. Moreover, the native OS X way of distributing shared components via so-called "frameworks" cannot be achieved using the current Tesseract build scripts.

One needs to set up a special project for building Tesseract within the OS X development environment (Xcode).

Someone has asked whether the 64-bit Windows version is required at all. Why not simply use the 32-bit one? One big reason for having the 64-bit version as well is that a 64-bit Java VM cannot load 32-bit DLLs via JNI, and vice versa.

I've managed to build Tesseract on both Windows and OS X (all architectures). I'd love to provide the Tesseract project with appropriate binaries and project files.

I would like to see the following binary packages for Tesseract available in the "Download" area:
- Windows x86 and x64 with runtime dependencies
- Mac OS X x64
- developer packages containing VC and XCode project files and detailed instructions on how to build Tesseract and all its dependencies

Best regards
Max Poliakovski

Nick White

Jul 24, 2013, 12:04:44 PM
to tesser...@googlegroups.com
Hi Max,

Thanks for writing this, your JNI wrapper sounds like a useful thing
indeed, and I'm glad Tesseract is proving useful for your software.

Regarding 64 bit Windows builds, can mingw do the job? I know it can
be used to compile Tesseract on Windows, but I don't know mingw
well enough to know if it's easy to use it to build 64 bit versions
(it ought to be).

As for the OS X issue, we would surely welcome an Xcode project file
(or whatever they use), to help people who want to build it that
way. The issue is purely that nobody has stepped up to do it.

I doubt there would be any resistance to having more pre-compiled
shared libraries on the download page, the issue is only that we
lack the people with enough interest and knowledge in these
platforms contributing. And compiling and testing new binaries in
the future.

In the short term, is there any barrier to distributing the shared
libraries for OS X created by homebrew? Is it only that there's
duplicated functionality (and extra library files needed) due to it
using unix dependencies? Or would the only reasonable way to do it
be using xcode so that whatever this 'framework' distribution method
is works properly?

We should just solve this problem once and for all by just forcing
everyone to use Debian ;p

Nick

Max Pole

Jul 24, 2013, 12:48:13 PM
to tesser...@googlegroups.com
Thank you for your reply, Nick!




> Regarding 64 bit Windows builds, can mingw do the job? I know it can
> be used to compile Tesseract on Windows, but I don't know mingw
> well enough to know if it's easy to use it to build 64 bit versions
> (it ought to be).

I tried to compile Tesseract under MinGW a year ago and failed. Despite some heavy patching to get things to build at all, I wasn't able to produce a working shared library. But I could give it another try...


> I doubt there would be any resistance to having more pre-compiled
> shared libraries on the download page, the issue is only that we
> lack the people with enough interest and knowledge in these
> platforms contributing. And compiling and testing new binaries in
> the future.

I would do this job...
 

> In the short term, is there any barrier to distributing the shared
> libraries for OS X created by homebrew? Is it only that there's
> duplicated functionality (and extra library files needed) due to it
> using unix dependencies? Or would the only reasonable way to do it
> be using xcode so that whatever this 'framework' distribution method
> is works properly?

OS X frameworks offer several advantages over stand-alone libraries:
- linking is much simpler because frameworks usually keep code, headers and resources in one place => framework bundle
- installation and removal are much simpler

Frameworks can be installed and used very flexibly. Global frameworks are accessible to all applications. A framework can also be easily bundled with an application (a so-called private framework), so installing and removing such an application is simply a matter of drag & drop...

With stand-alone libraries one needs to create and distribute stand-alone installers. BTW, packaging Tesseract compiled with MacPorts produces an installer of 80 MB on my system. A little too much, isn't it?
 

> We should just solve this problem once and for all by just forcing
> everyone to use Debian ;p


Great idea indeed!

Best regards
Max

Nick White

Jul 24, 2013, 1:04:42 PM
to tesser...@googlegroups.com
On Wed, Jul 24, 2013 at 09:48:13AM -0700, Max Pole wrote:
> I've tried to compile Tesseract under MingW one year ago and failed. Despite
> some heavy patching to get things built at all I wasn't able to produce any
> working shared library. But I could give it another try...

Zdenko compiled some instructions that ought to work:
http://www.sk-spell.sk.cx/compiling-leptonica-and-tesseract-ocr-with-mingwmsys
(they're linked to from the "Compiling" wiki page now, thankfully)

> I doubt there would be any resistance to having more pre-compiled
> shared libraries on the download page, the issue is only that we
> lack the people with enough interest and knowledge in these
> platforms contributing. And compiling and testing new binaries in
> the future.
>
> I would do this job...

I hoped you might say that :) We'll have to wait for someone with
appropriate admin rights to say OK.

> In the short term, is there any barrier to distributing the shared
> libraries for OS X created by homebrew? Is it only that there's
> duplicated functionality (and extra library files needed) due to it
> using unix dependencies? Or would the only reasonable way to do it
> be using xcode so that whatever this 'framework' distribution method
> is works properly?
>
> OS X frameworks offer several advantages over stand-alone libraries:
> - linking is much simpler because frameworks usually keep code, headers and
> resources in one place => framework bundle
> - installation and removal are much simpler
>
> Frameworks can be installed and used very flexibly. Global frameworks will be
> accessible for all applications. A framework can be easily bundled with
> applications (so-called private framework) so the installation and removal of
> such an application is simply a matter of drag&drop...
>
> In the case of stand-alone libraries one needs to create and distribute
> stand-alone installers. BTW, packaging tesseract compiled with Macports
> produces an installer with the size of 80MB on my system. A little too much,
> isn't it?

80MB is rather huge, indeed! It sounds like the frameworks system is
the way to go then, so long as someone can maintain it.

If we (by which I mean you) got it building in XCode, would that
also make it easy for us to release a .app bundle thing for
Tesseract as well? That would be excellent.

Nick

Jimmy O'Regan

Jul 24, 2013, 1:17:54 PM
to tesser...@googlegroups.com
On 24 July 2013 18:04, Nick White <nick....@durham.ac.uk> wrote:
> On Wed, Jul 24, 2013 at 09:48:13AM -0700, Max Pole wrote:
>> I've tried to compile Tesseract under MingW one year ago and failed. Despite
>> some heavy patching to get things built at all I wasn't able to produce any
>> working shared library. But I could give it another try...
>
> Zdenko compiled some instructions that ought to work:
> http://www.sk-spell.sk.cx/compiling-leptonica-and-tesseract-ocr-with-mingwmsys
> (they're linked to from the "Compiling" wiki page now, thankfully)
>
>> I doubt there would be any resistance to having more pre-compiled
>> shared libraries on the download page, the issue is only that we
>> lack the people with enough interest and knowledge in these
>> platforms contributing. And compiling and testing new binaries in
>> the future.
>>
>> I would do this job...
>
> I hoped you might say that :) We'll have to wait for someone with
> appropriate admin rights to say OK.
>

Ok. Like you said, it's more about not having people to do the
compiling than about not wanting to distribute binaries.

Please note, though, that after January the downloads section is going
away: http://thenextweb.com/google/2013/05/22/google-codes-download-option-deprecated-due-to-misuse-only-existing-project-downloads-to-be-kept-after-january-15/

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

Max Pole

Jul 24, 2013, 1:51:47 PM
to tesser...@googlegroups.com
Thank you for pointing that out! Anyway, we have some time for putting binaries into the old download section...

Best regards
Max

Tom Morris

Jul 24, 2013, 1:53:50 PM
to tesser...@googlegroups.com
On Wed, Jul 24, 2013 at 1:17 PM, Jimmy O'Regan <jor...@gmail.com> wrote:
Interestingly Github just brought theirs back after killing off the previous incarnation 6 months ago.


Tom 

zdenko podobny

Jul 24, 2013, 3:43:57 PM
to tesser...@googlegroups.com
If I got it right, it is for releasing source code and not binaries, so it will not help with distributing e.g. Windows binaries.

Zdenko

Tom Morris

Jul 24, 2013, 3:55:47 PM
to tesser...@googlegroups.com
You can add binaries to the releases.  I just finished moving mine from Google Code to Github.

Scroll to the bottom of this page to see how they're presented:

In the editing pane for the release notes, there's a drag-and-drop area that you can just drop the binaries on to upload them.

Tom




zdenko podobny

Jul 24, 2013, 4:15:03 PM
to tesser...@googlegroups.com
It looks like, from a long-term point of view, we should look for another option. Google suggests using their Drive[1]. For the moment I think it makes sense to create "sub-projects" for binary distribution and provide links on the wiki... IMO the language data files could also be moved to another repository...


zdenko podobny

Jul 24, 2013, 4:18:47 PM
to tesser...@googlegroups.com
On Wed, Jul 24, 2013 at 9:55 PM, Tom Morris <tfmo...@gmail.com> wrote:
> You can add binaries to the releases.  I just finished moving mine from Google Code to Github.
>
> Scroll to the bottom of this page to see how they're presented:
>
> In the editing pane for the release notes, there's a drag-and-drop area that you can just drop the binaries on to upload them.
>
> Tom


Great ;-) A few days ago I started to move some things to sourceforge.net...

Zdenko

Nick White

Aug 30, 2013, 7:44:17 AM
to tesser...@googlegroups.com
Hi Max,

Have you made any progress on getting Tesseract working in XCode as
a framework? I'd half forgotten about this, then was explaining to
someone how to install Tesseract on OS X, and realised how nice it
would be to have a more natural solution than Homebrew :)

Thanks, and I hope you're well,

Nick

Max Pole

Sep 26, 2013, 5:50:04 AM
to tesser...@googlegroups.com


On Friday, 30 August 2013 13:44:17 UTC+2, Nick White wrote:


> Have you made any progress on getting Tesseract working in XCode as
> a framework? I'd half forgotten about this, then was explaining to
> someone how to install Tesseract on OS X, and realised how nice it
> would be to have a more natural solution than Homebrew :)

Sorry for answering you so late!

Yes, Tesseract is working on OS X as a native framework. I have set up a quick Xcode project for compiling it this way.
I'm not sure whether anyone will be able to use it as easily as under Linux/Windows, because the way Tesseract accesses its data files is too Unix/Windows-like: it mixes read-only data (language files) and writable data (settings) together in one folder.

Ideally, the language data should be kept within the framework as read-only resources. They could be easily accessed and updated this way so there is no need for TESSDATA_PREFIX and environment variables.
On the other hand, settings should be kept in a writable location - OS X offers several ways for storing and managing application settings.

All this would require platform-specific changes of the Tesseract code but it's doable. The question is if we want to do that...

Best regards
Maxim

Nick White

Sep 26, 2013, 7:47:50 AM
to tesser...@googlegroups.com
> Sorry for answering you so late!

No problem, I'm just happy you're answering :) My replies are below.

> Yes, Tesseract is working on OS X as native framework. I have set up a quick
> XCode project for compiling it this way.
> I'm not sure whether anyone will be able to use it as easily as under Linux/Windows
> because the way Tesseract data files are being accessed is too much
> Unix-Windows-like - it mixes read-only (language files) and writable data
> (settings) together in one folder.
>
> Ideally, the language data should be kept within the framework as read-only
> resources. They could be easily accessed and updated this way so there is no
> need for TESSDATA_PREFIX and environment variables.
> On the other hand, settings should be kept in a writable location - OS X offers
> several ways for storing and managing application settings.

What writable settings files are you referring to? Tesseract doesn't
have any settings files as such.

> All this would require platform-specific changes of the Tesseract code but it's
> doable. The question is if we want to do that...

I doubt there would be any opposition, but I don't understand which
files you would want to separate out and make writable.

Another question, how do you envision users adding extra language
trainings to their systems? I don't know how OS X prefers to handle
such things.

Thanks, and I'm very happy you're still thinking about this!

Nick

Max Pole

Sep 26, 2013, 9:02:29 AM
to tesser...@googlegroups.com


On Thursday, 26 September 2013 13:47:50 UTC+2, Nick White wrote:
>> Sorry for answering you so late!
>
> No problem, I'm just happy you're answering :) My replies are below.
>
>> Yes, Tesseract is working on OS X as native framework. I have set up a quick
>> XCode project for compiling it this way.
>> I'm not sure whether anyone will be able to use it as easily as under Linux/Windows
>> because the way Tesseract data files are being accessed is too much
>> Unix-Windows-like - it mixes read-only (language files) and writable data
>> (settings) together in one folder.
>>
>> Ideally, the language data should be kept within the framework as read-only
>> resources. They could be easily accessed and updated this way so there is no
>> need for TESSDATA_PREFIX and environment variables.
>> On the other hand, settings should be kept in a writable location - OS X offers
>> several ways for storing and managing application settings.
>
> What writable settings files are you referring to? Tesseract doesn't
> have any settings files as such.


I'm referring to configuration files located in tessdata/configs. Are those read-only? No problem - this would even simplify the situation :))

 

>> All this would require platform-specific changes of the Tesseract code but it's
>> doable. The question is if we want to do that...
>
> I doubt there would be any opposition, but I don't understand which
> files you would want to separate out and make writable.
>
> Another question, how do you envision users adding extra language
> trainings to their systems? I don't know how OS X prefers to handle
> such things.


The OS X framework is just a special-purpose folder (called "bundle" in the OS X terminology) encapsulating executable code and data organized as resources. The code can easily access the encapsulated resources via a special-purpose API of the OS.
These bundles can be easily moved, copied etc.

Users can add language files by copying them into the framework bundle. That's all.

The only change required is to replace the Tesseract code responsible for locating the tessdata folder with a few simple OS X calls that access the language files (and perhaps the configs) inside the bundle. This should of course be done only for OS X targets and can be achieved by adding some #ifdefs...
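
As a rough illustration of that #ifdef approach (a sketch only, not code from Tesseract or from the Xcode project): on OS X the CoreFoundation bundle API could be asked for the framework's Resources directory, while every other platform keeps falling back to TESSDATA_PREFIX. The function name LocateTessdataParent and the bundle identifier org.example.Tesseract are made up for this example; a real framework would define its own identifier.

```cpp
#include <cstdlib>
#include <string>

#ifdef __APPLE__
#include <CoreFoundation/CoreFoundation.h>
#include <limits.h>
#endif

// Hypothetical helper (not part of Tesseract): returns the directory that is
// expected to contain "tessdata", or an empty string if none can be found.
std::string LocateTessdataParent() {
#ifdef __APPLE__
  // Assumed bundle identifier, chosen only for this illustration.
  CFBundleRef bundle =
      CFBundleGetBundleWithIdentifier(CFSTR("org.example.Tesseract"));
  if (bundle != nullptr) {
    CFURLRef url = CFBundleCopyResourcesDirectoryURL(bundle);
    if (url != nullptr) {
      UInt8 path[PATH_MAX];
      Boolean ok =
          CFURLGetFileSystemRepresentation(url, true, path, sizeof(path));
      CFRelease(url);  // "Copy" functions return an owned reference.
      if (ok) {
        return std::string(reinterpret_cast<const char*>(path)) + "/";
      }
    }
  }
#endif
  // All other platforms, and OS X when no bundle is found: use the
  // TESSDATA_PREFIX environment variable.
  const char* prefix = std::getenv("TESSDATA_PREFIX");
  return prefix != nullptr ? std::string(prefix) : std::string();
}
```

With something along these lines, only the lookup of the tessdata location would differ per platform; the code that actually reads the language files could stay unchanged.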

Best regards
Maxim

Nick White

Sep 26, 2013, 9:41:52 AM
to tesser...@googlegroups.com
On Thu, Sep 26, 2013 at 06:02:29AM -0700, Max Pole wrote:
> I'm referring to configuration files located in tessdata/configs. Are those
> read-only? No problem - this would even simplify the situation :))

Aah, I forgot about those :p Oops.

Well arguably now that it is possible to pass multiple '-c configvar=value'
type arguments directly to Tesseract the ability to modify or add to
those configuration files is less important. So I'd be inclined to
treat them as read-only, yes.

> Users can add language files by copying them into the framework bundle. That's
> all.

So does that mean each language file can literally be dragged onto
the Tesseract bundle icon, and (once you've added a little magic
glue code) it will then be copied to the appropriate place inside
the bundle, and then just work? That would be awesome :)

> The only change required is to replace the Tesseract code responsible for
> locating tessdata folder with several simple OSX calls in order to get access
> to the language files (and perhaps the configs) inside the bundle. This should
> be surely done only for OS X targets and can be achieved by adding some
> #ifdefs...

That sounds totally reasonable and good to me.

Nick

Max Pole

Sep 26, 2013, 10:25:47 AM
to tesser...@googlegroups.com

[...]


>> Users can add language files by copying them into the framework bundle. That's
>> all.
>
> So does that mean each language file can literally be dragged onto
> the Tesseract bundle icon, and (once you've added a little magic
> glue code) it will then be copied to the appropriate place inside
> the bundle, and then just work? That would be awesome :)

Yes, that would be awesome indeed!

In reality, one just needs to drag a language file into one of the dedicated subfolders of the bundle, i.e. Tesseract.framework/Contents/Resources/
There is no need for "magic" glue code...

Best regards
Maxim

Berkeley Malagon

Nov 10, 2013, 6:40:58 PM
to tesser...@googlegroups.com
Hi Max,

Thanks for doing this helpful work. Is there any update? I'm eager to download your framework to use in my own projects.

Let me know if I can help at all (testing, etc).

Berkeley

Nick White

Apr 16, 2014, 10:16:03 PM
to tesser...@googlegroups.com
Hi Max,

It would be awesome if we could get your OS X app bundle thing
working for the 3.03 release soon. Do you have any update on it? Or
can you share the work you've done so far, with a description of
what's left to do?

That'd be excellent, as we're rather lacking in OS X experts here ;)

Thanks,

Nick

Erik Hejl

Apr 30, 2014, 9:22:30 PM
to tesser...@googlegroups.com
Add me to the list of people interested in OS X support.  I've been trying to use the Tesseract API with Mono which leaves me needing 32 bit versions of the dylibs.  Have you been able to get 32 bit builds to work on OS X?  I've been able to build on my mac using homebrew to get the Unix dependencies, but so far I've only been able to get them in 64 bit form.  The app bundle idea you've pitched sounds good too, but the dylibs would be handier for Mono since they parallel SO and DLL's for Linux and Windows.

Max Pole

May 5, 2014, 8:52:21 AM
to tesser...@googlegroups.com
Hello Nick,


> Hi Max,
>
> It would be awesome if we could get your OS X app bundle thing
> working for the 3.03 release soon. Do you have any update on it?

Yes, I've recently tried to compile the latest trunk in XCode, unfortunately without success. I've just submitted my problem to the issue tracker: issue 1150
3.02 compiles and works as expected, so there is a chance to get 3.03 working as well. It may require several patches though...

I'm concentrating on getting the OS X framework (i.e. the OS X shared library) working first. This would make Tesseract available to third-party projects via TessBaseAPI.
The app bundle (i.e. the stand-alone recognition application) would come next.

I still have a few open questions:

* who is the copyright holder of tesseract? I need to know that in order to place the right copyright string in the XCode project.
* is that "graphical" stuff (scrollview etc.) needed for TessBaseAPI? Is it safe to just disable it with "--disable-graphics"?

Thank you in advance for answering my questions.

Best regards
Max

Max Pole

May 5, 2014, 9:03:31 AM
to tesser...@googlegroups.com
Hi Erik,


> Add me to the list of people interested in OS X support.  I've been trying to use the Tesseract API with Mono which leaves me needing 32 bit versions of the dylibs.

Does Mono on Mac only accept 32 bit libraries? It's ridiculous!

 
> Have you been able to get 32 bit builds to work on OS X?

I just tried changing the architecture of the target from "x86_64" to "i386" and recompiling the whole project. It succeeded. I have no idea whether it really works...

I could share my XCode projects for Tesseract 3.02 with you, so you can conduct your own experiments. The recent version doesn't build though (see my previous answer). Please contact me at maximumspatium at googlemail dot com.

Best regards
Max

Nick White

May 5, 2014, 9:39:06 AM
to tesser...@googlegroups.com
Hi Max,

It's great to hear updates, I'm so pleased you're continuing to work
on this!

On Mon, May 05, 2014 at 05:52:21AM -0700, Max Pole wrote:
> Yes, I've recently tried to compile the latest trunk in XCode, unfortunately
> without success. I've just submitted my problem to the issue tracker: issue
> 1150
> 3.02 compiles and works as expected, so there is a chance to get 3.03 working
> as well. It may require several patches though...

Ah, fmemopen, thanks for the prod about that. I know Ray likes the
idea of changing that code, so an extra reason for doing so will be
well received ;)

> I'm concentrating on getting OS X framework aka OS X shared library working
> first. This would make Tesseract available for several 3rd party projects via
> TessBaseAPI.
> The app bundle aka stand-alone recognition application would come next.

Sounds good to me.

> I still have a few open questions:
>
> * who is the copyright holder of tesseract? I need to know that in order to
> place the right copyright string in the XCode project.

Free software projects generally just put something like "The
Tesseract team" in copyright strings like that. Ray Smith is the
lead developer, but the copyright is held by the contributors, which
is quite a few people (see AUTHORS in SVN, plus there are patches
that have gone through the issue tracker, and from Zdenko, that
aren't reflected in that file). So unless anyone here objects, go
for "the Tesseract team".

> * is that "graphical" stuff (scrollview etc.) needed for TessBaseAPI? Is it
> safe to just disable it with "--disable-graphics"?

It should be safe to disable it, yes. That's just the scrollview
stuff.

Nick

Zdenko Podobný

May 5, 2014, 4:17:16 PM
to tesser...@googlegroups.com
On Monday, 5 May 2014 14:52:21 UTC+2, Max Pole wrote:
> Hello Nick,
>
>> Hi Max,
>>
>> It would be awesome if we could get your OS X app bundle thing
>> working for the 3.03 release soon. Do you have any update on it?
>
> Yes, I've recently tried to compile the latest trunk in XCode, unfortunately without success. I've just submitted my problem to the issue tracker: issue 1150
> 3.02 compiles and works as expected, so there is a chance to get 3.03 working as well. It may require several patches though...

imagedata is not used in Tesseract yet, so those files can be skipped (just remove them from ccstruct/Makefile.am and run ./autogen.sh). I am not sure whether the code will be used in 3.03 (Debian/Ubuntu have already shipped some version of 3.03.02 ;-) ) or in 3.04/4.00. Maybe Ray can shed more light on this.

> I'm concentrating on getting OS X framework aka OS X shared library working first. This would make Tesseract available for several 3rd party projects via TessBaseAPI.
> The app bundle aka stand-alone recognition application would come next.
>
> I still have a few open questions:
>
> * who is the copyright holder of tesseract? I need to know that in order to place the right copyright string in the XCode project.
IMO the current copyright holder is Google Inc. The copyright holder is named in each file header (see e.g. [1]). Older code names "Hewlett-Packard Ltd." as the holder.

 
> * is that "graphical" stuff (scrollview etc.) needed for TessBaseAPI? Is it safe to just disable it with "--disable-graphics"?
Yes and no ;-). "Ordinary users" may not need it. But if somebody would like to see what is happening inside Tesseract ([2], [3]), they cannot do so without scrollview. So if this library will be used only by you/your project, it is safe. If you would like to offer it to other projects/developers, I would say no. Personally, I don't like it when a packager decides that I don't need some feature...

Ray Smith

May 18, 2014, 6:02:32 PM
to tesser...@googlegroups.com
I think Zdenko and Nick have already covered most of the questions, but here are a few confirmations:

Copyright is indicated by the comments at the top of the files, but others have contributed as well, so maybe for generality, you should use "Hewlett Packard, Google, and other members of the Tesseract Team".

I can confirm that imagedata.cpp (and its accompanying fmemopen) will NOT be used by 3.03. ImageData will be used by the next version, but by then I expect to have removed the use of fmemopen.

I agree with Zdenko on the graphics stuff. Most ordinary users won't need it, but it would be a shame to do without it.
I thought it compiles on Mac since there are some #ifdef __apple__ in there, or is that just for a newer Mac OS?

I have some more changes to submit this week, and I haven't done the retraining tests yet.
There is also a lot of training data still coming.

