Ray needs a hand from developers/users: vote on Leptonica usage, create a Windows installer, and build a wiki for all Tesseract add-ons


Tien Dung

Nov 13, 2008, 7:28:15 PM
to tesser...@googlegroups.com
Hi all,

I am opening a new thread for Ray's answer to my last question about Tesseract 3.00. I asked Ray when 3.00 will be released and how users and developers like us can help make it happen sooner. He said that he needs our input on Leptonica usage, and that it would be nice if we could create a wiki page for all Tesseract add-ons and a Windows installer for Tesseract and its dependency libs (like libtiff) for Windows users.

From the mailing list, I see a lot of interest from people who want to use Tesseract as an OCR engine for their mother tongue. As I understand it, Tesseract is currently the best open source OCR engine. But it still needs HELP from the community to make it BETTER and EASIER to use.

We always want a BETTER and EASIER Tesseract in a SHORTER time.

So please give Ray a hand with the following issues. There are even more interesting projects for would-be contributors who want to improve Tesseract at http://code.google.com/p/tesseract-ocr/wiki/TesseractProjects.

Here are my answers to Ray's questions:

With or without Leptonica?


I vote for Leptonica because it gives (1) better performance, (2) simpler code, and especially (3) the ability to read many more image formats.

I see no reason not to use other open source libs. Doing so helps Tesseract concentrate on what it does BEST: being a FAST and ACCURATE template-matching OCR engine. Let's use as many image reading libs, image processing libs, and page analysis libs as you can, so you have time to improve the OCR engine and make it AS GOOD AS or even BETTER than current commercial OCR engines again. Tesseract did it in the past. Tesseract can do it again with a lot of love and help from the community.

For Windows users, we can create a Windows installer that includes all Tesseract dependency libs, just like the GTK installer we need when installing GIMP on Windows.

Areas where the developer/user community can help:
> A collation of all the add-ons (box editors, C#/.NET extensions, Java, apps built on top, etc.), added to the wiki, would be really helpful

Are any authors of those add-ons on the mailing list? Please help Ray collect useful information about your valuable add-ons. Users will benefit a lot from your help because they often ask each other questions such as: How do I train Tesseract for a new language? How can I use Tesseract with other programming languages? It also helps drive users to your add-on.

> A windows installer (see above) would be useful
Has anyone already wrapped Tesseract as a Windows installer?

Best regards,

Tien Dung


I am hoping to push 3.00 out in the first quarter next year.

One issue that I am concerned about (and it is why I intend to be sure 2.04 is solid and includes as many patches as possible) is whether or not to make 3.00 dependent on leptonica. For certain, some of the features will only be available if you have it, but it is likely that 3.00 will still build and run without it. At some point in the future, though, that may not be able to continue.

Here are some thoughts, and I would like to get input from the developer/user community on this issue:

For leptonica:
  • Some features will depend on it. To get best performance you will need it.
  • It could allow simplification of the code, and elimination of the old IMAGE class.
  • It will allow reading of many more image formats, which a lot of users have requested.
  • It might be easier if the default windows project files assume that you have leptonica. That would make it easier to build with it, and it would only be a case of downloading it.

Against making tesseract dependent on leptonica:
  • It will require several additional components: leptonica, libtiff, libjpeg, libpng, which would bloat the executable, and many (windows) users have refused to even download libtiff.
  • Installation and build support will take much more effort (mostly for windows). If somebody could write a windows installer for it (open source of course), then that would simplify installation a lot for the windows user-only community.

Areas where the developer/user community can help:
  • A collation of all the add-ons (box editors, C#/.NET extensions, Java, apps built on top, etc.), added to the wiki, would be really helpful
  • A windows installer (see above) would be useful

On Wed, Nov 12, 2008 at 7:50 PM, Tien Dung <dun...@gmail.com> wrote:
Hi Ray,

There are a lot of sweet features in the 3.00 release: thread safety, patches, better modularity and API ...

I would like to ask when the 3.00 version will be released.

And how can users and developers like us help to make it happen sooner? :)

Best regards,

Tien Dung



lab

Nov 13, 2008, 9:00:28 PM
to tesseract-ocr
I apologize for repeating this comment from the other thread, but I
think this is perhaps the proper place for it. I tend to disagree
with using leptonica.

Have you considered instead simplifying the image format all the way?
By this I mean dropping the tiff input and replacing it with a trivial
encoding such as the Netpbm formats (http://netpbm.sourceforge.net/),
which are extremely portable.

The advantage is simplicity: a reader or writer for the black and
white format takes 5-10 lines of code to write from scratch in just
about any language, so anybody could interface with tesseract easily.
There would be no build dependencies, and no special handling of many
different file formats in the tesseract code, so you can concentrate
on the OCR.
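
To make the "5-10 lines" claim concrete, here is a minimal sketch of a reader and writer for the plain-text PBM variant (magic number `P1`). The function names `write_pbm` and `read_pbm` are illustrative and not part of tesseract. The reader assumes whitespace-separated pixels and does not handle `#` comment lines or the binary `P4` variant, and the writer does not wrap long rows.

```python
def write_pbm(path, rows):
    """Write a list of rows of 0/1 ints as a plain PBM (P1) file; 1 means black."""
    height, width = len(rows), len(rows[0])
    with open(path, "w") as f:
        f.write(f"P1\n{width} {height}\n")
        for row in rows:
            f.write(" ".join(str(p) for p in row) + "\n")


def read_pbm(path):
    """Read a plain PBM (P1) file back into a list of rows of 0/1 ints."""
    with open(path) as f:
        tokens = f.read().split()
    if tokens[0] != "P1":
        raise ValueError("not a plain PBM file")
    width, height = int(tokens[1]), int(tokens[2])
    pixels = [int(t) for t in tokens[3:3 + width * height]]
    return [pixels[y * width:(y + 1) * width] for y in range(height)]
```

A round trip through `write_pbm` and `read_pbm` returns the same rows that were written.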

The disadvantage would be that scanners perhaps don't write the pbm
format natively, so users would likely need to convert their images at
some point. Also, pbm files tend to get big, but that's easy to fix
with compression. Many free compression libraries have wrappers for
file handling routines which make reading and writing compressed files
transparent.
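
As one illustration of such a wrapper, Python's standard `gzip` module provides `gzip.open`, which returns a file object that compresses on write and decompresses on read. The helper name `open_maybe_gzip` and the `.gz` suffix convention are assumptions made for this sketch.

```python
import gzip


def open_maybe_gzip(path, mode="rt"):
    """Open a file, transparently (de)compressing when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)
```

Code that reads or writes a pbm file through this helper does not need to know whether the file on disk is compressed.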

BTW, I've uploaded my box editor to the files area and added a comment
to the training instructions, but I am unclear what wiki page is
specifically intended for add-ons.

Julien Benoit

Nov 14, 2008, 3:25:42 AM
to tesser...@googlegroups.com
Hello,


> A windows installer (see above) would be useful
> Anyone who already wrap Tesseract as a Windows installer?


In addition to a Windows installer, I think an official simple GUI would be very useful for users with basic skills who do not know how to use a command-line application. This GUI could include a format converter to work around the problem of the uncompressed TIFF input, which seems to block some users when they try to use tesseract.

It could also output other formats (PDF, RTF, ...)

I've written a small GUI in C# using tessnet2 (the .NET port of tesseract). It has some code to output PDF files based on the coordinates of the words found by tesseract and the input image. The code generating the PDF could be converted to C++.

In my opinion, a good official GUI and a Windows installer would greatly increase the popularity of tesseract, and this could be an area where developers can help Ray.

--
Julien Benoit

rthomas

Nov 14, 2008, 4:40:14 AM
to tesseract-ocr
Hi,

What is OCR? You have a two-level (1 bit) image and you try to get text from it.

From my point of view, an OCR engine doesn't need an image library or
an image processing library.
Keep the code simple, and let developers bundle it with the image lib
they like/want to use (open source (libtiff), OS included
(gdiplus.dll) or commercial (LeadTools, Accusoft...)).
The only image processing you can include is thresholding from a 24 bit
image.
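
A minimal sketch of that one step, fixed-threshold binarization of 24 bit RGB pixels, is shown here. The luminance weights are the common Rec. 601 values and the cutoff of 128 is an arbitrary assumption; this is not tesseract's own thresholding code, and `threshold_rgb` is a hypothetical name.

```python
def threshold_rgb(rows, cutoff=128):
    """Convert rows of (r, g, b) tuples (0-255 each) to rows of 0/1 ints.

    1 marks a dark (ink) pixel, 0 a light (background) pixel.
    """
    out = []
    for row in rows:
        out_row = []
        for r, g, b in row:
            # Integer approximation of Rec. 601 luma: 0.299 R + 0.587 G + 0.114 B.
            gray = (299 * r + 587 * g + 114 * b) // 1000
            out_row.append(1 if gray < cutoff else 0)
        out.append(out_row)
    return out
```

A fixed cutoff is the simplest choice; unevenly lit scans usually need an adaptive threshold instead.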

Today tesseract has 3 big problems:
- memory leaks.
- overly complex code.
- process oriented: it's not designed to be used as a lib (exit(), file
I/O...)

What we need, I'm sure, is a complete rewrite. Transform the 222
cpp files into fewer than 20.
A C++ lib should be OS independent, simply because you don't need OS
specific API (no I/O).

I think the correct direction is to
1) reverse engineer the code and document it
2) do a complete rewrite from the documentation

When you have a good OCR lib then you can bundle it for "public"
usage.
I spent a lot of time in the tesseract source code and I don't want to
spend more time in it.
I'm ready to help with a complete rewrite.

Remi

Tessnet2 author
C++ dev since 1989
Windows platform expert (C++/C#)
Image processing expert
Freelance since 2001

rthomas

Nov 14, 2008, 4:41:59 AM
to tesseract-ocr
> A C++ lib should be OS independent, simply because you don't need OS
> specific API (no I/O).
I mean a C++ OCR Lib

Tien Dung

Nov 15, 2008, 6:26:52 PM
to tesser...@googlegroups.com
Hi all,

Remi Thomas thinks that:

An OCR engine doesn't need an image library or an image processing library. Keep the code simple, and let developers bundle it with the image lib they like/want to use. The only image processing you can include is thresholding from a 24 bit image.

He pointed out that Tesseract has 3 big problems:

- memory leaks.
- overly complex code.
- process oriented: it's not designed to be used as a lib

Tesseract needs a complete rewrite.

The correct direction is to

1) reverse engineer the code and document it
2) do a complete rewrite from the documentation

And he is ready to help with a complete rewrite.

What do the others think?

Do you agree that Tesseract needs a complete rewrite?

Can you help in case Ray wants to rewrite Tesseract?

Best regards,

--
Tien Dung
http://codemonkeycode.blogspot.com/

dsward

Nov 16, 2008, 5:29:08 PM
to tesseract-ocr
We use Tesseract on Mac OS X, so one of our main concerns is
portability, and we hope that the Tesseract software will continue to
work on all platforms. We like C++ with the standard libraries as the
programming language. We would hope that things like C# are not used
in the core code base.

The Windows installer and GUIs can be developed as independent
projects. We would hope that the chief developer is not burdened with
those platform-specific projects, which would be an unnecessary
distraction from improving the core functionality of the system.
Instead, Tesseract should have a good general API to support GUIs on
any platform.

We strongly support TIFF image formats and libtiff. We're not opposed
to adding support for other formats, but TIFF should not be dropped.

Leptonica can be extremely useful in OCR systems. We use the
connected component analysis feature for finding blob boundaries.
Based on that experience, building Tesseract with Leptonica looks like
a good idea.
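
For readers unfamiliar with the term, connected component analysis groups touching foreground pixels into blobs and reports each blob's bounding box. The following is a generic flood-fill sketch using 4-connectivity. It is not Leptonica's implementation or API, and the function name `blob_boxes` is hypothetical.

```python
from collections import deque


def blob_boxes(rows):
    """Return bounding boxes (left, top, right, bottom) of 4-connected blobs.

    rows is a list of rows of 0/1 ints, where 1 is a foreground pixel.
    Boxes use inclusive coordinates and are listed in scan order of
    each blob's first pixel.
    """
    height = len(rows)
    width = len(rows[0]) if rows else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if rows[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            left, top, right, bottom = x, y, x, y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx - 1, cy), (cx + 1, cy), (cx, cy - 1), (cx, cy + 1)):
                    in_bounds = 0 <= nx < width and 0 <= ny < height
                    if in_bounds and rows[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

With 4-connectivity, pixels that touch only diagonally end up in separate blobs; an 8-connected variant would add the four diagonal neighbours to the inner loop.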

Finally, I believe that the system could benefit from a complete
rewrite in a future version. The existing code works, but it's showing
its age.

- Doug Ward

Ray Smith

Nov 28, 2008, 1:54:01 PM
to tesser...@googlegroups.com
An advantage of making Leptonica a definite dependency, instead of an optional add-on, is that it would enable the deletion of the existing Tesseract IMAGE class, thus providing simplification and one step towards a rewrite.

While a full rewrite would be beneficial, it is something that cannot be achieved without a lot of effort, and that would either take a carefully coordinated effort from a large number of people, or a gradual change over a long time. Until I get a large army of very capable volunteers, it will have to be the latter approach.

BTW, I created an add-ons page in the wiki, and I see there is one comment there already. If you have an add-on, please add a comment there, and I, or one of the other developers, will gather them up and put them into the page.

Ray.

shao.d

Nov 28, 2008, 9:26:16 PM
to tesseract-ocr
As indicated in tesseract issue 174, I have completed porting
tesseract subversion revision 201 to use libtool to generate shared
libraries on platforms including Mac OS X, Ubuntu 8.10, and Cygwin
(and earlier versions built on FreeBSD 7.0 stable). In addition to
tesseract, I have completed patches for all of leptonlib-1.58, OpenFst
20080422, iulib subversion revision 117, and ocropus subversion
1307. The patches can be found in the ocropus file area
http://groups.google.com/group/ocropus/files
under the name shao.d with prefix "toautotools" and date suffix
"20081127.diff".

I know that tesseract is already available through various package
managers, including 2.01 from Macports on Mac OS X. But now I believe
there is a framework for all of these related packages to be even more
available, basically just a point and click away for Unix-like
platforms. Two legs of the GNU autotools were done, autoconf and
automake, and now the third leg of libtool for shared libraries is
done. In addition everything is now installable anywhere including a
subdirectory in one's home directory that does not require
administrator privileges to access. All a user has to do is set PATH
for executables, CPPFLAGS for headers, and LDFLAGS for libraries, and
compilation and installation just work with configure, make, and make
install. (There is a price that my patches require autoreconf and get
rid of existing Makefile.in's and Makefile's.)
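
The environment setup described above can be sketched as follows. The helper name `prefix_env` and the `bin`/`include`/`lib` layout under the install prefix are assumptions based on the usual autotools install layout; the patches themselves may differ in details.

```python
import os


def prefix_env(prefix, base_env=None):
    """Return a copy of the environment pointing the toolchain at prefix.

    PATH finds installed executables, CPPFLAGS adds the header directory,
    and LDFLAGS adds the library directory. Existing values are kept.
    """
    env = dict(os.environ if base_env is None else base_env)
    old_path = env.get("PATH", "")
    bin_dir = os.path.join(prefix, "bin")
    env["PATH"] = bin_dir + (os.pathsep + old_path if old_path else "")
    include_flag = "-I" + os.path.join(prefix, "include")
    lib_flag = "-L" + os.path.join(prefix, "lib")
    env["CPPFLAGS"] = (include_flag + " " + env.get("CPPFLAGS", "")).strip()
    env["LDFLAGS"] = (lib_flag + " " + env.get("LDFLAGS", "")).strip()
    return env
```

A build script could pass the returned mapping as the `env` argument of `subprocess.run` when invoking configure, make, and make install for each package.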
