Contributing?

Vivian Landers

Jan 15, 2008, 4:19:08 PM

to tesseract-ocr

Hi, I've studied OCR systems in grad school and I'm interested in
contributing to an open-source OCR project. Tesseract seems like one
of the best ones currently available, and I like its modular design.
Is there a to-do list anywhere of things that Tesseract still needs?
What research is it based on, and what existing research might
profitably be incorporated into it? What practical work needs to be
done on things like the build system, installer, code refactoring,
documentation, and so on? And how do I get permission to make
contributions? Thanks a lot for your help!

-Vivian

74yrs old

Jan 15, 2008, 9:38:32 PM

to tesser...@googlegroups.com

You can also contribute to OCRopus by developing a Windows port of it.

Ted Rolle

Jan 16, 2008, 12:26:23 AM

to tesser...@googlegroups.com

This may seem dirt-simple, but hang around the mailing list and see what people need.

Ted

Jussi Pakkanen

Jan 16, 2008, 5:17:16 AM

to tesser...@googlegroups.com

On Jan 15, 2008 11:19 PM, Vivian Landers <vivian...@gmail.com> wrote:

> profitably be incorporated into it? What practical work needs to be
> done on things like the build system, installer, code refactoring,
> documentation, and so on?

Here are some things which I have considered while looking at the code:

Documenting how the character matcher actually works. AFAICR the
system does very convoluted things before calling the matcher. Stuff
like converting extracted TBLOBs to PBLOBs and back again and calling
matching functions indirectly through function pointers and so on.

Tesseract passes some state variables as function parameters but some
things are in global variables that are scattered all over the code
base. This makes Tesseract very non-threadsafe and difficult to
decipher. Wrapping all global variables in a struct and passing it
along with the functions or something like that would be beneficial.
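
A minimal sketch of that refactoring, under stated assumptions: the names RecognizerState, AcceptBlob and CountAccepted are hypothetical and do not come from the Tesseract sources. State that would otherwise live in globals is gathered into one struct, and every function that needs it takes the struct as a parameter.

```cpp
#include <cassert>
#include <vector>

// Hypothetical names throughout; none of these exist in Tesseract.
// Before the change, the two fields below would be file-scope globals
// shared by every caller in the process.
struct RecognizerState {
  int min_blob_height = 8;  // formerly a tunable global
  int blobs_seen = 0;       // formerly a global counter
};

// Each function that used to touch the globals now receives the state.
bool AcceptBlob(RecognizerState& state, int blob_height) {
  ++state.blobs_seen;
  return blob_height >= state.min_blob_height;
}

// Counts the blobs that pass the height test for one particular state.
int CountAccepted(RecognizerState& state, const std::vector<int>& heights) {
  int accepted = 0;
  for (int h : heights) {
    if (AcceptBlob(state, h)) ++accepted;
  }
  return accepted;
}
```

Because two RecognizerState objects share nothing, two recognizers with different settings can coexist in one process, and each could be driven from its own thread without locking.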

Like all C programs from the early 90s, Tesseract has its own
implementation of standard containers such as a linked list. (In fact,
it has several implementations which all work slightly differently.)
These are done with preprocessor macros. See for example elst.h.
Converting to STL containers would be very beneficial.

Tesseract has two different build systems: autotools for Unix and
Visual Studio project files for Windows. It would make more sense to
have just one build system using CMake, since it supports all these
platforms (plus a bunch of others such as Xcode) natively.

> And how do I get permission to make contributions? Thanks a lot for your help!

Usually posting patches to the bug tracker or this list is sufficient.

Ted Rolle

Jan 16, 2008, 1:51:00 PM

to tesser...@googlegroups.com

I've been a programmer for a long time, the bane of management. I believe that if more than 20% of the code has to be modified, it's better to rewrite it from the ground up.

Spaghetti code is particularly deserving.

Ted

74yrs old

Jan 16, 2008, 9:13:12 PM

to tesser...@googlegroups.com

Since you are a good programmer, why not write a Windows port of OCRopus, which is based on Tesseract?

Ted Rolle

Jan 17, 2008, 1:15:14 AM

to tesser...@googlegroups.com

Who, ME? I'm retired from all that foolishness.

Vivian Landers

Jan 18, 2008, 6:28:48 PM

to tesseract-ocr

Terve Jussi, thanks for your comments.

On Jan 16, 2:17 am, "Jussi Pakkanen" <jpakk...@gmail.com> wrote:
> Tesseract passes some state variables as function parameters but some
> things are in global variables that are scattered all over the code
> base. This makes Tesseract very non-threadsafe and difficult to
> decipher. Wrapping all global variables in a struct and passing it
> along with the functions or something like that would be beneficial.
>
> Like all C programs from the early 90s, Tesseract has its own
> implementation of standard containers such as a linked list. (In fact,
> it has several implementations which all work slightly differently.)
> These are done with preprocessor macros. See for example elst.h.
> Converting to STL containers would be very beneficial.

Cool! These both sound like straightforward things I could work on to
help me ramp up on the code base.

> Tesseract has two different build systems: autotools for Unix and
> Visual Studio project files for Windows. It would make more sense to
> have just one build system using CMake, since it supports all these
> platforms (plus a bunch of others such as Xcode) natively.

Build system migration is a pain, but if we wanted to expand the
system to new platforms this would be a good first step.

> Documenting how the character matcher actually works. AFAICR the
> system does very convoluted things before calling the matcher. Stuff
> like converting extracted TBLOBs to PBLOBs and back again and calling
> matching functions indirectly through function pointers and so on.

This seems better suited for someone already familiar with the code
base, but it's a good suggestion.

> > And how do I get permission to make contributions?
>

> Usually posting patches to the bug tracker or this list is sufficient.

Thank you, I will! I'd also like to solicit feedback from some of the
primary authors if they're available.

-Vivian

Ted Rolle

Jan 18, 2008, 8:13:12 PM

to tesser...@googlegroups.com

I owe you people an apology: I'm sorry. It was a bad day.

Vivian, you are so enthusiastic! It's refreshing to see.
Grad school? Where? In what?

I believe theraysmith is still about. He's one of the "heavies" for tesseract -- if not the author.

Ray Smith

Jan 18, 2008, 8:53:15 PM

to tesser...@googlegroups.com

Hi Vivian,

Thanks for volunteering. For the benefit of everyone who has expressed an interest in contributing, here is a summary of how I have been running the project so far:

In case you didn't know, we are actively developing (and using) Tesseract here at Google, and we are committed to putting our improvements back into the open source codebase. Changes we make at Google go into our own codebase, source control system, etc. first, and get tested very carefully, but on a limited set of platforms. As a consequence of the dual codebases, someone (usually me) has to merge changes on the two systems periodically, usually at a point of good stability, and test. The merge and release cycle has therefore turned out to be fairly complex and infrequent, every 2-4 months, but at each one I try to incorporate as many of the patches people have supplied, and fix as many of the issues people have reported, as possible. Due to having several different responsibilities, I contribute to this forum on a best-effort basis, but I don't get a great deal of time to give to it.

For making small patches for this and that, this forum (and the issues list) are fine with me, but they won't generally get much attention until a release cycle.

If you have more time to contribute, I will add you as a developer to give you access to the subversion repository. I have a list of projects of varying sizes that I carry round in my head. I occasionally post comments on them to this forum, but that makes them more difficult to find. I will comment on some of them now, and later post them to a new wiki page on the tesseract-ocr site.

Easy starter project:
1. Convert old variables to new. There are 2 implementations of tunable parameters (called "variables"), an old C relic, and a new C++ implementation. It would be nice to get all the old C-style variables (variables.h) converted to new C++ style (varable.h). This would enable a tidying up of config files, and a tidying up of the initialization process.
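
To illustrate the general shape of such a conversion, here is a sketch under loud assumptions: IntParam, ApplyConfig and max_iterations are hypothetical names, and this is not the actual interface of variables.h or varable.h. The idea shown is only that a bare global becomes a typed object that registers itself by name, so a config reader can find it without a hand-maintained table.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch; not the real variables.h / varable.h API.
// Old C style would be a bare global such as:
//   int max_iterations = 10;
// plus a separate name-to-address table kept in sync by hand.

// New C++ style: a typed parameter that registers itself under its name.
class IntParam {
 public:
  IntParam(const std::string& name, int value, const std::string& help)
      : value_(value), help_(help) {
    Registry()[name] = this;
  }
  operator int() const { return value_; }  // reads look like a plain int
  void set(int v) { value_ = v; }
  const std::string& help() const { return help_; }

  // Looks a parameter up by name, as a config-file reader would.
  static IntParam* Find(const std::string& name) {
    auto it = Registry().find(name);
    return it == Registry().end() ? nullptr : it->second;
  }

 private:
  // Function-local static, so it exists before any global IntParam
  // constructor tries to register itself.
  static std::map<std::string, IntParam*>& Registry() {
    static std::map<std::string, IntParam*> registry;
    return registry;
  }
  int value_;
  std::string help_;
};

// A converted parameter: same default as before, now discoverable by name.
IntParam max_iterations("max_iterations", 10, "Iteration limit");

// Applies one "name value" config entry; returns false for unknown names.
bool ApplyConfig(const std::string& name, int value) {
  IntParam* p = IntParam::Find(name);
  if (p == nullptr) return false;
  p->set(value);
  return true;
}
```

Existing code that reads the parameter as an int keeps compiling thanks to the conversion operator, which is what would make an incremental, variable-by-variable migration practical in a design like this.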

Bigger, but still fairly basic:
2. Thread safety. As someone on this thread already mentioned, it would be useful to gather all the nasty globals together into a Tesseract class, and make the TessBaseAPI instance-based, with a view to making it possible to have two instances of Tesseract running together in the same process. This might be a lot harder than it looks on the surface, but does involve a lot of fairly basic grind.

3. An hOCR output converter might be useful to many people. A lot of people ask for bounding box information, and that might be a good standard way of doing it.
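
For item 3, a sketch of what the core of such a converter might look like. The WordBox struct and both function names are hypothetical; only the output shape (a span whose title attribute carries "bbox x0 y0 x1 y1" in pixel coordinates) follows the hOCR convention, and the ocrx_word class name is used on the assumption that word-level output is wanted.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical word record; a real converter would fill this from the
// recognizer's results. Coordinates are in pixels, origin at top-left.
struct WordBox {
  std::string text;
  int left, top, right, bottom;
};

// Escapes the characters that are special in HTML text and attributes.
std::string EscapeHtml(const std::string& in) {
  std::string out;
  for (char c : in) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      case '"': out += "&quot;"; break;
      case '\'': out += "&#39;"; break;
      default: out += c;
    }
  }
  return out;
}

// Formats one word as an hOCR span carrying its bounding box.
std::string WordToHocr(const WordBox& w) {
  std::ostringstream os;
  os << "<span class='ocrx_word' title='bbox " << w.left << ' ' << w.top
     << ' ' << w.right << ' ' << w.bottom << "'>" << EscapeHtml(w.text)
     << "</span>";
  return os.str();
}
```

A full converter would also wrap words in line, paragraph and page elements; this sketch covers only the per-word bounding box that people keep asking for.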

Longer term, harder & requiring deeper knowledge:
4. Get rid of the polygonal approximation. Recognition could be more accurate, probably with virtually no loss of speed, by getting rid of the polygonal approximation. It would be relatively easy to convert the character classifier, but we need a new chopper that works directly on the C_OUTLINEs. Inter-dependent with:

5. Cut the crap. The top-level code uses C++ classes to describe the outlines, blobs, words, etc, but the lower-level classifier code, and the top-level word classifier both use older C-style structs for the same purpose. While it would be nice to get rid of the old completely and use only the new, it would not really be desirable to do it, until we have a chopper that uses the un-approximated edge-step outlines, as then we could eliminate the polygonal approximation step, and the data structures that support them, completely.

6. More integration with ocropus would be useful. Allowing tesseract and ocropus to share the same training data is one possibility.

Things I would NOT recommend working on:
1. Someone suggested the macro-based list stuff as a candidate for replacement with stl. I will not be incorporating any such changes into the main codebase for 2 reasons:
a. We have been trying to keep stl out of tesseract for a specific reason that I probably shouldn't comment on.
b. Compared to the macro-based lists in tesseract, stl lists are very different, very incompatible, and IMHO a poor abstraction designed to make them as like vectors as possible, and if you use them the way they are used in tesseract, they would be very slow. Although the rest of stl is very useful, reason (a) still keeps it out of the codebase. It might be possible to sensibly convert the macro-based lists to (mostly) use templates though.

2. Although it is very tempting to try to expand tesseract to new languages, if you did so, you would be overlapping significantly with the work going on at Google. Of course that leaves anyone that wants a different language in the difficult position of either waiting for it to be available, or trying to train it themselves. I will be in a much better position to discuss language compatibility after the next release, by which time there will be much more language support.

Ray.

Jussi Pakkanen

Jan 21, 2008, 8:16:56 AM

to tesser...@googlegroups.com

On Jan 19, 2008 3:53 AM, Ray Smith <thera...@gmail.com> wrote:

> In case you didn't know, we are actively developing (and using) Tesseract
> here at Google, and we are committed to putting our improvements back into
> the open source codebase. Changes we make at Google go into our own
> codebase, source control system etc first, and get tested very carefully,
> but on a limited set of platforms.

This is not a very good way of organizing an open source project,
since it does not encourage participation from the community. The open
source mantra is "release early, release often" and this means SVN
development as well. It can be very discouraging for a new developer
to have their hard-won patches dropped because the work was already
done (but kept secret) or because they conflict with certain new
development (which, again, was kept a secret).

> As a consequence of the dual codebases,
> someone (usually me) has to merge changes on the two systems periodically,
> usually at a point of good stability, and test. The merge and release cycle
> has therefore turned out to be fairly complex and infrequent - at 2-4
> months, but at each one, I try to incorporate as many patches that people
> have supplied and fix as many of the issues that people have reported as
> possible.

This is pretty much the same problem as discussed here:

http://thedailywtf.com/Articles/Happy_Merge_Day!.aspx

The lesson: commit early, commit often, merge early, merge often. Not
doing that causes a ton of totally unnecessary work. It would be
beneficial in this case to have a revision control system that handles
merges and private branches better than SVN (that is to say: at all).
However I'm not willing to open that particular can of flameworms at
this time.

As a basis of discussion I propose the following way to handle this issue:

There are three different kinds of changes to Tesseract:

1) those that do not contain any Google Secret Information (TM) such
as bug fixes and general enhancements
2) those with Google Secret Information that will get released when
they are done
3) those with Google Secret Information that will never, ever get released

Type 1 code is the kind that should be pushed to public SVN as soon as
possible. Even better would be if they were developed in the public
SVN and only then imported to Google's internal branch.

Type 2 would be developed inside Google and then pushed to public SVN.
But again, as early as possible. I guess these would be added just
before releases as is currently the case.

Type 3 is Google's internal issue and thus not our problem.

This approach would try to satisfy two slightly contradictory goals:

- public SVN is as up to date with project development as is possible,
enabling community participation
- Google gets to keep their secret bits secret, keeping corporate bean
counters happy

Vivian Landers

Jan 21, 2008, 1:12:40 PM

to tesseract-ocr

Hi Ray, thanks for your response. I responded to this before but that
response didn't show up and may have been lost, so just posting again.
I didn't know before about Google's involvement with the project and
your integration schedule, and that info was helpful.

On Jan 18, 5:53 pm, "Ray Smith" <theraysm...@gmail.com> wrote:
> Easy starter project:
> 1. Convert old variables to new.

> Bigger, but still fairly basic:
> 2. Thread safety.

> 3. An hOCR output converter might be useful to many people.

These sound straightforward and like they'd be great for ramping up on
the codebase. The other projects you listed also sound interesting but
I'd like to get some more experience with Tesseract before considering
tackling them.

> a. We have been trying to keep stl out of tesseract for a specific reason
> that I probably shouldn't comment on.
> b. Compared to the macro-based lists in tesseract, stl lists are very
> different, very incompatible, and IMHO a poor abstraction designed to make
> them as like vectors as possible, and if you use them the way they are used
> in tesseract, it would be very slow. Although the rest of stl is very
> useful, reason (a) still keeps it out of the codebase. It might be possible
> to sensibly convert the macro-based lists to (mostly) use templates though.

I'm curious about your reasons for excluding the STL, but in any case
I love killing macros and would be happy to introduce templates here,
with suitable testing to prevent performance regression.
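
As a rough illustration of what a template-based replacement could look like, here is a sketch that does not use the STL. It assumes the macro-generated lists are intrusive (the link lives inside the element); ListLink, IntrusiveList, Blob and TotalArea are hypothetical names, not anything from elst.h.

```cpp
#include <cassert>

// Hypothetical sketch, not Tesseract's elst.h. The link is stored inside
// the element, so the list itself never allocates per-node memory.
template <typename T>
struct ListLink {
  T* next = nullptr;
};

// Singly linked intrusive list. T must derive from ListLink<T>.
// The list does not own its elements.
template <typename T>
class IntrusiveList {
 public:
  bool empty() const { return head_ == nullptr; }

  int length() const {
    int n = 0;
    for (const T* p = head_; p != nullptr; p = p->next) ++n;
    return n;
  }

  void push_front(T* item) {
    item->next = head_;
    head_ = item;
  }

  // Unlinks and returns the first element, or nullptr if the list is empty.
  T* pop_front() {
    T* item = head_;
    if (item != nullptr) {
      head_ = item->next;
      item->next = nullptr;
    }
    return item;
  }

  // Calls fn on each element from front to back.
  template <typename Fn>
  void for_each(Fn fn) const {
    for (const T* p = head_; p != nullptr; p = p->next) fn(*p);
  }

 private:
  T* head_ = nullptr;
};

// Example element type: what a macro would previously have "listized".
struct Blob : ListLink<Blob> {
  explicit Blob(int a) : area(a) {}
  int area;
};

int TotalArea(const IntrusiveList<Blob>& blobs) {
  int total = 0;
  blobs.for_each([&total](const Blob& b) { total += b.area; });
  return total;
}
```

Because the compiler generates the same pointer-chasing code a macro would have expanded to, a design like this should not add run-time overhead by itself, though that would need to be confirmed by the performance testing mentioned above.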

> For making small patches for this and that, this forum (and the issues list)
> are fine with me, but they won't generally get much attention until a
> release cycle. If you have more time to contribute, I will add you as a developer to give
> you access to the subversion repository.

I plan to make this a priority, so I should be able to contribute
substantial time. It would be great if I could get SVN repository
access to help streamline contributions. Please let me know if this is
possible.

Thanks again for your feedback and I look forward to working with you.

-Vivian

Scan...@gmail.com

Jan 25, 2008, 8:04:46 PM

to tesseract-ocr

Well, at least we have a list. How do we divvy it up?
