> profitably be incorporated into it? What practical work needs to be
> done on things like the build system, installer, code refactoring,
> documentation, and so on?
Here are some things which I have considered while looking at the code:
Documenting how the character matcher actually works. AFAICR the
system does some very convoluted things before calling the matcher:
converting extracted TBLOBs to PBLOBs and back again, calling
matching functions indirectly through function pointers, and so on.
Tesseract passes some state as function parameters, but other state
lives in global variables that are scattered all over the code base.
This makes Tesseract non-thread-safe and difficult to decipher.
Wrapping the global variables in a struct and passing it to the
functions that need it, or something like that, would be beneficial.
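To make the struct idea concrete, here is a rough sketch of what I mean. Every name in it (RecognizerContext, min_confidence, accept_match) is made up for illustration and does not come from the Tesseract source; the real globals would have to be inventoried first.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for state that would otherwise live in globals.
// None of these names are taken from the Tesseract code base.
struct RecognizerContext {
    int min_confidence = 50;       // hypothetical tuning knob
    bool debug = false;            // hypothetical debug switch
    std::vector<std::string> log;  // per-context log instead of a global one
};

// The function receives the state it needs instead of reading globals,
// so two contexts can be used side by side without interfering.
bool accept_match(RecognizerContext& ctx, int confidence) {
    bool ok = confidence >= ctx.min_confidence;
    if (ctx.debug) {
        ctx.log.push_back(ok ? "accepted" : "rejected");
    }
    return ok;
}
```

With something like this, each thread (or each page being processed) could own its own context object, which is the part that global variables make impossible.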
Like many C programs from the early 90s, Tesseract has its own
implementation of standard containers such as linked lists. (In fact,
it has several implementations which all work slightly differently.)
These are done with preprocessor macros; see for example elst.h.
Converting to STL containers would be very beneficial.
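As an illustration of what the STL version could look like, here is a minimal sketch using std::list. The Blob struct and drop_narrow_blobs function are hypothetical and only stand in for whatever element types the macro-generated lists hold today; this is not a port of anything in elst.h.

```cpp
#include <cstddef>
#include <list>
#include <string>

// Hypothetical element type; not a real Tesseract class.
struct Blob {
    std::string label;
    int width;
};

// Remove every blob narrower than min_width and return how many were removed.
// std::list supplies the traversal and unlinking that a hand-written
// list would otherwise need its own iterator code for.
std::size_t drop_narrow_blobs(std::list<Blob>& blobs, int min_width) {
    std::size_t before = blobs.size();
    blobs.remove_if([min_width](const Blob& b) { return b.width < min_width; });
    return before - blobs.size();
}
```

The point is less about this particular function and more that the container behaviour is documented, type-checked, and the same everywhere it is used.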
Tesseract has two different build systems: autotools for Unix and
Visual Studio project files for Windows. It would make more sense to
have just one build system using CMake, since it can generate native
build files for all of these (plus a bunch of others such as Xcode).
> And how do I get permission to make contributions? Thanks a lot for your help!
Usually posting patches to the bug tracker or this list is sufficient.
> Hi Jussi, thanks for your comments.
> In case you didn't know, we are actively developing (and using) Tesseract
> here at Google, and we are committed to putting our improvements back into
> the open source codebase. Changes we make at Google go into our own
> codebase, source control system etc first, and get tested very carefully,
> but on a limited set of platforms.
This is not a very good way of organizing an open source project,
since it does not encourage participation from the community. The open
source mantra is "release early, release often" and this means SVN
development as well. It can be very discouraging for a new developer
to have their hard-won patches dropped because the work was already
done (but kept secret) or because they conflict with some new
development (which, again, was kept secret).
> As a consequence of the dual codebases,
> someone (usually me) has to merge changes on the two systems periodically,
> usually at a point of good stability, and test. The merge and release cycle
> has therefore turned out to be fairly complex and infrequent - at 2-4
> months, but at each one, I try to incorporate as many patches that people
> have supplied and fix as many of the issues that people have reported as
> possible.
This is pretty much the same problem as discussed here:
http://thedailywtf.com/Articles/Happy_Merge_Day!.aspx
The lesson: commit early, commit often, merge early, merge often. Not
doing that causes a ton of totally unnecessary work. It would be
beneficial in this case to have a revision control system that handles
merges and private branches better than SVN (that is to say: at all).
However, I'm not willing to open that particular can of flameworms at
this time.
As a basis of discussion I propose the following way to handle this issue:
There are three different kinds of changes to Tesseract:
1) those that do not contain any Google Secret Information (TM) such
as bug fixes and general enhancements
2) those with Google Secret Information that will get released when
they are done
3) those with Google Secret Information that will never, ever get released
Type 1 changes are the kind that should be pushed to public SVN as
soon as possible. Even better would be if they were developed in the
public SVN and only then imported into Google's internal branch.
Type 2 would be developed inside Google and then pushed to public SVN.
But again, as early as possible. I guess these would be added just
before releases as is currently the case.
Type 3 is Google's internal issue and thus not our problem.
This approach would try to satisfy two slightly contradictory goals:
- public SVN is as up to date with project development as is possible,
enabling community participation
- Google gets to keep their secret bits secret, keeping corporate bean
counters happy