Cube documentation, training source files, and general lack of openness


Nick White

May 28, 2013, 1:26:27 PM
to tesser...@googlegroups.com
Hi Tesseractors,

I am feeling rather fed up about the lack of openness with the
Tesseract project.

The addition of the cube mode, and several trainings, with
absolutely no documentation, or (as far as I can tell) any tools to
create cube training files, is a good example of this.

As is the lack of tif/box files for any of the core training files
in the project.

Keeping the cube tools and documentation private sucks royally. If
they aren't perfect or polished, it doesn't matter; we could help
to fix them up!

I suspect some of the tif/box files for training aren't being
released because of concerns about copyright of the image files. If
that's the case please work to clear them up, or create freely
reusable versions.

Otherwise it feels like there is a split between the Google
Tesseract cabal (to use the most dramatic language possible :p) and
the community.

I love Tesseract; having a very high quality free software OCR
package is awesome, and I'm very grateful for the amazing work being
done on it. But I find the lack of parity between those inside
Google and the wider community to be rather troubling.

If there's anything I can do to help make cube training tools and
documentation available, or the training source files, I'd be very
happy to help. Replying offlist if appropriate is fine.

Nick

zdenko podobny

Jun 9, 2013, 3:21:56 PM
to tesser...@googlegroups.com
Hi Nick,

I understand your feelings, but not publishing something is OK from a licence point of view ;-). The original code was released (to some extent, AFAIK) by HP[0]. Have a look at that site and code, and compare today's information and tools with it. AFAIK, if a project uses the Apache licence, a main reason is that contributors/users are not obliged to publish their improvements...
Then Google contributed code (the 2.x and 3.x versions). They were (and are) not obliged to do it, but they did. Maybe you have already noticed that Google has some issues regarding its openness[1].

Releasing something (cube) without documentation is not the best approach - I agree. But they could have decided not to release it at all - a worse choice ;-). On the other hand, Ray/Google never stated that they will not release it ;-). It looks like they are behind their release plan ;-) - have a look at Leptonica: there were several promises to release version 1.70[5] that did not happen. BTW, Dan Bloomberg (the author of Leptonica) works for Google too[6].

What I really miss is a roadmap, guidance about future changes, and a response from time to time. The last response I found from Ray was from the beginning of January 2013...
It would be great if somebody from Google gave a clear statement of what they plan to do (e.g. we will not release more docs, we will not provide other training data, we will change the training tools within the next x months, etc.). This would help us as the Tesseract community to set targets efficiently (it does not make sense to fix the docs when a change is coming soon...).

I checked my e-mails and here is my understanding of the current status:
  1. Google runs tesseract within its own systems, so the build systems (autotools on Linux, VC++ solutions on Windows, etc.) are community tasks[2].
  2. Maintaining the Google project site is a community task[2] (OK, we do not have full access, but we can do most of the tasks).
  3. Issues regarding tesseract development should be addressed to the tesseract-dev forum. This is the reason I am replying here and not to the same e-mail in the tesseract-ocr forum[3]. If I got it right, Ray is filtering tesseract-ocr.
  4. Contacting developers (and contributors ;-) ) directly without using the forums does not help.
  5. Google does not use hOCR internally, so IMO extending/improving this part is a community task.
  6. IMO the C API is a community task too.
  7. I do not believe that the current language data files were created only with the process/tools described on the Training wiki ;-). Ray stated this too[4]. I think that post is still valid (it was from before the 3.01 release), so asking for the tiff/box files IMO does not make sense. More interesting would be whether incremental training is possible (e.g. for a missing sign).
  8. There are several documentation tasks that we can do:
      • a wiki page on image processing (some basic cases) - it is already part of our experience that pre-processing images gives better results than re-training...
      • a wiki page on basic usage of the tesseract API (C and C++) - there are already some examples in the forum, but putting them in one place would help.
      • changing the format of all comments in the source code to doxygen format - I found out that a lot of comments are not part of the doxygen documentation just because of their "wrong" format. This is the most questionable task for me, because it could cause additional problems when Google tries to sync their code with svn ;-) On the other hand, it could be the most useful task for better understanding of the code ;-)
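One quick way to see which comments doxygen currently picks up is to generate the documentation locally. A minimal sketch, assuming doxygen is installed; the config file name and the INPUT paths are just illustrative:

```shell
#!/bin/sh
# Generate a default doxygen config, point INPUT at some source dirs,
# and build the HTML docs. Comments in the "wrong" format will simply
# be missing from the generated output.
doxygen -g Doxyfile.tess
# Point INPUT at a couple of source directories (GNU sed shown;
# on BSD/macOS use: sed -i '' ...)
sed -i 's|^INPUT  *=.*|INPUT = ./ccmain ./ccstruct|' Doxyfile.tess
doxygen Doxyfile.tess
```

Comparing the resulting html/ tree against the source files makes the gaps obvious.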
I like your idea to document the code[7]. Maybe we could also have a space for short notes ;-) I find things out from time to time but do not have time to elaborate on them - sharing them could help somebody concentrate on the topic...



Zdenko






Perry Horwich

Jun 30, 2013, 7:38:32 PM
to tesser...@googlegroups.com
I just want to add that I am grateful to you for starting this thread.  I am new to OCR, but a veteran image-processing programmer.  I have been surprised more than once by how the openness of this open-source project suddenly seems to dry up.  It's too bad.  I hoped to extend what I have found here and give back to the OS community, but struggling with the items raised here is a deal breaker.  I just don't have the time to find out later that I may have re-invented the wheel.  Not sharing hurts many, profits few, and ultimately limits the potential of a project like this.  Not the noblest of principles, but not an unfamiliar one either.  Just too bad.

Jimmy O'Regan

Jul 12, 2013, 5:52:45 PM
to tesser...@googlegroups.com
On 28 May 2013 18:26, Nick White <nick....@durham.ac.uk> wrote:
> Hi Tesseractors,
>
> I am feeling rather fed up about the lack of openness with the
> Tesseract project.
>
> The addition of the cube mode, and several trainings, with
> absolutely no documentation, or (as far as I can tell) any tools to
> create cube training files, is a good example of this.
>
> As is the lack of tif/box files for any of the core training files
> in the project.
>

There are tif/box files for the Tesseract 2 language packs. There were
no such files for Tesseract 3; they were generated. Ray has expressed
an interest in opening the tool(s) for generating the files, but they
have to be decoupled from Google's internal infrastructure.

Part of the problem is that this internal infrastructure does not have
a direct open source equivalent, so releasing it as-is would not be
useful. I assume that another part is that this infrastructure is
effectively (if not in actuality) trade secret, so releasing
information about it is simply not an option.

> Keeping the cube tools and documentation private sucks royally. If
> they aren't perfect or polished, it doesn't matter; we could help
> to fix them up!
>

I think it's more likely that these tools, like the image/box
generators, are tightly coupled to Google's internal infrastructure,
and can't be opened without a rewrite.

To take possibly the simplest example: the word-freq file. To get
something similar, you can just do:

$ cat mybigtextfile.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2 "\t " $1}' | sort > mylang.cube.word-freq

On massive corpora, like Google have, it's a great task for MapReduce
(in fact, it's the first example you get in the Hadoop tutorial
(http://hadoop.apache.org/docs/stable/mapred_tutorial.html) or on the
Spark website (http://spark-project.org/)).
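The same computation can be sketched locally in plain shell to see the map/shuffle/reduce shape of it (the file names here are made up for illustration):

```shell
#!/bin/sh
# Word count, MapReduce-style, simulated in a single pipeline:
#   "map":     tr emits one word per line
#   "shuffle": sort brings identical words together
#   "reduce":  uniq -c counts each run of identical words
printf 'the cat sat on the mat\n' > corpus.txt
tr -s ' ' '\n' < corpus.txt | sort | uniq -c | sort -rn
# the most frequent word ("the", count 2) ends up on the first line
```

On one machine the sort is the bottleneck; MapReduce's trick is doing exactly this grouping across thousands of machines.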

Similarly, the neural network code is missing the training pieces;
Google have recently published a few papers about their distributed
neural network infrastructure, which is not available to the outside
world. On the bright side, there is open source code available for
training convolutional networks. I think it should be possible to
convert the output for use with Cube, but I honestly don't know enough
about neural nets to start looking. (On the other hand, there's no
indication of what the 'hybrid' neural network code is a hybrid _of_,
so it's not all good).

> I suspect some of the tif/box files for training aren't being
> released because of concerns about copyright of the image files. If
> that's the case please work to clear them up, or create freely
> reusable versions.
>

I'm pretty sure that's _not_ the case.

> Otherwise it feels like there is a split between the Google
> Tesseract cabal (to use the most dramatic language possible :p) and
> the community.
>

There is no cabal :)

No, seriously. There is none. Any commit you see in the repository was
made by a volunteer, it's just that some (two, last count) of them
happen to have @google.com addresses, access to (some of) the work
Google has done on Tesseract, and permission to release it.

Other than that, yes, there quite obviously is a split between the
Google volunteers and the other volunteers: for one, the work done in
the open by the latter makes it harder for the former to open Google's
internal work. Not impossibly hard, but when time is limited -- as it
quite clearly is -- doing a three-way merge does not exactly rank
highly on anyone's list of fun things to do.

> I love Tesseract; having a very high quality free software OCR
> package is awesome, and I'm very grateful for the amazing work being
> done on it. But I find the lack of parity between those inside
> Google and the wider community to be rather troubling.
>
> If there's anything I can do to help make cube training tools and
> documentation available, or the training source files, I'd be very
> happy to help. Replying offlist if appropriate is fine.


--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

Ray Smith

Jul 14, 2013, 10:07:06 PM
to tesser...@googlegroups.com
Hi, thanks for starting this thread.
Firstly I would like to apologize for my lack of communication.
My excuse is I never used to like writing letters (with pen and paper) and that has transferred over to email. I get so much of it to deal with at work that it is the last thing I want to do when I get home.
Anyway some answers inline. Actually Zdenko's answers are mostly spot-on, so I will start there.

  1. Google runs tesseract within its systems, so building systems (autotools on Linux, VC++ solutions on Windows etc...) are community tasks[2]
Correct.
  2. Maintaining the Google project site is a community task[2] (OK, we do not have full access, but we can do most of the tasks).
I can easily add more people with full access. Please tell me who to add. I've already added Nick and Zdenko to this group.
  3. Issues regarding tesseract development should be addressed to the tesseract-dev forum. This is the reason I am replying here and not to the same e-mail in the tesseract-ocr forum[3]. If I got it right, Ray is filtering tesseract-ocr.
Right.
  4. Contacting developers (and contributors ;-) ) directly without using the forums does not help.
Agreed.
  5. Google does not use hOCR internally, so IMO extending/improving this part is a community task.
Not much anyway, so yes.
  6. IMO the C API is a community task too.
Definitely.
  7. I do not believe that the current language data files were created only with the process/tools described on the Training wiki ;-). Ray stated this too[4]. I think that post is still valid (it was from before the 3.01 release), so asking for the tiff/box files IMO does not make sense. More interesting would be whether incremental training is possible (e.g. for a missing sign).
Work on open-sourcing the most important training tools (those that don't rely on map-reduce) is now well under way. At last!
  8. There are several documentation tasks that we can do:
      • a wiki page on image processing (some basic cases) - it is already part of our experience that pre-processing images gives better results than re-training...
Sounds like a good idea.
      • a wiki page on basic usage of the tesseract API (C and C++) - there are already some examples in the forum, but putting them in one place would help.
Agreed.
      • changing the format of all comments in the source code to doxygen format - I found out that a lot of comments are not part of the doxygen documentation just because of their "wrong" format. This is the most questionable task for me, because it could cause additional problems when Google tries to sync their code with svn ;-) On the other hand, it could be the most useful task for better understanding of the code ;-)
Hmm. Not so keen on this, but I can see why it would be useful to the open source community. The best time to do it is right after I have done a major update, so I can get it back into the Google codebase without a major 3-way diff.
On Fri, Jul 12, 2013 at 2:52 PM, Jimmy O'Regan <jor...@gmail.com> wrote:
On 28 May 2013 18:26, Nick White <nick....@durham.ac.uk> wrote:
> Hi Tesseractors,
>
> I am feeling rather fed up about the lack of openness with the
> Tesseract project.
>
> The addition of the cube mode, and several trainings, with
> absolutely no documentation, or (as far as I can tell) any tools to
> create cube training files, is a good example of this.
>
> As is the lack of tif/box files for any of the core training files
> in the project.
>

There are tif/box files for the Tesseract 2 language packs. There were
no such files for Tesseract 3; they were generated. Ray has expressed
an interest in opening the tool(s) for generating the files, but they
have to be decoupled from Google's internal infrastructure.
Next to come out will be a tool to add the "new" properties to a unicharset file. This will be accompanied  by a set of "universal" unicharsets that contain the properties that have been set from a large number of fonts. Access to fonts is the biggest hindrance to making the training process more open, but with this solution most of that dependence goes away.

After that will be a tool to generate tiff and box files from some text and a set of fonts. It is going to take a rewrite and possibly some changes to portability - e.g. training won't work on Windows unless/until you guys can help fix it.
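To give a rough idea, such a tool might be driven from the shell like this; the text2image name and all the flags below are assumptions, not a released interface:

```shell
#!/bin/sh
# Hypothetical sketch: generate tiff/box training pairs from a text
# file plus a set of fonts. The text2image name and its flags are
# assumptions; adjust to whatever tool actually gets released.
printf 'The quick brown fox jumps over the lazy dog.\n' > eng.training_text

# Only attempt the generation step if the (hypothetical) tool exists.
if command -v text2image >/dev/null 2>&1; then
  text2image --text=eng.training_text --outputbase=eng.arial.exp0 \
             --font='Arial' --fonts_dir=/usr/share/fonts
fi
```

The point of such an interface is that anyone with the named fonts could regenerate identical training images from plain text.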
Hmm, cube.
It's a dead end. It didn't really contribute very much, so nobody at Google wants to work on it. The new deep belief nets that you have read about, on the other hand, yes. The *only* thing cube was really good at was Hindi.

> I suspect some of the tif/box files for training aren't being
> released because of concerns about copyright of the image files. If
> that's the case please work to clear them up, or create freely
> reusable versions.
>

I'm pretty sure that's _not_ the case.
Problem 1. 
Well, actually it is. Say we had used a commercially available font in the training process, and our commercial license allowed us to do that, but not to publish the images. (Not yet the case, but it could well be.) Now say we had determined that font blah (which is not freely available) was needed for the best recognition accuracy over a very large test set. How do you propose that we create a freely reusable version that isn't inferior to the one that we use? To me it would be far better to release the tool that makes the tif/box files from text and fonts, and state which fonts we used. Then, if you really want exactly the same results, you could just go and buy the required fonts for a few tens or hundreds of dollars.
Problem 2.
To release all the tif and box files for all the languages that we train would run into hundreds of GB (maybe only tens of GB compressed), but it would require the setup and maintenance headache of yet another download site.
Problem 3.
Our current training code actually throws away the intermediate data (the tif and box files), so I don't have them for any languages other than the ones for which I am working specifically to improve training.

> Otherwise it feels like there is a split between the Google
> Tesseract cabal (to use the most dramatic language possible :p) and
> the community.
>

There is no cabal :)

No, seriously. There is none. Any commit you see in the repository was
made by a volunteer, it's just that some (two, last count) of them
happen to have @google.com addresses, access to (some of) the work
Google has done on Tesseract, and permission to release it.

Other than that, yes, there quite obviously is a split between the
Google volunteers and the other volunteers: for one, the work done in
the open by the latter makes it harder for the former to open Google's
internal work. Not impossibly hard, but when time is limited -- as it
quite clearly is -- doing a three-way merge does not exactly rank
highly on anyone's list of fun things to do.
Actually part of the reason for the lack of updates recently has been a combination of failure to produce the next major improvement and the fact that there is such a thirst for documentation. Cube is a perfect example. It doesn't do much useful, yet now everybody wants it documented, so there is no way I can commit another half-baked experiment that isn't production-ready that everybody will want documented. I have 3 new classifiers in addition to cube that haven't delivered on their early promise. It really is hard to beat the current classifier, although I am starting to understand why a little better.
The good news is that I really really want to get the Google version of the code cleaned up and synced with the outside world this quarter, as there are some improvements in there worth having.

> I love Tesseract; having a very high quality free software OCR
> package is awesome, and I'm very grateful for the amazing work being
> done on it. But I find the lack of parity between those inside
> Google and the wider community to be rather troubling.
>
> If there's anything I can do to help make cube training tools and
> documentation available, or the training source files, I'd be very
> happy to help. Replying offlist if appropriate is fine.


--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

Nick White

Jul 15, 2013, 7:05:05 AM
to tesser...@googlegroups.com
Hi all, thanks for replying. I'll reply to some things inline below.

> Work on open-sourcing the most important training tools (those that don't rely
> on map-reduce) is now well under way. At last!

Brilliant, I look forward to seeing what they look like very much!
And I'm happy to help out with portability of the new tools.

> ☆ change format of all comment in source code to doxygen format - I
> found out that there is a lot of comments that are not part of
> doxygen documentation just because of "wrong" format. This is most
> questionable tasks for me because it could cause additional
> problems when Google will try to sync there code with svn ;-) On
> other hand this could be most useful task for better understanding
> of code ;-)
>
> Hmm. Not so keen on this, but I can see why it would be useful to the open
> source community. The best time to do it is right after I have done a major
> update, so I can get it back into the Google codebase without a major 3-way
> diff.

FWIW I also don't particularly care for doxygen comments. I find
reading code to generally be far more useful than browsing some
weird doxygen documentation format. Zdenko, why do you like the
idea?

> Actually part of the reason for the lack of updates recently has been a
> combination of failure to produce the next major improvement and the fact that
> there is such a thirst for documentation. Cube is a perfect example. It doesn't
> do much useful, yet now everybody wants it documented, so there is no way I can
> commit another half-baked experiment that isn't production-ready that everybody
> will want documented. I have 3 new classifiers in addition to cube that haven't
> delivered on their early promise.

Ah, OK, that's interesting. I think the main reason people (myself
included) kept on about documenting cube and releasing support tools
around it was the belief that it was to be The Way of the future.
I don't think anyone would object to the addition of experimental
classifiers if they were marked as such.

> It really is hard to beat the current
> classifier, although I am starting to understand why a little better.
> The good news is that I really really want to get the Google version of the
> code cleaned up and synced with the outside world this quarter, as there are
> some improvements in there worth having.

That's good and exciting news. I look forward to seeing the New
Stuff :)

Thanks folks!

Nick

zdenko podobny

Jul 15, 2013, 5:50:05 PM
to tesser...@googlegroups.com

zdenko podobny

Jul 16, 2013, 4:21:14 PM
to tesser...@googlegroups.com
On Mon, Jul 15, 2013 at 11:50 PM, zdenko podobny <zde...@gmail.com> wrote:

On Mon, Jul 15, 2013 at 1:05 PM, Nick White <nick....@durham.ac.uk> wrote:

Hi all, thanks for replying. I'll reply to some things inline below.

> Work on open-sourcing the most important training tools (those that don't rely
> on map-reduce) is now well under way. At last!

Brilliant, I look forward to seeing what they look like very much!
And I'm happy to help out with portability of the new tools.

>           ☆ change format of all comment in source code to doxygen format - I
>             found out that there is a lot of comments that are not part of
>             doxygen documentation just because of "wrong" format. This is most
>             questionable tasks for me because it could cause additional
>             problems when Google will try to sync there code with svn ;-) On
>             other hand this could be most useful task for better understanding
>             of code ;-)
>
> Hmm. Not so keen on this, but I can see why it would be useful to the open
> source community. The best time to do it is right after I have done a major
> update, so I can get it back into the Google codebase without a major 3-way
> diff.

FWIW I also don't particularly care for doxygen comments. I find
reading code to generally be far more useful than browsing some
weird doxygen documentation format. Zdenko, why do you like the
idea?

The main reason: part of the code already uses the doxygen comment style, and the doxygen documentation for tesseract-ocr[1] is generated from it.
When I put it online[2] I was not aware that it was incomplete. I do not think it is good to leave it as it is, and reverting it does not make sense either.

I agree we should wait until the next contribution from Google and then fix it before releasing the next version.

Nick White

Nov 4, 2013, 11:40:45 AM
to tesser...@googlegroups.com
On Sun, Jul 14, 2013 at 07:07:06PM -0700, Ray Smith wrote:
> 1. I do not believe that current language data files were created only with
> described process/tools on Training wiki ;-). Also Ray stated this[4]. I
> think that post is still valid (it was at before releasing 3.01 version).
> So asking for tiff/box IMO does not make sense. More interesting would be
> if there is possibility for incremental training (e.g. missing sign).
>
> Work on open-sourcing the most important training tools (those that don't rely
> on map-reduce) is now well under way. At last!

Any update on this? If there's anything I can do to help bring these
tools open-source, do let me know, I can't wait to see them :)

As an aside, I'm very much looking forward to seeing how the line
segmentation in 3.03 compares to 3.02.02 for diacritics, when I have
the time to test it.

Nick

Andreas Romeyke

Nov 6, 2013, 10:24:17 AM
to tesser...@googlegroups.com
Hello,


Am Montag, 4. November 2013 17:40:45 UTC+1 schrieb Nick White:

Any update on this? If there's anything I can do to help bring these
tools open-source, do let me know, I can't wait to see them :)


If help is needed, please do not hesitate to ask me, too.

Bye Andreas

Shree

Nov 7, 2013, 11:28:26 AM
to tesser...@googlegroups.com
Next to come out will be a tool to add the "new" properties to a unicharset file. This will be accompanied  by a set of "universal" unicharsets that contain the properties that have been set from a large number of fonts. Access to fonts is the biggest hindrance to making the training process more open, but with this solution most of that dependence goes away.

After that will be a tool to generate tiff and box files from some text and a set of fonts. It is going to take a rewrite and possibly some changes to portability, like training won't work on windows, unless/until you guys can help fix it.

I tried running training with tesseract 3.03 compiled under cygwin today.

There is a new program called 'set_unicharset_properties', but it looks for a directory with scripts/fonts - I guess this is what Ray was referring to as the "universal" unicharsets.

Also, there already are tools that generate box and tiff files from some text for different fonts - I use jTessBoxEditor by Quan. Does the new tool from Google do something different?

Any idea when these will be made available?

I understand that the box/tiff pairs used for training may not be made available. But what about the new traineddata files?

Thanks!