Nick
On 28 May 2013 18:26, Nick White <nick....@durham.ac.uk> wrote:
> Hi Tesseractors,
>
> I am feeling rather fed up about the lack of openness with the
> Tesseract project.
>
> The addition of the cube mode, and several trainings, with
> absolutely no documentation, or (as far as I can tell) any tools to
> create cube training files, is a good example of this.
>
> As is the lack of tif/box files for any of the core training files
> in the project.
>
There are tif/box files for the Tesseract 2 language packs. There were
no such files for Tesseract 3, they were generated. Ray has expressed
an interest in opening the tool(s) for generating the files, but they
have to be decoupled from Google's internal infrastructure.
> I suspect some of the tif/box files for training aren't being
> released because of concerns about copyright of the image files. If
> that's the case please work to clear them up, or create freely
> reusable versions.
>
I'm pretty sure that's _not_ the case.
> Otherwise it feels like there is a split between the Google
> Tesseract cabal (to use the most dramatic language possible :p) and
> the community.
>
There is no cabal :)
No, seriously. There is none. Any commit you see in the repository was
made by a volunteer, it's just that some (two, last count) of them
happen to have @google.com addresses, access to (some of) the work
Google has done on Tesseract, and permission to release it.
Other than that, yes, there quite obviously is a split between the
Google volunteers and the other volunteers: for one, the work done in
the open by the latter makes it harder for the former to open Google's
internal work. Not impossibly hard, but when time is limited -- as it
quite clearly is -- doing a three-way merge does not exactly rank
highly on anyone's list of fun things to do.
> I love Tesseract; having a very high quality free software OCR
> package is awesome, and I'm very grateful for the amazing work being
> done on it. But I find the lack of parity between those inside
> Google and the wider community to be rather troubling.
>
> If there's anything I can do to help make cube training tools and
> documentation available, or the training source files, I'd be very
> happy to help. Replying offlist if appropriate is fine.
--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you
On Mon, Jul 15, 2013 at 1:05 PM, Nick White <nick....@durham.ac.uk> wrote:
Hi all, thanks for replying. I'll reply to some things inline below.
> Work on open-sourcing the most important training tools (those that don't rely
> on map-reduce) is now well under way. At last!
Brilliant, I look forward to seeing what they look like very much!
And I'm happy to help out with portability of the new tools.
> ☆ change the format of all comments in the source code to doxygen
> format - I found out that there are a lot of comments that are not
> part of the doxygen documentation just because of the "wrong" format.
> This is the most questionable task for me because it could cause
> additional problems when Google tries to sync their code with
> svn ;-) On the other hand, this could be the most useful task for
> better understanding of the code ;-)
>
> Hmm. Not so keen on this, but I can see why it would be useful to the open
> source community. The best time to do it is right after I have done a major
> update, so I can get it back into the Google codebase without a major 3-way
> diff.
FWIW I also don't particularly care for doxygen comments. I find
reading code to generally be far more useful than browsing some
weird doxygen documentation format. Zdenko, why do you like the
idea?
Any update on this? If there's anything I can do to help bring these
tools open-source, do let me know, I can't wait to see them :)
Next to come out will be a tool to add the "new" properties to a unicharset file. This will be accompanied by a set of "universal" unicharsets that contain the properties that have been set from a large number of fonts. Access to fonts is the biggest hindrance to making the training process more open, but with this solution most of that dependence goes away.
After that will be a tool to generate tiff and box files from some text and a set of fonts. It is going to take a rewrite and possibly some changes to portability; for example, training won't work on Windows unless/until you guys can help fix it.