Cube documentation, training source files, and openness

Nick White

May 30, 2013, 10:48:01 AM
to tesser...@googlegroups.com
Hi Tesseractors,

I am feeling a bit fed up about the lack of openness with the
Tesseract project.

The addition of the cube mode, and several trainings, with
absolutely no documentation, or (as far as I can tell) any tools to
create cube training files, is a good example of this.

As is the lack of tif/box files for any of the core training files
in the project.

Keeping the cube tools and documentation private sucks royally. If
they aren't perfect or polished, it doesn't matter; we could help
to fix them up!

I suspect some of the tif/box files for training aren't being
released because of concerns about copyright of the image files. If
that's the case please work to clear them up, or create freely
reusable versions.

I love Tesseract; having a very high quality free software OCR
package is awesome, and I'm very grateful for the amazing work being
done on it. But I find the lack of parity between those inside
Google and the wider community to be rather troubling.

If there's anything I can do to help make cube training tools and
documentation available, or the training source files, I'd be very
happy to help. Replying offlist if appropriate is fine.

Nick

Dmitri Silaev

May 30, 2013, 11:32:52 AM
to tesser...@googlegroups.com
Excellent post, Nick! The more I read, the more I felt I should have
asked these questions myself, but I haven't yet. I'm afraid, though,
that many of them will remain unanswered.

After several years of following and asking questions in this forum, I
have got used to the feeling that the principal developers only post
new release announcements. In the early years they were much more
active in discussions. I suppose many forum questions are tedious to
answer over and over again, since the forum search could be used and
many people are just too lazy to use it. But some questions are not
like that and deserve answers.

Now it looks as if Google is doing us a favor by making a formerly
commercial engine open source and sharing its developments from time
to time. The community contribution now is limited to enhancing
release packages and fixing trivial bugs. Without proper
documentation or at least clues on how all this (not only Cube) works,
developers keep community contribution nominal. I personally need more
info and am ready to contribute once I understand the code well
enough. I used to explore the code on my own, but the potential of
that approach is limited. Off the bat, I'm interested in segmentation,
details of the class pruner and integer matcher, a description of
Cube, and best practices for training data generation. I expect more
questions will follow once I get more info on these.

--
Dmitri

TP

May 30, 2013, 4:20:18 PM
to tesseract-ocr
On Thu, May 30, 2013 at 8:32 AM, Dmitri Silaev <daemo...@gmail.com> wrote:
> The community contribution now is limited to enhancing
> release packages and fixing trivial bugs. Without proper
> documentation or at least clues on how all this (not only Cube) works,
> developers keep community contribution nominal.

+1 (except, of course, for Zdenko, who puts in lots of work to release Tesseract)

Nick White

Jun 3, 2013, 10:25:51 AM
to tesser...@googlegroups.com
I wonder, would others here be interested in figuring out and
documenting little bits of how the code works?

I spent some time in the line segmentation code a little while ago,
to figure out better configuration parameters for line segmentation
for the Ancient Greek training (which ended up being pretty
successful), and I could certainly contribute a partial description
of how it works.

If others are interested in doing this for key sections (like the
parts Dmitri suggested), perhaps we should set up a wiki and get to
work? It wouldn't be comprehensive, of course, but sharing what we
know could still prove pretty useful.

What do people think? Is anyone else interested in doing this?

I'll dig out the (very scrappy) notes I made on line segmentation,
clean them up, and post them here, when I get time. If anyone else
is interested, I'll set up a wiki somewhere.

Nick

Sven Pedersen

Jun 3, 2013, 10:49:46 AM
to tesser...@googlegroups.com
Sounds good. I think we should make some attempt to reverse engineer the Cube engine. I imagine Google will eventually release documentation, but we don't know when. If we document it ourselves, they may be more inclined to give their side of it more quickly. It is quite possible they don't have much internal documentation anyway.
--Sven
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

Shree Devi Kumar

Jun 3, 2013, 1:18:14 PM
to tesser...@googlegroups.com
Great idea! 

I would suggest putting the documentation in a wiki instead of here. That way it will be easier to refer to and find later.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Nick White

Jun 10, 2013, 12:58:06 PM
to tesser...@googlegroups.com
Right then, I've created a wiki in Google Code for this collective
effort.

https://code.google.com/p/tesseract-ocr-extradocs/

I have spent some time this last week reading some of the cube code
and figuring out the purpose of the various cube training files. I
still don't know the most interesting part, which is exactly how
the .nn files are used, but reading the code was taking me a while,
so I thought I'd just post what I have so far.

If anyone wants to add to the wiki let me know and I'll gladly add
you to the project.

The next thing on my list to document is line segmentation, though I
should probably try to add more information on how cube works first.

I hope this looks useful to people, and inspires everyone to dig
into all of the code :)

Nick

Shree Devi Kumar

Jun 14, 2013, 4:34:49 AM
to tesser...@googlegroups.com
Thanks, Nick.

It is good to have some cube info. Please add the list of languages that use cube mode. I know that Hindi uses option 2, i.e. the combined cube and Tesseract mode.

Regarding neural networks, I have read that the neural network (nn) code was removed from Tesseract because it was not open source. That may explain why there is minimal nn code in 3.02. Please see: http://www.cedricve.me/2013/04/12/how-to-train-tesseract/

Shree


