I don't know if this is still relevant, but personally I wouldn't
invest much time in studying Cube's behavior (at least for the
moment), as it will certainly undergo substantial source code changes
(the source code comments say as much), and so will the way Tesseract
and Cube interact. Currently Tesseract does all the segmentation
itself and then passes the segmented results to Cube on a word-by-word
basis. A selection step then decides which of the two did the better
OCR: Tesseract or Cube.
However, if you still wish to dig in, refer to "cube_control.cpp" and
the "cube" source directory.
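To make the hand-off described above a bit more concrete, here is a toy sketch in Python. It is not the logic in "cube_control.cpp"; the WordResult type, the confidence field and the "higher confidence wins" rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class WordResult:
    text: str
    confidence: float  # hypothetical score in [0, 1]; higher is better


def pick_better(tess: WordResult, cube: WordResult) -> WordResult:
    """Keep whichever engine reports the higher confidence for one word.

    Assumed rule for illustration; the real selection may differ.
    """
    return cube if cube.confidence > tess.confidence else tess


def combine(tess_words, cube_words):
    # Both lists come from the same segmentation (done once, up front),
    # so the words line up one-to-one.
    return [pick_better(t, c) for t, c in zip(tess_words, cube_words)]


tess = [WordResult("Hello", 0.90), WordResult("w0rld", 0.40)]
cube = [WordResult("Hcllo", 0.60), WordResult("world", 0.80)]
print(" ".join(w.text for w in combine(tess, cube)))
```

The point of the sketch is only the shape of the flow: one segmentation, two recognizers run per word, and one choice per word.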
HTH
Warm regards,
Dmitri Silaev
www.CustomOCR.com
First of all, did you train a new font using your source images? For
the image you showed earlier, that is still a crucial step toward
success, with or without a dictionary. Your postal address font is
very specific.
Put simply, Tesseract's word matching is almost an exhaustive
enumeration of "chop" points, in other words an enumeration of the
ways to partition the connected components. The pixels between every
pair of chop points are treated as a potential symbol and matched
against the trained templates. The best few matches are saved and then
"permuted" using various methods to produce possible word choices. The
dictionary is, to some degree, regarded as one such "permuter".
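A toy Python sketch of that idea, under heavy assumptions: the "blob" is a plain string whose characters stand in for pixel columns, TEMPLATES is a made-up table of trained templates with made-up scores, and the dictionary is modeled as a simple preference for known words. None of these names come from the Tesseract source.

```python
from itertools import combinations, product
from math import prod

# Hypothetical trained templates: piece of the blob -> [(symbol, score)].
TEMPLATES = {
    "c": [("c", 0.9)],
    "o": [("o", 0.9)],
    "r": [("r", 0.7)],
    "n": [("n", 0.7)],
    "rn": [("m", 0.8)],  # "r" and "n" left unchopped look like "m"
}


def partitions(blob, chop_points):
    """Yield every split of the blob at any subset of the chop points."""
    for k in range(len(chop_points) + 1):
        for cuts in combinations(chop_points, k):
            edges = [0, *cuts, len(blob)]
            yield [blob[a:b] for a, b in zip(edges, edges[1:])]


def word_choices(blob, chop_points):
    """Match each piece against the templates and collect word candidates."""
    choices = {}
    for pieces in partitions(blob, chop_points):
        if not all(p in TEMPLATES for p in pieces):
            continue  # some piece matches no template; drop this partition
        for combo in product(*(TEMPLATES[p] for p in pieces)):
            word = "".join(sym for sym, _ in combo)
            score = prod(s for _, s in combo)
            choices[word] = max(score, choices.get(word, 0.0))
    return choices


def recognize(blob, chop_points, dictionary=frozenset()):
    """Pick the best candidate; dictionary words outrank non-words."""
    choices = word_choices(blob, chop_points)
    return max(choices, key=lambda w: (w in dictionary, choices[w]),
               default=None)


print(recognize("corn", [1, 2, 3]))            # no dictionary
print(recognize("corn", [1, 2, 3], {"corn"}))  # dictionary as a "permuter"
```

With three chop points there are eight partitions, and only two of them consist entirely of pieces that match a template. Note that when no piece matches any template, word_choices returns nothing and the dictionary has no candidates to rank.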
I've made some basic checks of how the dictionary works in the
current revision, and from what I've seen it is fine. But if your
training glyphs are very different from those you are trying to
recognize, the dictionary permuter won't get a chance to come into
play.
Warm regards,
Dmitri Silaev
www.CustomOCR.com