Using tesseract in CUBE mode

Amrit

Apr 14, 2011, 4:05:37 PM
to tesseract-ocr
Hi All,
As part of my ongoing work on developing a postal address recognizer,
I was excited to discover the implementation of CUBE mode, and
especially the thought that it might be used to incorporate some
language modelling techniques alongside Tesseract.
I believe it can be activated simply by changing Tesseract's
initialization mode from

    api.Init(argv[0], lang, tesseract::OEM_DEFAULT,
             &(argv[arg]), argc - arg, NULL, NULL, false);

to:

    api.Init(argv[0], lang, tesseract::OEM_CUBE_ONLY,
             &(argv[arg]), argc - arg, NULL, NULL, false);
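
For context, here is the minimal end-to-end flow I am testing (only a
sketch; it assumes a Tesseract 3.x build with Cube support plus
Leptonica, and "address.tif" is just a placeholder for one of my
postal images):

    #include <tesseract/baseapi.h>
    #include <leptonica/allheaders.h>
    #include <cstdio>

    int main() {
        tesseract::TessBaseAPI api;
        // OEM_CUBE_ONLY runs Cube alone; OEM_TESSERACT_CUBE_COMBINED runs both engines.
        if (api.Init(NULL, "eng", tesseract::OEM_CUBE_ONLY) != 0) {
            fprintf(stderr, "Could not initialize tesseract\n");
            return 1;
        }
        Pix* image = pixRead("address.tif");   // placeholder input image
        api.SetImage(image);
        char* text = api.GetUTF8Text();
        printf("%s", text);
        delete[] text;
        pixDestroy(&image);
        api.End();
        return 0;
    }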

On doing so I have some queries:

1) What exactly is the difference between OEM_CUBE_ONLY and
OEM_TESSERACT_CUBE_COMBINED?
At a high level, from the little material I could find on the cube
implementation, I understand that running Tesseract in cube mode can
improve accuracy, especially for connected scripts such as Arabic. I am
trying to use it to recognize English alphanumeric text alone, so would
it be safe to expect better accuracy?
So far, on the couple of images I have tested, the results have not
shown any remarkable improvement.

2) Furthermore, under tessdata I see files such as:
   a) eng.cube.lm - which contains a listing of whitelist characters
      and seems to define the character space for Tesseract to work on.
   b) eng.cube.bigrams and eng.cube.word-freq - I am not sure how these
      are currently being used, or to what effect.

3) Is there a way of customizing the above and using them in Tesseract?
(I would have assumed they are part of eng.traineddata, but when I
split that file I do not find them among its members.) For example,
instead of setting the whitelist in code, could we customize
eng.cube.lm and use that instead to restrict Tesseract's character
output? (The snippet below shows what I am doing in code at the moment.)
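
For reference, this is roughly how I restrict characters in code today
(tessedit_char_whitelist is the variable I am using; whether Cube
respects it is exactly what I am unsure about):

    #include <tesseract/baseapi.h>

    int main() {
        tesseract::TessBaseAPI api;
        if (api.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY) != 0)
            return 1;
        // Restrict output to upper-case letters and digits.
        api.SetVariable("tessedit_char_whitelist",
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789");
        // ... SetImage() / GetUTF8Text() as usual ...
        api.End();
        return 0;
    }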

Regards,
Amrit.

Dmitri Silaev

Apr 18, 2011, 4:10:10 AM
to tesser...@googlegroups.com, Amrit
Well, I may know no more than you do. You've probably found this
remark yourself, but some time ago Ray Smith casually mentioned that
"Cube increases the accuracy slightly, but adds a lot of compute
time." (https://groups.google.com/d/msg/tesseract-ocr/0msQtTB_XrI/D1noel9GpPgJ)

I don't know if this is currently relevant, but personally I wouldn't
invest much time in studying Cube's behavior (at least for the
moment), as it will certainly undergo many substantial source code
changes (this is even noted in the source code comments), as will the
way Tesseract and Cube interact. Currently Tesseract segments
everything itself and then passes the segmented results to Cube on a
word-by-word basis. Then some selection happens as to which of the two
did the better OCR: Tess or Cube.
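
If it helps, this is how I picture that hand-off; purely schematic, and
the names below are mine, not the actual routines in cube_control.cpp:

    #include <cstdio>
    #include <string>

    // Each word is recognized by both engines; a per-word comparison keeps
    // whichever result scores better (the real selection logic is more involved).
    struct WordResult {
        std::string text;
        float confidence;
    };

    // Hypothetical stand-ins for the per-word recognizers.
    WordResult RunTessOnWord() {
        WordResult r; r.text = "TESS"; r.confidence = 0.80f; return r;
    }
    WordResult RunCubeOnWord() {
        WordResult r; r.text = "CUBE"; r.confidence = 0.85f; return r;
    }

    WordResult RecognizeWordCombined() {
        WordResult tess = RunTessOnWord();
        WordResult cube = RunCubeOnWord();
        return (cube.confidence > tess.confidence) ? cube : tess;
    }

    int main() {
        WordResult best = RecognizeWordCombined();
        printf("winner: %s (%.2f)\n", best.text.c_str(), best.confidence);
        return 0;
    }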

However, if you still wish to dig, refer to "cube_control.cpp" and the
"cube" source directory.

HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com


Amrit

Apr 18, 2011, 3:22:21 PM
to tesseract-ocr
Thanks Dmitri,
As mentioned earlier, my expectation for using Tesseract in cube mode
hinges on the possibility of applying some kind of grammar / language
model restriction to the word recognition that is happening
(eng.cube.lm, eng.cube.bigrams, etc.).
My understanding of Tesseract's recognition is that the image text is
segmented at the character level and stored as blobs. These blobs are
recognized individually with the help of the unicharset for the given
language. Word recognition then takes place based on that character
input, inside page_res_it (assuming that is the iterator) ->
page_res->werd_res (which contains the extra information about the
physical location of the word in the image).
What I am still looking for is the grammar/dictionary that must be
associated with the decoding of a single word, so that Tesseract can
validate whether whatever it has recognized at the character level,
when put together, actually forms a valid word. If the word is not
found in the dictionary, then the result is whatever character-level
recognition alone produced (this is when the output is sometimes a
group of random characters).
Do correct me if I am wrong in assuming the above, but it would really
help me to get hold of this grammar/dictionary, if it is being used at
all. It would let me suppress the random results I am observing in my
OCR output.

Regards,
Amrit.

Dmitri Silaev

Apr 20, 2011, 9:33:20 AM
to tesser...@googlegroups.com, Amrit
Amrit,

First of all, did you train a new font using your source images? For
the image you showed earlier, that is still a crucial step for
success, with or without a dictionary. Your postal address font is
very specific.

Simplistically, Tesseract's word matching is almost an exhaustive
enumeration of "chop" points, in other words an enumeration of
connected-component partitions. The pixels between every pair of chop
points are treated as a potential symbol and matched against the
trained templates. The best matches are saved and then "permuted"
using various methods to produce possible word choices. The dictionary
is, to some degree, treated as one such "permuter".
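
Schematically (a toy of my own, nothing like the real classifier): with
three candidate chop points there are 2^3 ways to partition the run of
pixels, and each partition's pieces would then be matched against the
templates.

    #include <cstdio>

    int main() {
        const int kChops = 3;  // candidate chop points => kChops + 1 atomic pieces
        for (int mask = 0; mask < (1 << kChops); ++mask) {
            printf("partition %d: |", mask);
            for (int i = 0; i <= kChops; ++i) {
                printf("p%d", i);
                if (i < kChops)
                    printf("%s", (mask >> i) & 1 ? "|" : "+");  // '|' chop, '+' merge
            }
            printf("|\n");
        }
        return 0;
    }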

I've made some basic checks of how the dictionary works in the current
revision, and from what I've seen I think it's fine. But if your
training glyphs are very different from those you are trying to
recognize, the dictionary permuter won't have any chance to come into
play.
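
If you want to see the dictionary's influence directly, the only knobs
I know of are the DAWG-loading parameters. They are init-time settings,
so they have to be passed to Init() rather than set afterwards with
SetVariable(). A sketch from memory, so do verify the parameter names
and header locations against your revision:

    #include <tesseract/baseapi.h>
    #include <tesseract/genericvector.h>
    #include <tesseract/strngs.h>

    int main() {
        // Disable the built-in word and frequent-word dictionaries so you
        // can compare results with and without the dictionary permuter.
        GenericVector<STRING> vars, vals;
        vars.push_back(STRING("load_system_dawg"));
        vals.push_back(STRING("F"));
        vars.push_back(STRING("load_freq_dawg"));
        vals.push_back(STRING("F"));

        tesseract::TessBaseAPI api;
        if (api.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY,
                     NULL, 0, &vars, &vals, false) != 0)
            return 1;
        // ... SetImage() / GetUTF8Text() as usual ...
        api.End();
        return 0;
    }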

Warm regards,
Dmitri Silaev
www.CustomOCR.com
