I think I managed to miss mentioning it completely, but there's
nothing that *forces* a word to be recognised as a dictionary word;
the dictionary is just used to establish character confidences. Where
you really see the difference is across a longer piece of text, once
the adaptive classifier has seen enough examples to know "hey, this
thing I thought was an 'f' might actually be a 't'". In short texts,
there's not much to adapt to. Making a bunch of training images,
drawing boxfiles, etc., only goes so far, so tess uses the dictionary
as an approximation: a low-confidence equivalent of training pages.
On the plus side, it turns out that there are functions buried in the
code to serialise/deserialise the classifier state, so it might be
useful to run a whole corpus of short images through tess in one
batch, save the state, and load that at startup.
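Something along these lines is what I had in mind; a rough, untested
sketch against the public API, where the include paths and the
classify_save_adapted_templates / classify_use_pre_adapted_templates
parameter names are from memory, so treat them as placeholders and
check the params in the source before relying on them:

  #include <cstdio>
  #include <tesseract/baseapi.h>
  #include <leptonica/allheaders.h>

  int main(int argc, char** argv) {
    tesseract::TessBaseAPI api;
    if (api.Init(NULL, "eng") != 0) {
      fprintf(stderr, "Could not initialise tesseract\n");
      return 1;
    }
    // Assumption: this asks tess to write the adapted templates out
    // when it shuts down; it may need to go in a config file handed
    // to Init() rather than through SetVariable().
    api.SetVariable("classify_save_adapted_templates", "1");

    // Run the whole corpus of short images through in one batch, so
    // the adaptive classifier has enough examples to adapt to.
    for (int i = 1; i < argc; ++i) {
      Pix* image = pixRead(argv[i]);
      if (image == NULL)
        continue;
      api.SetImage(image);
      char* text = api.GetUTF8Text();
      printf("%s\n", text);
      delete[] text;
      pixDestroy(&image);
    }
    api.End();
    // A later run would set classify_use_pre_adapted_templates
    // instead, to load the saved state at startup.
    return 0;
  }

If I'm reading the code right, the functions doing the actual work
are the ReadAdaptedTemplates()/WriteAdaptedTemplates() pair in the
adaptive classifier code, but don't quote me on that.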
--
That's true, but the results would have been more or less the same anyway.
Anyway, going by some of the stuff Google have published, there will
be a post-editing facility in Tesseract in the future, where the
dictionaries and something very much like DangAmbigs will be used in
more or less the way people expected them to be used.
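For anyone who hasn't poked at it, DangAmbigs is just a plain text
file of ambiguity rules: each line gives the number of characters in
one string, the characters themselves (space separated), then the
same for the string it can be confused with. From memory the entries
look something like the following, though I may well have the order
of the two strings backwards, so check the training docs rather than
trusting these made-up examples:

  1 m 2 r n
  1 0 1 O
  1 S 1 5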
That post-editing code might actually be in the codebase now (hey,
it's quite large, and I don't have a huge amount of spare time to dig
through it), but so far I've only found the training code (and that's
not quite set up to be used yet).
This commit converts the documentation of some of those
serialise/deserialise functions to Doxygen:
http://code.google.com/p/tesseract-ocr/source/detail?r=447#
Seems right; the *_VAR and *_VAR_H declarations are usually
'balanced'. I put it back in, in r448.
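Just for clarity, the convention is that each variable is declared
twice: once in a header with the *_VAR_H form, and once in the
matching .cpp with the *_VAR form, both with the same default and
description string. Roughly like this (made-up variable name, and
whether the header line also needs an explicit extern has varied
between versions, so check a neighbouring declaration):

  // somemodule.h: declare the variable for other files
  BOOL_VAR_H(tessedit_example_flag, FALSE, "Made-up example flag");

  // somemodule.cpp: define it, with the same default and description
  BOOL_VAR(tessedit_example_flag, FALSE, "Made-up example flag");

So when one half goes missing, it tends to stand out.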