Hi Rob,
You're getting there, don't worry :)
On Fri, May 16, 2014 at 08:56:50AM -0700, Rob Stewart wrote:
> [snip]
> unicharset_extractor eng.FreeSans.exp0.box
>
> set_unicharset_properties -U unicharset -O unicharset.out --script_dir=../tesseract-ocr-read-only/training/langdata
>
> shapeclustering -F font_properties -U unicharset eng.FreeSans.exp0.tr
> #shapeclustering -F font_properties -U unicharset.out eng.FreeSans.exp0.tr
>
> mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr
> #mftraining -F font_properties -U unicharset.out -O eng.FreeSans.exp0.tr
>
> #cntraining eng.FreeSans.exp0.tr
> Once I get down to shapeclustering I can't tell from the documentation which
> unicharset file to use: the first one produced, or the one produced by the
> 'set_unicharset_properties' command.
The one produced by set_unicharset_properties is always better to
use, as it should have correct attributes for each character.
Note that shapeclustering is generally not recommended for most
scripts (I think it's just Devanagari scripts that it's used for at
the moment). I tested with and without it for my grc training, and
results were far better without it.
> Either way the mftraining usually fails, sometimes a second attempt at running
> shapeclustering and mftraining outside of this shell file works, but almost
> every time I get the following error...
You're calling mftraining slightly incorrectly. The -O argument is
for the resulting unicharset, not the .tr file; tesseract is
probably getting upset at you overwriting the .tr with a unicharset
file while (or maybe even before) reading it. In my grc makefile, I
call it like this:

    mftraining -F font_properties -U grc.earlyunicharset -O grc.unicharset grc*tr

(grc.earlyunicharset is the output from set_unicharset_properties).
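
Applied to your script, the sequence might look roughly like the sketch
below. It is untested against your data, and a few things in it are
assumptions of mine rather than anything from your setup: the output name
eng.unicharset (chosen by analogy with grc.unicharset), the DRY_RUN switch,
and the run helper. shapeclustering is left out, as discussed above. By
default it only prints the commands; run it with DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the training steps with the -O argument of mftraining fixed.
# Assumed names (not from the original script): eng.unicharset, DRY_RUN, run.
set -eu

LANG_CODE=eng
BASE="$LANG_CODE.FreeSans.exp0"
EARLY=unicharset.out    # output of set_unicharset_properties
SCRIPT_DIR=../tesseract-ocr-read-only/training/langdata

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# 1. Extract the initial unicharset from the box file.
run unicharset_extractor "$BASE.box"

# 2. Fill in character properties; use this output from here on.
run set_unicharset_properties -U unicharset -O "$EARLY" --script_dir="$SCRIPT_DIR"

# 3. -O names the resulting unicharset; the .tr file is a plain argument.
run mftraining -F font_properties -U "$EARLY" -O "$LANG_CODE.unicharset" "$BASE.tr"

# 4. Character normalisation training on the same .tr file.
run cntraining "$BASE.tr"
```

The important difference from your version is step 3: the .tr file is no
longer the value of -O, so it cannot be overwritten by the unicharset.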
> Any help would be appreciated. Also I think adding this kind of shell script
> (or equivalent) to a 'fast start' for training could be useful.
You may find the Makefile from my grc repository helpful. Get it
with:

    git clone http://ancientgreekocr.org/grc.git

I decided to use a Makefile rather than a shell script so that I can
test changes and only the appropriate parts are re-run, rather than
everything.
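
If you would rather stay with a shell script, the idea can be approximated
by hand. The sketch below is only an illustration of the timestamp check
that make performs, not something taken from my Makefile; needs_rebuild is
a hypothetical helper name, and the -nt test is an extension that bash,
dash and ksh provide rather than strict POSIX.

```shell
#!/bin/sh
# needs_rebuild TARGET SOURCE...
# Succeeds if TARGET is missing or any SOURCE is newer than TARGET,
# which is roughly the rule make applies to a target and its prerequisites.
needs_rebuild() {
    target=$1
    shift
    [ -e "$target" ] || return 0
    for src in "$@"; do
        [ "$src" -nt "$target" ] && return 0
    done
    return 1
}

# Example: only redo the properties step when its input has changed.
if needs_rebuild unicharset.out unicharset; then
    echo "would re-run set_unicharset_properties"
fi
```

A Makefile gives you the same behaviour for every step without writing
these checks yourself, which is why I went that way.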
Nick