Hey, Derek.
Thank you for the scripts; they seem to work.
However, I have a couple of questions:
0. So, I've compiled the svn version of tesseract and installed it to the
/local/tesseract-svn prefix with all the language files.
I've also added /local/tesseract-svn/bin to PATH so that the binaries
there can be called from the scripts.
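A minimal sketch of that kind of PATH setup, assuming a POSIX shell. The prefix is the one mentioned above; the helper name prepend_path is hypothetical and not part of the scripts:

```shell
# Prepend a directory to PATH unless it is already listed there.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

prepend_path /local/tesseract-svn/bin

# Show which tesseract binary (if any) the scripts would pick up.
command -v tesseract || echo "tesseract not found on PATH" >&2
```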
1. Then I created the text.txt file with a nice long text in it.
2. I ran
python text2img.py -b -i _some_fonts_here
and now I have the png files.
3. Then I ran png2tif.sh and got all the tif files.
That's correct.
4. Then I am supposed to run autotrain.sh, right?
Anyway, it fails at the first step, make_boxes.sh.
I debugged the script by adding "set -x" to it, and I get:
---
+ LANG=hye
+ for file in '*.tif'
++ basename hye.Dejavu_Serifbold.exp0.tif
+ filename=hye.Dejavu_Serifbold.exp0.tif
+ filename=hye.Dejavu_Serifbold.exp0
+ tesseract hye.Dejavu_Serifbold.exp0.tif hye.Dejavu_Serifbold.exp0 -l hye batch.nochop makebox
Error opening data file /local/tesseract-svn/share/tessdata/hye.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'hye'
Tesseract couldn't load any languages!
Could not initialize tesseract.
---
and the same messages for all the fonts.
Obviously, there is no hye.traineddata file there.
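A quick way to confirm that, as a sketch: it assumes the tessdata directory is the one printed in the error message, and has_traineddata is a hypothetical helper name used only for illustration.

```shell
# has_traineddata TESSDATA_DIR LANG
# Succeeds if LANG.traineddata exists as a regular file in TESSDATA_DIR.
has_traineddata() {
    [ -f "$1/$2.traineddata" ]
}

tessdata=/local/tesseract-svn/share/tessdata   # path taken from the error message

# Check both language codes that appear in this post.
for lang in hye hy; do
    if has_traineddata "$tessdata" "$lang"; then
        echo "$lang: found"
    else
        echo "$lang: missing $tessdata/$lang.traineddata"
    fi
done
```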
I wonder whether it should be there at this step, when I am bootstrapping a
new language?
According to
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
when bootstrapping a new language one has to run:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
which is what the make_boxes.sh script tries to do, and which also fails from
the command line:
$ tesseract hye.DejaVu_Sansitalic.exp0.tif hye.DejaVu_Sansitalic.exp0 -l hy batch.nochop makebox
Error opening data file /local/tesseract-svn/share/tessdata/hy.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'hy'
Tesseract couldn't load any languages!
Could not initialize tesseract.
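To reproduce this for every font without actually invoking tesseract, the wiki's command template can be expanded with a small dry-run sketch like the one below. It only prints the commands. box_command is a hypothetical helper name, not something from make_boxes.sh, and the language code is passed in explicitly because both hye and hy were tried above.

```shell
# box_command TIF_FILE LANG
# Print (do not run) the makebox command for one tif file,
# following the [lang].[fontname].exp[num] naming from the wiki.
box_command() {
    name=$(basename "$1")     # drop any leading directory
    base=${name%.tif}         # drop the .tif extension
    printf 'tesseract %s %s -l %s batch.nochop makebox\n' "$name" "$base" "$2"
}

for tif in *.tif; do
    [ -e "$tif" ] || continue   # skip the literal pattern when nothing matches
    box_command "$tif" hye
done
```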
Any ideas?