Scripts to semi-automate training


Derek Dohler

May 24, 2012, 3:02:53 PM
to tesser...@googlegroups.com
Hi all,

I have been doing a lot of tesseract training recently, so I decided to put together some Python and shell scripts to speed up the process. I haven't done anything to prepare these for public consumption, but they have made my life a lot easier, so I thought I'd throw them out on the list in case anyone else finds them useful.

Just a heads-up: the default language is Georgian because that's what I'm training for, so make sure to change that to your language when training.

https://github.com/ddohler/tess_school

Cheers,
Derek

Sriranga(78yrsold)

May 25, 2012, 6:18:50 AM
to tesser...@googlegroups.com
Windows users,
I downloaded the Python and shell scripts. I am using WinXP (SP3) and have installed Python 2.6 and Python 3.2. When I tried to run text2image.py, it displayed an error message; see the attached screenshots.
I am a newbie to Python and don't know how to solve the cairo problem.
Any guidance would be much appreciated.
With regards,
-sriranga(79yrs)




2python.JPG
3cairo.JPG
python.JPG

Nick White

May 25, 2012, 6:00:34 AM
to tesser...@googlegroups.com
Hi Derek,

Thanks for sharing the scripts. I've been doing similar stuff
myself; see https://www.dur.ac.uk/nick.white/tools/

Particularly useful and interesting is the masstrain.sh script,
which creates an image and box file for each font on the system
which has enough of the required glyphs available.
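
The "enough of the required glyphs" test could look roughly like the sketch below. This is not how masstrain.sh does it; the function name, the `min_fraction` threshold, and the coverage data are all hypothetical, and a real script would read glyph coverage from the installed fonts.

```python
def fonts_with_coverage(font_glyphs, required_chars, min_fraction=1.0):
    """Return the names of fonts covering at least min_fraction of required_chars.

    font_glyphs maps a font name to the set of characters it can render
    (hypothetical input; a real script would query the font files).
    """
    required = set(required_chars)
    if not required:
        return sorted(font_glyphs)
    usable = []
    for name, glyphs in font_glyphs.items():
        covered = len(required & set(glyphs))
        if covered / len(required) >= min_fraction:
            usable.append(name)
    return sorted(usable)


# Made-up coverage data, for illustration only.
FONT_GLYPHS = {
    "DejaVu Sans": set("abcdefgh"),
    "Tiny Font": set("abc"),
}

print(fonts_with_coverage(FONT_GLYPHS, "abcdef", min_fraction=0.9))
```

Each font that passes the check would then get its own image and box file.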

I have quite a few other shell scripts I'm using to do neat things,
but they aren't generally useful, at least yet.

I'm using these to supplement hand training, not to supplant it, but
semi-automating the training process is certainly a useful thing.

Feedback would be welcome, of course.

Nick

shikamuk

Jun 3, 2012, 2:29:26 PM
to tesseract-ocr
Hey, Derek.
Thank you for the scripts, they seem to work.

However, a couple of questions:

0. So, I've compiled the svn version of tesseract and installed it to the
/local/tesseract-svn prefix with all language files.
I've also added /local/tesseract-svn/bin to PATH so that binaries
from there can be called from scripts.

1. Then, I've created the text.txt file with a nice long text in it.

2. I've run
python text2img.py -b -i _some_fonts_here
Now I have png files.

3. Then I run png2tif.sh and get all tif files.
That's correct.

4. Then I am supposed to run autotrain.sh, right?
Anyway, it fails at the first step, make_boxes.sh.
I debugged the script by adding "set -x" to it, and I got:

---
+ LANG=hye
+ for file in '*.tif'
++ basename hye.Dejavu_Serifbold.exp0.tif
+ filename=hye.Dejavu_Serifbold.exp0.tif
+ filename=hye.Dejavu_Serifbold.exp0
+ tesseract hye.Dejavu_Serifbold.exp0.tif hye.Dejavu_Serifbold.exp0 -l hye batch.nochop makebox
Error opening data file /local/tesseract-svn/share/tessdata/hye.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to
the parent directory of your "tessdata" directory.
Failed loading language 'hye'
Tesseract couldn't load any languages!
Could not initialize tesseract.
---

and the same messages for all the fonts.

Obviously, there is no hye.traineddata file there.
I wonder whether it should exist at this step, when I am bootstrapping a
new language?
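
A quick way to confirm which language files are missing before running the scripts is sketched below. It assumes, as the error message above says, that TESSDATA_PREFIX is the parent directory of "tessdata"; the helper names are made up for illustration and are not part of tess_school.

```python
import os


def traineddata_path(tessdata_prefix, lang):
    """Build the path tesseract reports in the error above:
    <prefix>/tessdata/<lang>.traineddata."""
    return os.path.join(tessdata_prefix, "tessdata", lang + ".traineddata")


def missing_languages(tessdata_prefix, langs):
    """Return the language codes that have no .traineddata file under the prefix."""
    return [
        lang for lang in langs
        if not os.path.isfile(traineddata_path(tessdata_prefix, lang))
    ]


if __name__ == "__main__":
    # The default prefix below is only an example taken from the log above.
    prefix = os.environ.get("TESSDATA_PREFIX", "/local/tesseract-svn/share")
    print(missing_languages(prefix, ["eng", "hye"]))
```

Any code listed in the output cannot be passed to "-l" yet.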

According to http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3,
when bootstrapping a new language one has to issue:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox

which is what the make_boxes.sh script tries to do, and which also fails
from the command line:

$ tesseract hye.DejaVu_Sansitalic.exp0.tif hye.DejaVu_Sansitalic.exp0 -l hy batch.nochop makebox
Error opening data file /local/tesseract-svn/share/tessdata/hy.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to
the parent directory of your "tessdata" directory.
Failed loading language 'hy'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Any ideas?
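
The [lang].[fontname].exp[num].tif naming convention used in the commands above could be checked with a small helper like the one below. This is only a sketch: the function name and the regular expression are assumptions for illustration, and the pattern assumes that neither the language code nor the font name contains a dot.

```python
import re

# [lang].[fontname].exp[num].tif, e.g. hye.DejaVu_Sansitalic.exp0.tif
TRAINING_NAME = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def parse_training_name(filename):
    """Split a training image name into (lang, fontname, num).

    Returns None when the name does not follow the convention.
    """
    match = TRAINING_NAME.match(filename)
    if match is None:
        return None
    return match.group("lang"), match.group("font"), int(match.group("num"))


print(parse_training_name("hye.DejaVu_Sansitalic.exp0.tif"))
print(parse_training_name("notes.txt"))
```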

Derek

Jun 4, 2012, 12:03:52 PM
to tesser...@googlegroups.com
Hi Shikamuk,

Hello from neighboring Georgia! You're exactly right: the issue is that you don't have hye.traineddata yet. For completely new character sets, you need to issue the tesseract command without "-l yournewlanguage". The line you're referring to describes what to do after you have trained Tesseract on one font in your new language. Since you are training for a unique script, it doesn't really matter what you use as the language code; you will get equally bad results no matter what.
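
As a rough sketch of the difference, the helper below builds the makebox command with or without "-l". It mirrors the command shown in the "set -x" trace earlier in the thread, but the function itself is hypothetical and is not taken from make_boxes.sh.

```python
import os


def makebox_command(tif_path, lang=None):
    """Build the argument list for a tesseract makebox run.

    With lang=None the "-l" option is left out, which is the form to use
    when no <lang>.traineddata exists yet.
    """
    # Output base is the file name without its directory and ".tif" suffix.
    base = os.path.splitext(os.path.basename(tif_path))[0]
    cmd = ["tesseract", tif_path, base]
    if lang:
        cmd += ["-l", lang]
    cmd += ["batch.nochop", "makebox"]
    return cmd


# Bootstrapping: no "-l" at all.
print(" ".join(makebox_command("hye.Dejavu_Serifbold.exp0.tif")))
# After a first traineddata file exists for the language.
print(" ".join(makebox_command("hye.Dejavu_Serifbold.exp0.tif", lang="hye")))
```

The returned list can be handed to subprocess.call() once tesseract is on PATH.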

I don't suggest using auto_train.sh at this stage; you will need to edit the boxfiles generated by make_boxes.sh before continuing the training process, so I suggest running make_boxes.sh on its own, and then using merge_boxes.py and align_boxfile.py along with manual editing to get the boxfiles in order before continuing with the training process. I've made some small modifications to the scripts and README to make this clear, so I suggest doing 'git pull' to get the latest copy.
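
When editing boxfiles by hand, a small sanity check can help spot broken lines. The sketch below assumes the Tesseract 3 box format of one "symbol left bottom right top page" entry per line, with coordinates measured from the bottom-left corner; it is not part of merge_boxes.py or align_boxfile.py, and the function names are made up for illustration.

```python
def parse_box_line(line):
    """Parse one boxfile line into (symbol, left, bottom, right, top, page)."""
    # Split from the right so the five numeric fields are always separated.
    parts = line.rstrip("\n").rsplit(None, 5)
    if len(parts) != 6:
        raise ValueError("expected 6 fields: %r" % line)
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return symbol, left, bottom, right, top, page


def suspicious_boxes(lines):
    """Return 1-based line numbers of boxes with zero or negative width or height."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        _, left, bottom, right, top, _ = parse_box_line(line)
        if right <= left or top <= bottom:
            bad.append(number)
    return bad


sample = ["a 10 20 30 40 0", "b 50 20 50 40 0"]
print(suspicious_boxes(sample))
```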

Hope that helps!

Derek

shikamuk

Jun 6, 2012, 2:26:40 PM
to tesseract-ocr
> Hello from neighboring Georgia!
Yay!
Thank you, I'll do a git pull and give it a try!
Not right now, though, because I have a heavy workload.

I've also already noticed that without "-l" I can get it to work.
Thank you again, I guess I may have further questions.

მადლობტ

ლორაირ

Sriranga(78yrs)

Jun 23, 2012, 4:12:41 AM
to tesser...@googlegroups.com
It will work on Windows XP, provided you have installed Python 2.7 and PIL.

On Sat, Jun 23, 2012 at 1:21 PM, blavatsky3 <nine.eleven.is...@gmail.com> wrote:
Hello Balthazar,

Do you have a Windows version of your box trainer for tesseract?

kind regards

Richard


On Saturday, June 9, 2012 4:17:11 AM UTC+10, Balthazar Rouberol wrote:
Hello all,

I've written a small Python tool that takes over the training process, as well as the tif (multipage supported) and boxfile generation: https://github.com/BaltoRouberol/TesseractTrainer

This can be useful when you want to train Tesseract on a given font, and you thus have to create the tif yourself.
With this tool, you specify a text, a font (among other things) and a multipage tif containing your text/font will then be generated, along with the corresponding boxfile.
This allows you to be 100% sure of the boxfile accuracy, and skip the boxfile checking process.
The training process can now be fully automated, from the tif generation to the traineddata file combination.

I'll be happy to get feedback from you!

Balthazar
