Simplest steps to train tesseract

Vincent

Jan 9, 2008, 1:16:43 PM

to tesseract-ocr

I have learned a lot here, so I would like to give something back.
Below are the simplest steps to train Tesseract. For more details,
see: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract

Steps for training Tesseract:

1. Generate Training Images.
Print the sample document and scan it at 300 DPI as a black-and-white
TIFF image, for example "scan.tif".

2. Make Box Files
Run "tesseract scan.tif scan batch.nochop makebox".
This generates the file "scan.txt". Check this file and correct any
mistakes, then rename "scan.txt" to "scan.box".
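
Before moving on, a quick sanity check of the hand-edited box file can catch broken lines. The sketch below assumes the five-field layout (character, left, bottom, right, top, separated by spaces) that box files of this Tesseract generation use; other versions may add fields. The function name check_box is made up for illustration.

```shell
# check_box FILE: report lines that do not look like
#   <char> <left> <bottom> <right> <top>
# Assumption: exactly five space-separated fields, coordinates are integers.
check_box() {
    awk '
        BEGIN { bad = 0 }
        NF != 5 {
            printf "line %d: expected 5 fields, got %d\n", NR, NF
            bad = 1
            next
        }
        {
            for (i = 2; i <= 5; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: field %d is not an integer\n", NR, i
                    bad = 1
                }
        }
        END { exit bad }   # exit status 1 if any line was rejected
    ' "$1"
}
```

Usage: check_box scan.box && echo "scan.box looks well-formed"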

3. Run Tesseract for Training
Run "tesseract scan.tif junk nobatch box.train".
This generates the file "scan.tr".

4. Clustering
Run "mftraining scan.tr".
This generates the files "inttemp", "pffmtable" and "Microfeat" (not
used).

Run "cnTraining scan.tr".
This generates the file "normproto".

5. Compute the Character Set
Run "unicharset_extractor scan.box".
This generates the file "unicharset".

6. Dictionary Data
Create two UTF-8 text files, "frequent_words_list" and "words_list".
Neither file should contain duplicate words.
Run "wordlist2dawg frequent_words_list freq-dawg".
Run "wordlist2dawg words_list word-dawg".
This generates two files, "freq-dawg" and "word-dawg".
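
Since the word lists must not contain duplicates, it may help to clean them before running wordlist2dawg. A minimal sketch, assuming one word per line; dedupe_words is a hypothetical helper. It keeps the first occurrence of each word, drops empty lines, and preserves the original order. It does not normalise case or trailing whitespace.

```shell
# dedupe_words FILE: print FILE with blank lines and repeated words removed.
# Assumption: one word per line; comparison is exact (case-sensitive).
dedupe_words() {
    awk 'NF && !seen[$0]++' "$1"
}
```

Usage: dedupe_words words_list > words_list.clean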

7. Putting it all together
All you need to do now is collect all 8 files and rename them with a
language prefix, for example "eng.".
The files "eng.DangAmbigs" and "eng.user-words" may be empty.
If you do create "eng.DangAmbigs", every character in it must also
appear in "scan.box".

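The collecting and renaming in step 7 could be scripted roughly as follows. This is a sketch under assumptions: the eight names are the outputs of steps 4 to 6 plus DangAmbigs and user-words, the function name collect_lang is invented, and the destination is whatever directory your Tesseract install reads language data from (the post does not name it).

```shell
# collect_lang LANG DEST: copy training outputs from the current directory
# into DEST as LANG.<name>. DangAmbigs and user-words are created empty
# when they are absent, since the post says they may be empty.
collect_lang() {
    lang=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for f in inttemp pffmtable normproto unicharset freq-dawg word-dawg; do
        if [ ! -f "$f" ]; then
            echo "missing training output: $f" >&2
            return 1
        fi
        cp "$f" "$dest/$lang.$f" || return 1
    done
    for f in DangAmbigs user-words; do
        if [ -f "$f" ]; then
            cp "$f" "$dest/$lang.$f" || return 1
        else
            : > "$dest/$lang.$f"   # create an empty placeholder
        fi
    done
}
```

Usage: collect_lang eng /path/to/language/data
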
8. Try it
Run "tesseract scan.tif output -l eng".
The result is written to "output.txt".
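
Once "scan.box" has been corrected by hand, steps 3 to 6 can be chained in one script. The sketch below only repeats the commands as written in this post; the helper names run and train_steps and the DRY_RUN switch are made up for illustration, and the command names (for example the casing of cnTraining) may differ on your platform.

```shell
# run CMD...: execute CMD, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# train_steps BASE: steps 3 to 6 for BASE.tif and BASE.box.
# Assumes frequent_words_list and words_list exist in the current directory.
# Stops at the first command that fails.
train_steps() {
    base=$1
    run tesseract "$base.tif" junk nobatch box.train &&
    run mftraining "$base.tr" &&
    run cnTraining "$base.tr" &&
    run unicharset_extractor "$base.box" &&
    run wordlist2dawg frequent_words_list freq-dawg &&
    run wordlist2dawg words_list word-dawg
}
```

Usage: set DRY_RUN=1 and call train_steps scan to preview the commands, then unset DRY_RUN to execute them.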

Vincent

Jan 9, 2008, 1:26:25 PM

to tesseract-ocr

However, I still have one problem:

tessdll cannot be called twice: the first call works, but the second
call crashes.

Workaround: make a fresh call every time. That is, load tessdll into
memory each time you use it, then unload it completely afterwards.

Any other solution?

Ray Smith

Jan 9, 2008, 10:18:15 PM

to tesser...@googlegroups.com

Yes! As luck would have it, I have just been working on fixing problems with multiple calls to init/end, so the next release will fix this problem.
Ray.

Luca

Jan 12, 2008, 5:16:58 PM

to tesseract-ocr

In wordrec/msmenus.cpp, line 69, comment out the assignment so that it
reads:
//first_time = 0;
Cheers

Ray Smith

Jan 12, 2008, 9:04:57 PM

to tesser...@googlegroups.com

Thanks, that stops the crash. There are also memory leaks if you call init multiple times by mistake, and it crashes if you call end twice. I have fixed those too.

Vincent

Jan 14, 2008, 9:42:41 AM

to tesseract-ocr

Thanks Luca, great job. It works.

Vincent
