I learned a lot here, so I would like to give something back.
The following are the simplest steps to train Tesseract; for more details,
see:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
Steps for training Tesseract:
1. Generate Training Images.
Print the sample document and scan it as a 300 DPI black-and-white TIF
image, named, say, "scan.tif".
2. Make Box Files
Run "tesseract scan.tif scan batch.nochop makebox";
This will generate the file "scan.txt"; check this file and correct any
mistakes, then rename "scan.txt" to "scan.box";
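When correcting the box file by hand, a quick sanity check can catch malformed lines before training. The sketch below assumes each line has the form "character left bottom right top", optionally followed by a page number as in newer Tesseract versions. The function name check_box_lines is made up for illustration and is not part of Tesseract.

```python
def check_box_lines(lines):
    """Return a list of (line_number, problem) for malformed box lines.

    Assumed format per line: <char> <left> <bottom> <right> <top> [<page>]
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split()
        if len(parts) not in (5, 6):
            problems.append((number, "expected 5 or 6 fields"))
            continue
        try:
            left, bottom, right, top = (int(p) for p in parts[1:5])
        except ValueError:
            problems.append((number, "coordinates must be integers"))
            continue
        # Assumption: the origin is the bottom-left corner of the image,
        # so a usable box has left < right and bottom < top.
        if left >= right or bottom >= top:
            problems.append((number, "empty or inverted box"))
    return problems
```

It can be called on an open file, for example check_box_lines(open("scan.box", encoding="utf-8")), and any reported lines fixed before moving on to step 3.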
3. Run Tesseract for Training
Run "tesseract scan.tif junk nobatch box.train";
This will generate the file "scan.tr";
4. Clustering
Run "mftraining scan.tr";
This will generate the files "inttemp", "pffmtable" and "Microfeat" (not
used);
Run "cntraining scan.tr";
This will generate the file "normproto";
5. Compute the Character Set
Run "unicharset_extractor scan.box";
This will generate file "unicharset"
6. Dictionary Data
Create two UTF-8 text files, "frequent_words_list" and "words_list";
neither file should contain duplicate words;
Run "wordlist2dawg frequent_words_list freq-dawg"
Run "wordlist2dawg words_list word-dawg";
This will generate two files, "freq-dawg" and "word-dawg";
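Since the word lists must not contain duplicates, it may help to clean them before running wordlist2dawg. The following is only a sketch: unique_words and dedupe_file are hypothetical helper names, and it assumes one word per line.

```python
def unique_words(words):
    """Return the words with blanks and repeats removed, first occurrence kept."""
    seen = set()
    result = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def dedupe_file(src_path, dst_path):
    """Write a duplicate-free UTF-8 copy of src_path to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        words = unique_words(src)
    with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for word in words:
            dst.write(word + "\n")
```

For example, dedupe_file("words_raw.txt", "words_list") would write a cleaned "words_list" (the input name "words_raw.txt" is just a placeholder).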
7. Putting it all together
All you need to do now is collect all 8 files together and rename
them with a language prefix (for example "eng.");
The files "eng.DangAmbigs" and "eng.user-words" may be empty;
If you create an "eng.DangAmbigs" file, the characters in it must exist in
"scan.box";
8. Try it
Run "tesseract scan.tif output -l eng"
The file "output.txt" is the result;
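For step 7, collecting and renaming the files can be scripted. This is a minimal sketch, not an official tool: collect_language_files is a hypothetical helper, and dst_dir is assumed to be whatever directory your Tesseract installation reads language data from (usually its tessdata directory). Missing "DangAmbigs" and "user-words" files are created empty, as the post allows.

```python
import os
import shutil

# The six files produced by steps 4 to 6 of this post.
GENERATED = ["inttemp", "pffmtable", "normproto",
             "unicharset", "freq-dawg", "word-dawg"]
# The two files that the post says may be empty.
OPTIONAL = ["DangAmbigs", "user-words"]


def collect_language_files(src_dir, dst_dir, lang="eng"):
    """Copy the training output into dst_dir as <lang>.<name> files."""
    os.makedirs(dst_dir, exist_ok=True)
    created = []
    for name in GENERATED:
        target = os.path.join(dst_dir, lang + "." + name)
        shutil.copyfile(os.path.join(src_dir, name), target)
        created.append(target)
    for name in OPTIONAL:
        source = os.path.join(src_dir, name)
        target = os.path.join(dst_dir, lang + "." + name)
        if os.path.exists(source):
            shutil.copyfile(source, target)
        else:
            open(target, "w").close()  # create an empty placeholder file
        created.append(target)
    return created
```

Run it from the directory holding the training output, for example collect_language_files(".", "tessdata", "eng"), before trying step 8.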