My system info:
- OS: Ubuntu Desktop 18.04 LTS (4.15.0-55-generic)
Hi.
I am a beginner and am trying to train Tesseract on Korean character images for Korean recognition.
To understand how to train with Tesseract 4.0 LSTM, I followed Tesstrain.
I followed the lines of the Makefile in Tesstrain step by step, and most of the steps seemed to work fine until I created the traineddata.
In detail:
1. I made box files and a unicharset by following these lines.
3. I made two split file lists for training and evaluation by following these lines.
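For reference, the split in step 3 can be sketched roughly as follows. This is only a minimal illustration, not the actual Makefile lines: the function name `split_lstmf_list`, the percentage argument, and the input list file are assumptions made for this sketch; only the output names `list.train` and `list.eval` match the files used here.

```shell
#!/bin/sh
# Sketch: split a list of .lstmf paths into training and evaluation lists.
# The function name and its arguments are hypothetical, chosen for illustration.
split_lstmf_list() {
    all_list=$1       # file with one .lstmf path per line
    out_dir=$2        # directory that receives list.train and list.eval
    pct=${3:-90}      # percentage of lines used for training (assumed default)

    total=$(wc -l < "$all_list")
    # Integer number of training lines, truncated toward zero.
    n_train=$((total * pct / 100))

    # The first n_train lines become the training list.
    head -n "$n_train" "$all_list" > "$out_dir/list.train"
    # Every remaining line becomes the evaluation list.
    tail -n +"$((n_train + 1))" "$all_list" > "$out_dir/list.eval"
}
```

Called as `split_lstmf_list all-lstmf data/kor 90`, this would write `data/kor/list.train` and `data/kor/list.eval` with no line appearing in both lists.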
4. Before combining the language model, I downloaded radical-stroke.txt by following this line, and three langdata files (kor.punc, kor.numbers, and kor.wordlist) from this link.
I didn't download the kor.config file because it causes an error saying that chi_tra.traineddata is needed.
6. Then I started LSTM training by following these lines.
7. I tested the checkpoint with lstmeval. The results look like this:
lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmeval --traineddata data/kor/kor.traineddata --model data/kor/checkpoints/kor_checkpoint --eval_listfile data/kor/list.eval
data/kor/checkpoints/kor_checkpoint is not a recognition model, trying training checkpoint...
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp249.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp228.lstmf
Truth:먹
OCR :이
Truth:독
OCR :이
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp197.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp41.lstmf
Truth:파
OCR :이
Truth:신
OCR :열
... (skip)
At iteration 0, stage 0, Eval Char error rate=133.33333, Word error rate=96.875
There seems to be no problem with the results.
8. I made the traineddata output file.
lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmtraining --stop_training \
--continue_from data/kor/checkpoints/kor_checkpoint \
--traineddata data/kor/kor.traineddata \
--model_output usr/share/tessdata/kor.traineddata
9. Then I ran tesseract on kor.malgun.exp197.tif. This TIF file was recognized as '이' in step 7 (testing with lstmeval), so I expected the same result.
lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result
But the actual result was a complete mess. This is the result:

Why are the results of `lstmeval` and `tesseract` different?
Thank you...