Why are the results of lstmeval and tesseract different?

J L

Oct 10, 2019, 1:56:28 AM

to tesseract-ocr

My system info:

- OS: Ubuntu Desktop 18.04 LTS (4.15.0-55-generic)

Hi.

I am a beginner, and I am trying to train Tesseract on Korean character images for Korean recognition.

To understand how to train with Tesseract 4.0 LSTM, I followed Tesstrain.

I followed the lines of the Makefile in Tesstrain step by step, and most of the steps seemed to work fine until I created the traineddata.

In detail:

1. I made the box files and the unicharset by following these lines.

2. I made the lstmf files by following these lines.

3. I made two split file lists, one for training and one for evaluation, by following these lines.

4. Before combining the language model, I downloaded radical-stroke.txt by following this line, and three langdata files (kor.punc, kor.numbers, and kor.wordlist) from this link.

I didn't download the kor.config file because it causes an error saying that chi_tra.traineddata is needed.

5. I combined the language model by following these lines.

6. Then I started LSTM training by following these lines.

7. I tested the checkpoint with lstmeval. The results look like this:

lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmeval --traineddata data/kor/kor.traineddata --model data/kor/checkpoints/kor_checkpoint --eval_listfile data/kor/list.eval

data/kor/checkpoints/kor_checkpoint is not a recognition model, trying training checkpoint...

Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp249.lstmf

Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp228.lstmf

Truth:먹

OCR :이

Truth:독

OCR :이

Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp197.lstmf

Loaded 1/1 lines (1-1) of document data/ground-truth/kor.malgun.exp41.lstmf

Truth:파

OCR :이

Truth:신

OCR :열

... (skip)

At iteration 0, stage 0, Eval Char error rate=133.33333, Word error rate=96.875

There seems to be no problem with the results.

8. I created the traineddata output file.

lim@ubuntu:~/tools/tesstrain$ usr/bin/lstmtraining --stop_training \

--continue_from data/kor/checkpoints/kor_checkpoint \

--traineddata data/kor/kor.traineddata \

--model_output usr/share/tessdata/kor.traineddata

9. Then I ran tesseract on kor.malgun.exp197.tif. In step 7 (testing with lstmeval), this TIF file was recognized as '이', so I expected the same result.

lim@ubuntu:~/tools/tesstrain$ usr/bin/tesseract data/ground-truth/kor.malgun.exp197.tif stdout -l kor --psm 6 > result

But the actual result was a total mess. This is the result:

Why are the results of `lstmeval` and `tesseract` different?

Thank you...

Shree Devi Kumar

Oct 10, 2019, 7:23:09 AM

to tesseract-ocr

I suggest that you open an issue in the tesstrain repo.

The Makefile trains from scratch. Is that what you wanted? Do you have a large enough training text (how many lines)? How many iterations did you train for?
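The data-size question can be answered by counting the entries in the list files from step 3. A minimal sketch, assuming the training list is named `data/kor/list.train` (an assumed name; only `data/kor/list.eval` appears in the commands above) and that each non-empty line names one `.lstmf` file. `count_entries` is a helper introduced only for this example:

```shell
#!/bin/sh
# count_entries: hypothetical helper that prints the number of
# non-empty lines in a list file (one .lstmf path per line is assumed).
count_entries() {
    # grep -c counts matching lines; '.' matches any line with at least one character.
    grep -c . "$1"
}

# data/kor/list.train is an assumed file name; data/kor/list.eval is from the post.
for list in data/kor/list.train data/kor/list.eval; do
    if [ -f "$list" ]; then
        printf '%s: %s entries\n' "$list" "$(count_entries "$list")"
    else
        printf '%s: not found\n' "$list"
    fi
done
```

Since each `.lstmf` file in the log holds a single line, the entry count is a rough stand-in for the number of training lines.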

Eval Char error rate=133.33333, Word error rate=96.875

That is a very high error rate. You need to get it down to 0%.
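To compare like with like, `lstmeval` can be pointed at the exported `kor.traineddata` instead of the checkpoint, and `tesseract` can be run against the same file under a few page segmentation modes. A rough sketch using the paths from the commands above. The `--psm` values tried here are an assumption (the ground-truth images appear to hold a single character each), and this is a way to narrow the problem down, not a confirmed diagnosis. `run` is a helper introduced for this example that only prints each command unless `DRY_RUN=0` is set:

```shell
#!/bin/sh
# run: hypothetical helper. Prints the command by default and
# executes it only when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Paths taken from the commands in the post (relative to ~/tools/tesstrain).
TESSDATA=usr/share/tessdata
IMAGE=data/ground-truth/kor.malgun.exp197.tif

# 1. Evaluate the exported model on the same eval list used for the checkpoint.
#    As far as I know, --traineddata is only required when --model is a checkpoint.
eval_exported() {
    run usr/bin/lstmeval --model "$TESSDATA/kor.traineddata" \
        --eval_listfile data/kor/list.eval
}

# 2. Recognize one image with an explicit tessdata directory, so the freshly
#    exported file is the one loaded, and try several segmentation modes:
#    6 = uniform block, 7 = single line, 10 = single character, 13 = raw line.
recognize() {
    for psm in 6 7 10 13; do
        run usr/bin/tesseract "$IMAGE" stdout \
            --tessdata-dir "$TESSDATA" -l kor --psm "$psm"
    done
}

eval_exported
recognize
```

If the exported model scores about the same as the checkpoint in `lstmeval`, the remaining difference is more likely to come from how `tesseract` segments the page than from the model itself.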

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/074b17ee-cb7c-49a2-a653-1180f6190254%40googlegroups.com.


J L

Oct 10, 2019, 9:08:14 AM

to tesseract-ocr

Okay, I will do as you suggested.

Thank you for answering my question.

On Thursday, October 10, 2019 at 8:23:09 PM UTC+9, shree wrote:

