LSTM training from scratch for the Khmer language - Legacy Limon fonts

phyrum sk

Nov 21, 2017, 6:47:14 AM

to tesseract-ocr

Khmer has two types of fonts:

1. Pre-Unicode (Limon) fonts: 19 fonts such as Limon F1, Limon F2, etc.

2. Unicode fonts such as Khmer OS, Khmer OS Battambong, etc.

I tested the latest Tesseract tessdata, i.e. khm.traineddata from tessdata_fast and from tessdata_best (the tessdata_fast model gave better accuracy), and measured accuracy using the ISRI tools.

Average character and cluster accuracy is above 80% for Khmer Unicode fonts, but only around 60% for Pre-Unicode (Limon) fonts.

Most old Khmer documents (e.g. legal documents), whether images or PDF files, were produced with Khmer legacy (Limon) fonts. Our goal is therefore to fine-tune the existing Tesseract khm.traineddata with Khmer Pre-Unicode fonts, so that the OCR engine can recognize those images and output Unicode text with better accuracy.

A. Problem 1: Since I am new to Tesseract and LSTM neural networks, I wanted to understand how Tesseract 4.0 lstmtraining works, so I followed the Tesseract wiki tutorial and tried to train an LSTM model from scratch using the Khmer OS font and Tesseract's langdata for Khmer.

Here are the commands I used:

training/tesstrain.sh --fonts_dir /usr/share/fonts/truetype/khmeros-ttf --fontlist "Khmer OS" --langdata_dir langdata --lang khm --linedata_only --noextract_font_properties --tessdata_dir /home/phyrum/tesseract/tessdata --output_dir khmtrain

training/lstmtraining --debug_interval -1 \
  --traineddata khmtrain/khm/khm.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output khmoutput/base --learning_rate 20e-4 \
  --train_listfile khmtrain/khm.training_files.txt \
  --max_iterations 10000 &> khmoutput/basetrain.log

=> After lstmtraining finished, I got only "basetrain.log" and "base_checkpoint", but I cannot find the final "khm.traineddata", and I don't know why. Could you please help me?

Note: I am using Ubuntu 16.04 LTS, and my Tesseract version is 4.00.00dev-691-gfb359fc.

B. Problem 2: In Pre-Unicode (Limon) fonts, each character is represented by Latin-based codes rather than Unicode code points.

Process:

- I converted langdata/khm/khm.training_text and the other files in langdata to Limon encoding.

- I ran training/tesstrain.sh and training/lstmtraining (the same commands as above, with only the --fontlist and --fonts_dir parameters changed).

- The resulting .box files and unicharset were Latin-based, so the model cannot output the recognized text as Unicode.

Is this the correct way to do it? Could you please give me some advice? Or is there anywhere in the Tesseract source code that should be modified so that I can train with input text in Limon format while the OCR output is Khmer Unicode text?

Thanks with best regards,

Phyrum

tesseract_version.png

training_lstmtraining_tesseract.png

training_tesseract.png

Li Xianglei

Nov 22, 2017, 10:35:46 PM

to tesseract-ocr

Hi, I'm new to Tesseract too, and I'm also working on fine-tuning.

I hope this helps.

=> After lstmtraining finished, I got only "basetrain.log" and "base_checkpoint" but I cannot find the final "khm.traineddata". and I really don't know why. Could you please help me?

You can use the --stop_training flag as follows to create the traineddata file:

lstmtraining --stop_training --continue_from trainhalfwidth/jpnhw_checkpoint --traineddata tessdata/jpn.traineddata --model_output trainhalfwidth/jpnhw.traineddata
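Adapted to the paths from the original post, the same conversion might look like the sketch below. The checkpoint and starter traineddata paths are taken from the first message; the output name khmoutput/khm.traineddata and the LSTMTRAINING variable are assumptions made for illustration, so adjust them to your setup.

```shell
#!/bin/sh
# Sketch: convert the last training checkpoint into a final traineddata file.
# CHECKPOINT and START_MODEL come from the original post;
# FINAL_MODEL is an assumed output name, not something the thread specifies.
LSTMTRAINING=${LSTMTRAINING:-training/lstmtraining}
CHECKPOINT=khmoutput/base_checkpoint
START_MODEL=khmtrain/khm/khm.traineddata
FINAL_MODEL=khmoutput/khm.traineddata

# Build the command line once so it can be inspected before it is run.
CMD="$LSTMTRAINING --stop_training --continue_from $CHECKPOINT --traineddata $START_MODEL --model_output $FINAL_MODEL"
echo "$CMD"

# Run it only when both the binary and the checkpoint are actually present.
if [ -x "$LSTMTRAINING" ] && [ -f "$CHECKPOINT" ]; then
    $CMD
else
    echo "skipped: $LSTMTRAINING or $CHECKPOINT not found" >&2
fi
```

Note that --traineddata here points at the starter file that tesstrain.sh generated for this training run (the one passed to lstmtraining when training started), so the unicharset matches the checkpoint.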

Besides, I suggest you first try the "Fine Tuning for Impact" example.

Training from scratch did not seem to work for me, because the provided langdata is meant for Tesseract 3 training.

P.S. Sorry for my bad English.

On Tuesday, November 21, 2017 at 7:47:14 PM UTC+8, phyrum sk wrote:

phyrum sk

Nov 24, 2017, 3:58:28 AM

to tesseract-ocr

I can generate the final traineddata now. Thank you very much, Li Xianglei.
