Khmer has two types of fonts:
1. Pre-Unicode (Limon) fonts: 19 fonts such as Limon F1, Limon F2, etc.
2. Unicode fonts such as Khmer OS, Khmer OS Battambang, etc.
I tested the latest Tesseract tessdata, i.e. khm.traineddata_fast and khm.traineddata_best (khm.traineddata_fast gave better accuracy), and measured the accuracy with the ISRI tools.
The average character and cluster accuracy is above 80% for Khmer Unicode fonts, but only around 60% for Pre-Unicode (Limon) fonts.
Most old Khmer (law) documents (image and PDF files) were produced with Khmer legacy fonts (Limon), so our goal is to fine-tune the existing Tesseract khm.traineddata with Khmer Pre-Unicode fonts, so that the OCR engine can recognize those image files and output Unicode text with better accuracy.
A. Problem 1: Since I am new to Tesseract and LSTM neural networks, I wanted to understand how Tesseract 4.0 LSTM training works, so I followed the Tesseract wiki tutorial and tried to train an LSTM from scratch using the Khmer OS font and Tesseract's langdata for Khmer.
Here are the commands I used:
training/tesstrain.sh --fonts_dir /usr/share/fonts/truetype/khmeros-ttf \
--fontlist "Khmer OS" --langdata_dir langdata --lang khm \
--linedata_only --noextract_font_properties \
--tessdata_dir /home/phyrum/tesseract/tessdata --output_dir khmtrain
training/lstmtraining --debug_interval -1 \
--traineddata khmtrain/khm/khm.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output khmoutput/base --learning_rate 20e-4 \
--train_listfile khmtrain/khm.training_files.txt \
--max_iterations 10000 &> khmoutput/basetrain.log
=> After lstmtraining finished, I have only "basetrain.log" and "base_checkpoint"; I cannot find the final "khm.traineddata", and I don't know why. Could you please help me?
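One thing that may be relevant: the training tutorial describes a separate `lstmtraining --stop_training` step that converts a checkpoint into a traineddata file. Below is a minimal sketch of that step, assuming the paths from the command above. The output name khmoutput/khm.traineddata and the `DRY_RUN` switch are placeholders introduced for this sketch, not part of Tesseract.

```shell
#!/bin/sh
# Paths reuse the training command above. OUTPUT and DRY_RUN are
# placeholders introduced for this sketch, not part of Tesseract.
CHECKPOINT=khmoutput/base_checkpoint
STARTER=khmtrain/khm/khm.traineddata
OUTPUT=khmoutput/khm.traineddata

# Convert the checkpoint into a traineddata file that tesseract can load.
finalize_model() {
  set -- training/lstmtraining --stop_training \
    --continue_from "$CHECKPOINT" \
    --traineddata "$STARTER" \
    --model_output "$OUTPUT"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"    # only print the command that would be run
  else
    "$@"         # actually run lstmtraining
  fi
}
```

Calling `DRY_RUN=1 finalize_model` only prints the command, while a plain `finalize_model` runs it from the Tesseract source directory.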
Note: I am using Ubuntu 16.04 LTS, and my Tesseract version is 4.00.00dev-691-gfb359fc.
B. Problem 2: In Pre-Unicode (Limon) fonts, each character is represented by a Latin-based code rather than a Unicode code point.
Process:
- I converted langdata/khm/khm.training_text and the other files in langdata to the Limon encoding.
- I ran training/tesstrain.sh and training/lstmtraining (the same commands as above, except that the --fontlist and --fonts_dir parameters were changed).
- The resulting .box files and unicharset were Latin-based, so the recognized output is not Unicode text.
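For concreteness, the data-generation step for Limon looks roughly like the sketch below. It is the tesstrain.sh call from Problem 1 with only --fonts_dir and --fontlist changed. The fonts directory is a placeholder (not a real path), "Limon F1" stands in for any of the legacy fonts, and `DRY_RUN` is a helper switch added only for this sketch.

```shell
#!/bin/sh
# LIMON_FONTS_DIR is a placeholder path and DRY_RUN is a helper switch;
# both are assumptions made for this sketch.
LIMON_FONTS_DIR=${LIMON_FONTS_DIR:-/path/to/limon-fonts}

# Same tesstrain.sh call as for Khmer OS, but rendering with a Limon font.
run_limon_tesstrain() {
  set -- training/tesstrain.sh \
    --fonts_dir "$LIMON_FONTS_DIR" --fontlist "Limon F1" \
    --langdata_dir langdata --lang khm \
    --linedata_only --noextract_font_properties \
    --tessdata_dir /home/phyrum/tesseract/tessdata \
    --output_dir khmtrain
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"    # only print the command that would be run
  else
    "$@"         # actually run tesstrain.sh
  fi
}
```

Because the training text passed to this step is Limon-encoded, the generated .box files and unicharset contain Latin characters, which is the mismatch described in the last bullet.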
Is this the correct approach? Could you please give me some advice? Or is there anywhere in the Tesseract source code that should be modified so that I can train with Limon-encoded input text while the OCR output is Khmer Unicode text?
Thanks and best regards,
Phyrum