Experiment with Thai language

sanparith marukatat

Aug 31, 2018, 3:47:59 AM

to tesseract-ocr

Hi everyone,

I have been playing with Tesseract for the Thai language for a while. The performance of the default LSTM model is good. However, I would like to know whether I can improve it further.

First I tried to retrain the model but ran into problems. I also tried replacing the top layer, without success either. I think this is due to the unicharset (but I am not sure; I forgot the error messages). So I ended up training the model from scratch. I now have a working model, but it does not reach the performance of the default model. Please give me some advice on how to improve its accuracy.

Here is how I did it.

I used common Thai fonts (Tahoma, Sarabun, Angsana, Browallia, Cordia, Dillenia, Iris) together with fonts arbitrarily picked from http://www.thaisignmaker.com/korkhorkore/?catalog/all/-/date/1

In total, 65 fonts were selected to train the new model.

I downloaded Thai training text, i.e. 'tha.training_text', from https://github.com/tesseract-ocr/langdata/blob/master/tha/tha.training_text

I observed that much of the text in this file is gibberish. I think that the default model was built from this text file, so I used it as well.

I used 'text2image' to generate training data by varying 3 exposures (-1,0,1), 2 conditions (normal, degraded), and 2 dpi (300, 400). From 'tha.training_text' and 65 fonts, I obtained 900,000+ lines to train the model.
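
The rendering grid described above can be enumerated as in the sketch below. It only builds the command lines and does not run them. The font list is a truncated sample, the output naming scheme is made up for illustration, and the text2image flags used here (--font, --text, --outputbase, --exposure, --resolution, --degrade_image) should be checked against 'text2image --help' for the installed version.

```python
from itertools import product

# Illustrative subset; the actual run used 65 fonts.
FONTS = ["Tahoma", "Sarabun", "Angsana New"]
EXPOSURES = [-1, 0, 1]
DEGRADE = [False, True]      # normal vs. degraded rendering
RESOLUTIONS = [300, 400]     # dpi


def build_commands(fonts, text_file="tha.training_text"):
    """Return one text2image argument list per (font, exposure, degrade, dpi)."""
    commands = []
    for font, exp, degrade, dpi in product(fonts, EXPOSURES, DEGRADE, RESOLUTIONS):
        # Hypothetical output naming scheme, chosen only for this sketch.
        condition = "deg" if degrade else "norm"
        base = f"tha.{font.replace(' ', '_')}.exp{exp}.{condition}.{dpi}"
        commands.append([
            "text2image",
            f"--font={font}",
            f"--text={text_file}",
            f"--outputbase={base}",
            f"--exposure={exp}",
            f"--resolution={dpi}",
            f"--degrade_image={'true' if degrade else 'false'}",
        ])
    return commands


if __name__ == "__main__":
    cmds = build_commands(FONTS)
    print(len(cmds), "commands for", len(FONTS), "fonts")
    print(" ".join(cmds[0]))
```

Each font yields 3 x 2 x 2 = 12 renderings, so 65 fonts give 780 text2image runs over the same training text.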

I downloaded 'tha.traineddata' from https://github.com/tesseract-ocr/tessdata

I observed that 'tha.traineddata' contains two unicharsets i.e. 'tha.unicharset' and 'tha.lstm-unicharset'. As I am interested in LSTM model, I replaced 'tha.lstm-unicharset' with the new unicharset generated from box files using 'unicharset_extractor'.

Note that the help message of 'unicharset_extractor' says:

...

Where mode means:

1=combine graphemes (use for Latin and other simple scripts)

2=split graphemes (use for Indic/Khmer/Myanmar)

3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)

However, Thai has split graphemes such as "ะ", "แ", "ำ", "ญ", and "ฐ", so I called unicharset_extractor with "--norm_mode 2" instead of 3. I am not sure whether this is the correct setting for norm_mode.
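
One way to look at these five characters at the code-point level, independently of Tesseract, is Python's standard unicodedata module. The sketch below only reports Unicode properties; it makes no claim about how unicharset_extractor treats each norm_mode.

```python
import unicodedata

# The five characters named in the post, written as escapes to avoid
# any ambiguity about which code points are meant.
CHARS = ["\u0e30", "\u0e41", "\u0e33", "\u0e0d", "\u0e10"]


def describe(ch):
    """Return (code point, Unicode name, general category, decomposition)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch) or "-",
    )


if __name__ == "__main__":
    for ch in CHARS:
        print(ch, *describe(ch))
```

In the Unicode database all five are single code points of category Lo (letter, other), not combining marks. Only SARA AM (U+0E33) carries a compatibility decomposition, into NIKHAHIT (U+0E4D) followed by SARA AA (U+0E32).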

Then I used 'combine_tessdata' to replace 'tha.lstm-unicharset' in 'tha.traineddata'.

I trained the model using "lstmtraining --traineddata tha.traineddata --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c150]' ...".

I believe this means that I construct a NN with:

- input shape (1,36,0,1), i.e. batch size=1 (ignored), image height 36, variable width (0), and 1 channel (grayscale)

- Convolution with tanh of size 3x3, 16 filters

- Maxpooling 3x3

- LSTM forward in y-direction, with the output summarized into 48 values

- LSTM forward in x-direction with 96 outputs

- LSTM backward in x-direction with 96 outputs

- LSTM forward in x-direction with 256 outputs

- Output sequence of 150-dim vectors using softmax+CTC.

I copied the model from somewhere on the Internet and modified it. I still don't know what 'summarize' in the LSTM layer actually means.

(https://github.com/tesseract-ocr/tesseract/wiki/VGSLSpecs)
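
As a reading aid for the spec string, here is a small decoder for just the VGSL tokens that appear in it. It is a sketch based on the VGSLSpecs page linked above, not Tesseract's own parser, and it rejects any other token. According to that page, the input is given as batch,height,width,depth with 0 meaning a variable size, and the 's' in 'Lfys48' summarizes the y dimension by outputting only the final LSTM step, collapsing that dimension to a single element.

```python
import re

SPEC = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c150]"

ACTIVATION = {"s": "sigmoid", "t": "tanh", "r": "relu", "l": "linear", "m": "softmax"}
DIRECTION = {"f": "forward", "r": "reverse", "b": "bidirectional"}
OUTPUT = {"l": "logistic", "s": "softmax", "c": "softmax with CTC"}


def _size(n):
    """A zero dimension means 'variable size' in VGSL."""
    return "variable" if n == 0 else str(n)


def describe_layer(token):
    """Describe one token of the VGSL subset used in SPEC."""
    m = re.fullmatch(r"(\d+),(\d+),(\d+),(\d+)", token)
    if m:
        batch, height, width, depth = (int(g) for g in m.groups())
        return (f"input: batch={batch}, height={_size(height)}, "
                f"width={_size(width)}, depth={depth}")
    m = re.fullmatch(r"C([stlrm])(\d+),(\d+),(\d+)", token)
    if m:
        act, y, x, d = m.groups()
        return f"conv {y}x{x}, {d} filters, {ACTIVATION[act]}"
    m = re.fullmatch(r"Mp(\d+),(\d+)", token)
    if m:
        return f"maxpool {m.group(1)}x{m.group(2)}"
    m = re.fullmatch(r"L([frb])([xy])(s?)(\d+)", token)
    if m:
        direction, axis, summarize, n = m.groups()
        text = f"LSTM {DIRECTION[direction]} along {axis}, {n} outputs"
        if summarize:
            # Only the final step along this axis is kept.
            text += f", summarized (collapses {axis} to one element)"
        return text
    m = re.fullmatch(r"O([012])([lsc])(\d+)", token)
    if m:
        dims, kind, n = m.groups()
        return f"output: {dims}-d, {OUTPUT[kind]}, {n} classes"
    raise ValueError(f"token not covered by this sketch: {token!r}")


def describe_spec(spec):
    """Describe every whitespace-separated token of a bracketed spec."""
    return [describe_layer(tok) for tok in spec.strip("[]").split()]


if __name__ == "__main__":
    for line in describe_spec(SPEC):
        print(line)
```

Read this way, the network takes line images of height 36 and arbitrary width; the summarizing y-direction LSTM reduces each column to a 48-value vector, and the remaining x-direction LSTMs run over the resulting 1-d sequence.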

During the training I observed lots of warning messages such as

Encoding of string failed! Failure bytes: ffffffe0 ffffffb8 ffffff84 ffffffe0 ffffffb8 ffffffb8 ffffffe0 ffffffb8 ffffffa2 20 ffffffe0 ffffffb9 ffffff80 ffffffe0 ffffffb8 ffffff94 ffffffe0 ffffffb8 ffffffb5 ffffffe0 ffffffb8 ffffffa2 20 ffffffe0 ffffffb8 ffffffa3 ffffffe0 ffffffb8 ffffffb0 ffffffe0 ffffffb8 ffffff9a ffffffe0 ffffffb8 ffffff9a ffffffe0 ffffffb9 ffffff91 ffffffe0 ffffffb9 ffffff99 20 37 37 20 ffffffe0 ffffffb9 ffffff81 ffffffe0 ffffffb8 ffffffa5 ffffffe0 ffffffb8 ffffffb0 ffffffe0 ffffffb8 ffffffa1 ffffffe0 ffffffb8 ffffffb5 2e 22 20 ffffffe0 ffffffb8 ffffffa1 ffffffe0 ffffffb8 ffffffb4 ffffffe0 ffffffb9 ffffff80 ffffffe0 ffffffb8 ffffffa1 ffffffe0 ffffffb8 ffffffb7 ffffffe0 ffffffb8 ffffffad ffffffe0 ffffffb8 ffffff87

Can't encode transcription: 'คุย เดีย ระบบ๑๙ 77 และมี." มิเมือง' in language ''

I don't know what causes this kind of warning or how to solve it, so I just continued the training.
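
The "Failure bytes" dump can be turned back into text to see which characters are involved. The sketch below assumes each token is one UTF-8 byte printed as a sign-extended 32-bit hex value (hence the leading ffffff), so only the low 8 bits are kept. It does not determine why the encoding failed.

```python
import unicodedata


def decode_failure_bytes(dump):
    """Turn 'ffffffe0 ffffffb8 ... 20' into the UTF-8 string it encodes."""
    # Keep only the low byte of every token, then decode as UTF-8.
    raw = bytes(int(tok, 16) & 0xFF for tok in dump.split())
    return raw.decode("utf-8")


def list_code_points(text):
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# The first nine tokens of the dump quoted above.
SAMPLE = ("ffffffe0 ffffffb8 ffffff84 ffffffe0 ffffffb8 ffffffb8 "
          "ffffffe0 ffffffb8 ffffffa2")

if __name__ == "__main__":
    text = decode_failure_bytes(SAMPLE)
    print(text)
    for code, name in list_code_points(text):
        print(code, name)
```

The nine sample tokens decode to U+0E04, U+0E38, U+0E22, which is the first word of the transcription in the "Can't encode transcription" message. Listing the code points of a whole failing line makes it easier to compare them against the entries of the unicharset in use.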

I trained the model for 10M iterations and obtained 'newtha.lstm_checkpoint', which I converted to 'newtha.traineddata' using

'lstmtraining --stop_training --continue_from newtha.lstm_checkpoint --traineddata tha.traineddata --model_output newtha.traineddata'.

Then I put 'newtha.traineddata' in '/usr/local/share/tessdata/' and called it with 'tesseract -l newtha ...'.

I tested this model on images captured with a smartphone. The character-level accuracy is about 80%, while the default model gives about 95%. During the test, I also observed that the new model sometimes strangely fails to recognize text that seems easy, as shown below.
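
The post does not say how the character-level accuracy was measured. One common definition, assumed here purely for illustration, is 1 minus the edit distance between reference and OCR output divided by the reference length. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # character missing from the OCR output
                cur[j - 1] + 1,          # extra character in the OCR output
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def char_accuracy(ref, hyp):
    """Character-level accuracy in [0, 1] relative to the reference text."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


if __name__ == "__main__":
    # Made-up example strings, not taken from the test images.
    print(char_accuracy("abcde", "abXde"))
```

Python strings are compared per code point here, so a Thai vowel or tone mark counts as its own character; whether that matches the intended notion of "character" is a choice to make explicitly when comparing two models.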

What should I do next to improve the accuracy? Should I try changing the structure of the LSTM model, training on text with real meaning, or adding more fonts and other degradations such as Gaussian blur or salt-and-pepper noise?

Any suggestions are welcome and appreciated.

Thank you,

Sanparith

Shree Devi Kumar

Aug 31, 2018, 4:29:21 AM

to tesser...@googlegroups.com

A few points to note:

1. The langdata repo has training data for 3.04. Please use the langdata_lstm repo for LSTM training data.

2. To train from existing models, you need to use traineddata files from tessdata_best repo.

3. Use tesstrain.sh script to create the starter traineddata file to be used for training.

4. Build the latest beta.4 code from github and use that.
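
A rough sketch of how points 2 and 3 might translate into command lines, written as Python functions that only build argument lists and run nothing. Every path is a hypothetical placeholder, and the flags shown (tesstrain.sh --linedata_only, combine_tessdata -e, lstmtraining --continue_from / --old_traineddata) follow the Tesseract 4.00 training documentation as I understand it; please verify them against the --help output of the build in use.

```python
def starter_traineddata_cmd(lang, langdata_dir, tessdata_dir, fonts_dir, output_dir):
    """tesstrain.sh call that renders line data and writes a starter traineddata."""
    return [
        "tesstrain.sh",
        "--lang", lang,
        "--linedata_only",
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,    # checkout of langdata_lstm
        "--tessdata_dir", tessdata_dir,    # directory holding tessdata_best files
        "--fonts_dir", fonts_dir,
        "--output_dir", output_dir,
    ]


def extract_lstm_cmd(best_traineddata, lstm_out):
    """combine_tessdata call that extracts the LSTM model from a traineddata file."""
    return ["combine_tessdata", "-e", best_traineddata, lstm_out]


def finetune_cmd(lstm_model, best_traineddata, starter_traineddata,
                 listfile, output_prefix, max_iterations):
    """lstmtraining call that continues from the extracted model."""
    return [
        "lstmtraining",
        "--continue_from", lstm_model,
        # Documented for the case where the character set has changed.
        "--old_traineddata", best_traineddata,
        "--traineddata", starter_traineddata,
        "--train_listfile", listfile,
        "--model_output", output_prefix,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # All paths below are placeholders for illustration only.
    steps = [
        starter_traineddata_cmd("tha", "langdata_lstm", "tessdata_best",
                                "fonts", "train_out"),
        extract_lstm_cmd("tessdata_best/tha.traineddata", "train_out/tha.lstm"),
        finetune_cmd("train_out/tha.lstm", "tessdata_best/tha.traineddata",
                     "train_out/tha/tha.traineddata",
                     "train_out/tha.training_files.txt",
                     "train_out/newtha", 3000),
    ]
    for step in steps:
        print(" ".join(step))
```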

Shree Devi Kumar

Aug 31, 2018, 4:33:25 AM

to tesser...@googlegroups.com

>Then I used 'combine_tessdata' to replace 'tha.lstm-unicharset' in 'tha.traineddata'.

This will cause problems, as the dawgs, LSTM model and recoder in the traineddata will still be using the old lstm-unicharset.

Screen Shot 2561-08-31 at 11.19.53.png

sanparith marukatat

Aug 31, 2018, 5:11:44 AM

to tesseract-ocr

Thanks :)

Shree Devi Kumar

Aug 31, 2018, 8:14:58 AM

to tesser...@googlegroups.com

>Can't encode transcription: 'คุย เดีย ระบบ๑๙ 77 และมี." มิเมือง' in language ''

>I don't know what causes this kind of warning and how to solve it so I just continue the training.

These are related to normalization and validation of the training text. Please see https://github.com/tesseract-ocr/tesseract/blob/master/src/training/validate_grapheme.cpp for the rules applied for Thai.
