Hi everyone,
I have been playing with Tesseract for the Thai language for a while. The performance of the default LSTM model is good. However, I would like to know if I can further improve it.
First I tried to retrain the model but ran into problems. I also tried replacing the top layer, without success either. I think this is due to the unicharset (but I am not sure, as I forgot the error messages). So I ended up training the model from scratch. I now have a working model, but it cannot reach the same performance as the default model. Please give me some advice on how to improve the accuracy of the model.
Here is how I did it.
In total, 65 fonts were selected to train the new model.
I observed that much of the text in 'tha.training_text' is gibberish. I think that the default model was built from this text file, so I used it as well.
I used 'text2image' to generate training data by varying 3 exposures (-1,0,1), 2 conditions (normal, degraded), and 2 dpi (300, 400). From 'tha.training_text' and 65 fonts, I obtained 900,000+ lines to train the model.
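The rendering grid described above (3 exposures x 2 conditions x 2 resolutions for each font) could be enumerated along these lines. This is only a minimal sketch: the font name, the fonts directory and the output naming scheme are placeholders, and mapping "normal/degraded" onto the `--degrade_image` flag of 'text2image' is an assumption.

```python
from itertools import product

EXPOSURES = (-1, 0, 1)
DEGRADE = (False, True)      # assumed to correspond to "normal" / "degraded"
RESOLUTIONS = (300, 400)     # dpi


def text2image_commands(font, text="tha.training_text", fonts_dir="/path/to/fonts"):
    """Build one text2image argument list per (exposure, degrade, dpi) combination."""
    commands = []
    for exposure, degrade, dpi in product(EXPOSURES, DEGRADE, RESOLUTIONS):
        tag = "degraded" if degrade else "normal"
        # Placeholder naming scheme so that each variant gets its own output files.
        outputbase = f"tha.{font.replace(' ', '_')}.exp{exposure}.{tag}.{dpi}"
        commands.append([
            "text2image",
            f"--text={text}",
            f"--outputbase={outputbase}",
            f"--font={font}",
            f"--fonts_dir={fonts_dir}",
            f"--exposure={exposure}",
            f"--resolution={dpi}",
            f"--degrade_image={'true' if degrade else 'false'}",
        ])
    return commands


if __name__ == "__main__":
    cmds = text2image_commands("Example Thai Font")  # hypothetical font name
    print(len(cmds), "variants per font")
    print(" ".join(cmds[0]))
```

Each font yields 12 variants, so 65 fonts give 65 x 12 = 780 renderings of the training text.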
I observed that 'tha.traineddata' contains two unicharsets, i.e. 'tha.unicharset' and 'tha.lstm-unicharset'. As I am interested in the LSTM model, I replaced 'tha.lstm-unicharset' with the new unicharset generated from the box files using 'unicharset_extractor'.
Note that the help message of 'unicharset_extractor' says:
...
Where mode means:
1=combine graphemes (use for Latin and other simple scripts)
2=split graphemes (use for Indic/Khmer/Myanmar)
3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
However, the Thai language has split graphemes, namely "ะ", "แ", "ำ", "ญ", and "ฐ". So I called unicharset_extractor with "--norm_mode 2" instead of 3. I am not sure whether this is the correct setting for norm_mode.
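One way to look at these five characters is to check how they are stored in Unicode. The sketch below uses only Python's standard `unicodedata` module and assumes the characters are the code points U+0E30, U+0E41, U+0E33, U+0E0D and U+0E10. It does not settle which norm_mode is right; it only shows whether each character is a single base code point or a combining mark.

```python
import unicodedata

# The five characters listed above, written as escapes to avoid copy/paste issues.
CHARS = ["\u0e30", "\u0e41", "\u0e33", "\u0e0d", "\u0e10"]


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def is_combining_mark(ch):
    """True if the general category is one of the mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


for ch in CHARS:
    kind = "combining mark" if is_combining_mark(ch) else "base character"
    print(ch, *describe(ch), kind)
```

All five come out as single, non-combining code points, so their "split" appearance is in the glyph shape rather than in the encoding. For comparison, a below-base vowel such as U+0E38 is a combining mark.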
Then I used 'combine_tessdata' to replace 'tha.lstm-unicharset' in 'tha.traineddata'.
I trained the model using
lstmtraining --traineddata tha.traineddata --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c150]' ...
I believe this means that I constructed a NN with:
- input shape (1,36,36,1), i.e. batch size=1 (ignored), bitmap size 36x36 and 1 channel (grayscale)
- Convolution with tanh of size 3x3, 16 filters
- Maxpooling 3x3
- LSTM forward in y-direction, with the output summarized into 48 values
- LSTM forward in x-direction with 96 outputs
- LSTM backward in x-direction with 96 outputs
- LSTM forward in x-direction with 256 outputs
- Output sequence of 150-dim vectors using softmax+CTC.
I copied the model from somewhere on the Internet and modified it. I still don't know what 'summarize' in an LSTM layer actually means.
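To double-check my reading of the spec, here is a minimal sketch that spells out each token. It covers only the layer types used in this particular spec, and the wording of the descriptions is my own interpretation of the VGSL documentation. As far as I can tell from that documentation, the four input numbers are batch, height, width and depth, with 0 meaning "variable", so the input would be 36 pixels high with variable width rather than 36x36. The 's' in 'Lfys48' appears to mean that only the final step of the LSTM is kept, which collapses the y dimension.

```python
import re

SPEC = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c150]"

NONLINEARITY = {"s": "sigmoid", "t": "tanh", "r": "relu", "l": "linear"}
DIRECTION = {"f": "forward", "r": "reverse", "b": "bidirectional"}
OUTPUT_KIND = {"c": "softmax with CTC", "s": "softmax", "l": "logistic"}


def describe(token):
    """Describe one VGSL token; handles only the layer types that occur in SPEC."""
    m = re.fullmatch(r"(\d+),(\d+),(\d+),(\d+)", token)
    if m:
        batch, height, width, depth = (int(g) for g in m.groups())
        width_text = "variable" if width == 0 else str(width)
        return f"input: batch={batch}, height={height}, width={width_text}, depth={depth}"
    m = re.fullmatch(r"C([a-z])(\d+),(\d+),(\d+)", token)
    if m:
        act = NONLINEARITY.get(m.group(1), m.group(1))
        return f"convolution {m.group(2)}x{m.group(3)}, {m.group(4)} filters, {act}"
    m = re.fullmatch(r"Mp(\d+),(\d+)", token)
    if m:
        return f"max-pool {m.group(1)}x{m.group(2)}"
    m = re.fullmatch(r"L([frb])([xy])(s?)(\d+)", token)
    if m:
        text = f"LSTM {DIRECTION[m.group(1)]} along {m.group(2)}, {m.group(4)} outputs"
        if m.group(3):
            text += ", summarized (only the final step is kept, collapsing that dimension)"
        return text
    m = re.fullmatch(r"O(\d)([a-z])(\d+)", token)
    if m:
        kind = OUTPUT_KIND.get(m.group(2), m.group(2))
        return f"output: {m.group(1)}-d, {kind}, {m.group(3)} classes"
    return f"unrecognized token: {token}"


def describe_spec(spec):
    """Split a bracketed VGSL spec on whitespace and describe every token."""
    return [describe(tok) for tok in spec.strip("[]").split()]


for line in describe_spec(SPEC):
    print(line)
```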
During the training I observed lots of warning messages such as
Encoding of string failed! Failure bytes: ffffffe0 ffffffb8 ffffff84 ffffffe0 ffffffb8 ffffffb8 ffffffe0 ffffffb8 ffffffa2 20 ffffffe0 ffffffb9 ffffff80 ffffffe0 ffffffb8 ffffff94 ffffffe0 ffffffb8 ffffffb5 ffffffe0 ffffffb8 ffffffa2 20 ffffffe0 ffffffb8 ffffffa3 ffffffe0 ffffffb8 ffffffb0 ffffffe0 ffffffb8 ffffff9a ffffffe0 ffffffb8 ffffff9a ffffffe0 ffffffb9 ffffff91 ffffffe0 ffffffb9 ffffff99 20 37 37 20 ffffffe0 ffffffb9 ffffff81 ffffffe0 ffffffb8 ffffffa5 ffffffe0 ffffffb8 ffffffb0 ffffffe0 ffffffb8 ffffffa1 ffffffe0 ffffffb8 ffffffb5 2e 22 20 ffffffe0 ffffffb8 ffffffa1 ffffffe0 ffffffb8 ffffffb4 ffffffe0 ffffffb9 ffffff80 ffffffe0 ffffffb8 ffffffa1 ffffffe0 ffffffb8 ffffffb7 ffffffe0 ffffffb8 ffffffad ffffffe0 ffffffb8 ffffff87
Can't encode transcription: 'คุย เดีย ระบบ๑๙ 77 และมี." มิเมือง' in language ''
I don't know what causes this kind of warning or how to solve it, so I just continued the training.
I trained the model for 10M iterations and obtained 'newtha.lstm_checkpoint', which I converted to 'newtha.traineddata' using
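To see what the failing strings actually contain, the byte dump can be turned back into text. The sketch below assumes the dump is UTF-8 printed as sign-extended hex (so 'ffffffe0' stands for the byte 0xe0). The helper names are made up for illustration, and 'known_chars' is a hypothetical set of characters that would have to be collected from the unicharset separately.

```python
def decode_failure_bytes(dump):
    """Turn a 'Failure bytes' dump back into text.

    Assumes each token is one UTF-8 byte printed as (possibly sign-extended)
    hex, so only the low 8 bits of every token are kept.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8")


def chars_not_in(text, known_chars):
    """List the distinct characters of text that are missing from known_chars."""
    return sorted(set(text) - set(known_chars))


# First nine bytes of the warning quoted above.
sample = ("ffffffe0 ffffffb8 ffffff84 ffffffe0 ffffffb8 ffffffb8 "
          "ffffffe0 ffffffb8 ffffffa2")
print(decode_failure_bytes(sample))
```

Passing the decoded text and the characters of the unicharset to `chars_not_in` would list candidates for what the encoder could not handle, although an encoding failure may also have other causes.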
'lstmtraining --stop_training --continue_from newtha.lstm_checkpoint --traineddata tha.traineddata --model_output newtha.traineddata'.
Then I put 'newtha.traineddata' in '/usr/local/share/tessdata/' and called it with 'tesseract -l newtha ...'.
I tested this model on images captured with a smartphone. The character-level accuracy is about 80%, while the default model gives about 95% accuracy. During the test, I also observed that the new model sometimes strangely fails to recognize text that seems to be easy, as shown below.

What should I do next to improve the accuracy? Should I try changing the structure of the LSTM model, training with text that has real meaning, or adding more fonts and other degradations such as Gaussian blur or salt-and-pepper noise?
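Of the degradations mentioned, salt-and-pepper noise is simple enough to sketch without any imaging library. The following is only an illustration on a grayscale image stored as a plain list of rows. The function name and the 'amount' parameter are made up here, and a real pipeline would need to apply this to the rendered training images instead.

```python
import random


def salt_and_pepper(image, amount=0.02, seed=None):
    """Return a copy of a grayscale image with salt-and-pepper noise.

    image  : list of rows, each a list of ints in 0..255
    amount : probability that any given pixel is replaced
    seed   : optional seed so that the result is reproducible
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in image]  # copy so the input stays untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))  # pepper (black) or salt (white)
    return noisy


if __name__ == "__main__":
    gray = [[128] * 8 for _ in range(4)]
    for row in salt_and_pepper(gray, amount=0.2, seed=1):
        print(row)
```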
Any suggestions are welcome and appreciated.
Thank you,
Sanparith