General questions about Tesseract 5 and poor BCER for the Farsi language


maedeh kafiyan safari

Oct 3, 2022, 4:06:43 AM
to tesseract-ocr
Hi everyone,

I have been experimenting with Tesseract for the Farsi language for a while. The default LSTM model performs well, but I would like to know whether I can improve it further. I tried training a model from scratch because I ran into some unichar errors.

Before I describe my training problem, I have some general questions that are not answered in the Tesseract paper or documentation, or at least I couldn't find the answers.

First question:
I want to know more about the characteristics of the data that Tesseract was trained on. Are there any differences between the training data for Tesseract 5 and Tesseract 4? Does it consist only of single text lines? Does it contain noise? Is there any connection or dependency between the words in each line?

Second question:
From what I found, the default batch size is 1. Does this mean that Tesseract 5 was trained with a batch size of 1? How can I change it?

Third question:
Since I didn't get high accuracy, I decided to fine-tune the fas model using the START_MODEL variable. But when I checked lstmtraining --help, I found the continue_from option, and now I am confused about which one I should use for fine-tuning.

Fourth question:
I am training Tesseract 5.2.0 from scratch with about 40,000 training samples and 3 new fonts for the Farsi language. Although every step seems to be correct, the error rate is very high: BCER went from 99.76 to only 91.93 after 10,000 iterations.
What is the reason for this poor CER?

!export OMP_THREAD_LIMIT=16
!make training \
  START_MODEL=fas \
  MODEL_NAME=dori \
  LANG_TYPE=RTL \
  LANG_CODE=fas \
  TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
  DATA_DIR=../data \
  MAX_ITERATIONS=10000
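
To make the BCER figures above easier to interpret, here is a minimal sketch of how a character error rate is usually defined: the Levenshtein edit distance between the reference text and the OCR output, divided by the length of the reference. This assumes the standard definition; whether lstmtraining computes BCER in exactly this way is an assumption on my part, and the function names `levenshtein` and `cer_percent` are only illustrative.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character from hyp
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer_percent(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length.

    Lengths are counted in Unicode code points, so combining marks and
    the zero-width non-joiner used in Farsi each count as one character.
    """
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One wrong character out of four gives a 25% error rate.
    print(cer_percent("abcd", "abcf"))
```

Under this definition, a value above 90 means that almost every character in the reference has to be edited to match the output, so the model is producing close to nothing useful at that point.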

  