Hi everyone,
I have been experimenting with Tesseract for the Farsi language for a while. The performance of the default LSTM model is good, but I would like to know if I can improve it further. I also tried training from scratch, since I ran into some unichar errors.
Before I get to my training problem, I have some general questions that are not explained in the Tesseract paper or documentation, or at least I couldn't find the answers.
First question:
I want to know more about the characteristics of the data Tesseract was trained on. Are there any differences between the training data for Tesseract 5 and Tesseract 4? Does it consist only of single text lines? Does it contain noise? Is there any connection or dependency between the words of each line?
Second question:
After searching, I found that the default batch size is 1. Does that mean Tesseract 5 was trained with a batch size of 1? How can I change it?
Third question:
Since I didn't get high accuracy, I decided to fine-tune the fas model using the START_MODEL variable. But when I checked lstmtraining --help, I found the --continue_from option, and now I am confused about which one I should use for fine-tuning.
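For context, my current understanding (which may be wrong, so please correct me) is that tesstrain's START_MODEL is just a Makefile wrapper that ends up passing an extracted network to lstmtraining's --continue_from, so they are not competing options. A rough sketch of what I believe the equivalent manual fine-tuning commands look like, with placeholder paths I made up for illustration:

```shell
# Extract the LSTM network from the stock fas model
# (paths are hypothetical examples)
combine_tessdata -e /usr/share/tesseract-ocr/5/tessdata/fas.traineddata fas.lstm

# Fine-tune: --continue_from points at the extracted network;
# --old_traineddata lets lstmtraining remap the old unicharset
# onto the new one if they differ
lstmtraining \
  --continue_from fas.lstm \
  --old_traineddata /usr/share/tesseract-ocr/5/tessdata/fas.traineddata \
  --traineddata ../data/dori/dori.traineddata \
  --model_output ../data/checkpoints/dori \
  --train_listfile ../data/dori/list.train \
  --max_iterations 10000
```

Is this reading correct, i.e. START_MODEL generates the --continue_from argument for me?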
Fourth question:
I am training Tesseract version 5.2.0 from scratch with about 40,000 lines of data and 3 new fonts for the Farsi language. Although every step seems correct, I am getting a high error rate: BCER started at 99.76 and only dropped to 91.93 after 10,000 iterations.
I want to know the reason behind this poor CER.
!export OMP_THREAD_LIMIT=16
!make training \
START_MODEL=fas \
MODEL_NAME=dori \
LANG_TYPE=RTL \
LANG_CODE=fas \
TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
DATA_DIR=../data \
MAX_ITERATIONS=10000
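In case it helps with diagnosing the fourth question, this is how I am planning to sanity-check checkpoints against a held-out list with lstmeval (file names below are examples from my setup, not anything standard):

```shell
# Evaluate a training checkpoint on an eval list of .lstmf files
# (paths are examples specific to my layout)
lstmeval \
  --model ../data/checkpoints/dori_checkpoint \
  --traineddata ../data/dori/dori.traineddata \
  --eval_listfile ../data/dori/list.eval
```

Does the reported eval BCER usually track the training BCER this closely, or could a gap here point to a data problem?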