Hi everyone,
I have been experimenting with Tesseract for the Farsi language for a while. The performance of the default LSTM model is good, but I would like to know if I can improve it further. I also tried training from scratch, since I ran into some unichar errors.
Before I get to my training problem, I have some general questions that are not explained in the Tesseract paper or documentation, or at least I couldn't find the answers.
First question:
I want to know more about the characteristics of the data Tesseract was trained on. Are there any differences between the training data for Tesseract 5 and Tesseract 4? Does it consist only of single text lines? Does it contain noise? Is there any connection or dependency between the words of each line?
Second question:
After searching, I found that the default batch size is 1. Does that mean Tesseract 5 was trained with a batch size of 1? How can I change it?
Third question:
Since I didn't get high accuracy, I decided to fine-tune the fas model using the START_MODEL variable. But when I checked lstmtraining --help, I found the --continue_from option, and now I am confused about which one I should use for fine-tuning.
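For context, my current understanding (which may be wrong, so please correct me) is that tesstrain's START_MODEL is just a Makefile wrapper that ends up passing an extracted network to lstmtraining's --continue_from, so they are not competing options. A rough sketch of what I believe the equivalent manual fine-tuning commands look like, with placeholder paths I made up for illustration:

```shell
# Extract the LSTM network from the stock fas model
# (paths are hypothetical examples)
combine_tessdata -e /usr/share/tesseract-ocr/5/tessdata/fas.traineddata fas.lstm

# Fine-tune: --continue_from points at the extracted network;
# --old_traineddata lets lstmtraining remap the old unicharset
# onto the new one if they differ
lstmtraining \
  --continue_from fas.lstm \
  --old_traineddata /usr/share/tesseract-ocr/5/tessdata/fas.traineddata \
  --traineddata ../data/dori/dori.traineddata \
  --model_output ../data/checkpoints/dori \
  --train_listfile ../data/dori/list.train \
  --max_iterations 10000
```

Is this reading correct, i.e. START_MODEL generates the --continue_from argument for me?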
Fourth question:
I am training Tesseract version 5.2.0 from scratch with about 40,000 lines of data and 3 new fonts for the Farsi language. Although every step seems correct, I am getting a high error rate: BCER started at 99.76 and only dropped to 91.93 after 10,000 iterations.
I want to know the reason behind this poor CER.
!export OMP_THREAD_LIMIT=16
!make training \
START_MODEL=fas \
MODEL_NAME=dori \
LANG_TYPE=RTL \
LANG_CODE=fas \
TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
DATA_DIR=../data \
MAX_ITERATIONS=10000
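In case it helps with diagnosing the fourth question, this is how I am planning to sanity-check checkpoints against a held-out list with lstmeval (file names below are examples from my setup, not anything standard):

```shell
# Evaluate a training checkpoint on an eval list of .lstmf files
# (paths are examples specific to my layout)
lstmeval \
  --model ../data/checkpoints/dori_checkpoint \
  --traineddata ../data/dori/dori.traineddata \
  --eval_listfile ../data/dori/list.eval
```

Does the reported eval BCER usually track the training BCER this closely, or could a gap here point to a data problem?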