General guidelines for training Arabic

Wael TELLAT

Aug 16, 2023, 12:17:49 AM
to tesseract-ocr
Hi,
I am trying to improve the Arabic model. Unfortunately, the results are not good enough, so I have probably been going in the wrong direction. I would like to hear ideas from the community.

1. Training text:
The training text used for Arabic is very small: 80 lines vs. 193k for English (see langdata_lstm/issues/6), and it does not contain all characters. So I tried to prepare a larger dataset.
- Concerning Arabic diacritics, should the training text contain all combinations of letters and diacritics? For example, if the training text doesn't have the combination بَ (letter ب + fatha), can the model recognise it after training?
- How do I regenerate the files ara.punc, ara.numbers, ara.wordlist, ara.config, ara.unicharset, etc.?
- By the way, most of the files in https://github.com/tesseract-ocr/langdata_lstm haven't changed in 5 years. Is the repository open to contributions? Is it possible to retrain some languages?
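
Whether the model can generalise to letter + diacritic combinations it has never seen remains the open question here. As a rough aid, the sketch below lists which combinations a candidate training text actually contains and which are missing. It assumes the diacritics of interest are the code points U+064B to U+0652 (tanween, fatha, damma, kasra, shadda, sukun); the function names are made up for illustration and are not part of any Tesseract tool.

```python
import unicodedata
from collections import Counter

# Assumption: the diacritics of interest are U+064B..U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def letter_diacritic_pairs(text):
    """Count (base letter, diacritic) pairs that occur in the text."""
    counts = Counter()
    base = None
    for ch in unicodedata.normalize("NFC", text):
        if ch in DIACRITICS:
            # A diacritic attaches to the most recent letter, if any.
            if base is not None:
                counts[(base, ch)] += 1
        elif unicodedata.category(ch) == "Lo":
            base = ch  # Arabic letters have Unicode category "Lo"
        else:
            base = None  # spaces, digits, punctuation reset the base
    return counts


def missing_pairs(text, letters, diacritics=DIACRITICS):
    """Return the (letter, diacritic) pairs that never occur in the text."""
    seen = letter_diacritic_pairs(text)
    return sorted(
        (letter, mark)
        for letter in letters
        for mark in diacritics
        if (letter, mark) not in seen
    )
```

Calling `missing_pairs(training_text, letters)` with the letters you care about gives a list of combinations to add to the training text if full coverage turns out to matter.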

2. Ground truth:
- I used text2image to generate 1k text lines multiplied by ~20 fonts, for a total of 20k images. Is that enough?
- Should I use the box files generated by text2image, or the WordStr format, since Arabic is a right-to-left language?
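
For readers who want to reproduce the setup described above, here is a minimal sketch of how one text2image command per (line, font) pair could be assembled. The flags used are text2image options, but the font names, file paths, and image sizes are placeholders chosen for illustration, not the values used in this post. The code only builds the command; it does not run it.

```python
import shlex


def text2image_cmd(text_file, font, outputbase, fonts_dir="/usr/share/fonts"):
    """Build a text2image command that renders one line of text in one font."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",    # placeholder path, adjust to your system
        "--max_pages=1",
        "--strip_unrenderable_words",
        "--xsize=3600",                # illustrative image size
        "--ysize=480",
    ]


def total_images(num_lines, num_fonts):
    """Each line is rendered once per font."""
    return num_lines * num_fonts


if __name__ == "__main__":
    fonts = ["Amiri", "Arial"]  # hypothetical font names
    for font in fonts:
        cmd = text2image_cmd("line_0001.gt.txt", font, f"out/line_0001.{font}")
        print(shlex.join(cmd))
    print(total_images(1000, 20))  # 1k lines x 20 fonts
```

With 1k lines and 20 fonts, `total_images` gives the 20k images mentioned above.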

3. Training:
I started from the existing model (START_MODEL=ara) and trained for 20k iterations. Is that enough?
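
One way to put the iteration count in perspective is to compare it with the size of the dataset. The sketch below assumes that each training iteration consumes one line image, which is my understanding of how Tesseract's LSTM trainer counts iterations; treat it as a rough estimate rather than a statement about the trainer's internals. The function names are illustrative.

```python
def epochs_covered(iterations, num_line_images):
    """Approximate number of passes over the data, assuming 1 line per iteration."""
    return iterations / num_line_images


def iterations_for(epochs, num_line_images):
    """Iterations needed for a given number of passes under the same assumption."""
    return epochs * num_line_images


if __name__ == "__main__":
    print(epochs_covered(20_000, 20_000))  # 20k iterations over 20k images
    print(iterations_for(10, 20_000))      # iterations for 10 passes
```

Under that assumption, 20k iterations over 20k images is roughly a single pass through the data, so the figure is better judged relative to the dataset size than on its own.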

If you have any other suggestions or remarks, please share.
Thanks!

Des Bw

Sep 8, 2023, 5:46:17 AM
to tesseract-ocr
I am also just starting out with Tesseract, and I am not an expert by any means.
But from what I have read in various places, it might be good to increase the number of training lines to get better results. The number of iterations is sufficient for a first round; you can increase it step by step.