Hi,
I am trying to improve the model for the Arabic language. Unfortunately, the results are not good enough, probably because I've been going in the wrong direction. So I would like to get some ideas from the community.
1. Training text:
The training text used for training Arabic is very small: 80 lines vs. 193k for English (see langdata_lstm/issues/6), and it does not contain all characters. So I tried to prepare a larger dataset.
- Concerning Arabic diacritics, should the training text contain all combinations of letters and diacritics? For example, if the training text doesn't have the combination بَ (letter ب + fatha), can the model recognise it after training?
- How can I regenerate the files ara.punc, ara.numbers, ara.wordlist, ara.config, ara.unicharset... ?
- By the way, most of the files in https://github.com/tesseract-ocr/langdata_lstm haven't changed in 5 years. Is the repository open to contributions? Is it possible to retrain some languages?
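To make sure rare letter+diacritic pairs appear in the training text at all, one option is to generate them systematically. A minimal sketch (the letter and diacritic sets below are an illustrative subset, not the full Arabic inventory):

```python
# Enumerate letter+diacritic combinations so the training text covers
# pairs like بَ even when they are rare in a natural corpus.
letters = ["ب", "ت", "ج", "س", "ع", "م"]        # illustrative subset of Arabic letters
diacritics = ["\u064E", "\u064F", "\u0650",      # fatha, damma, kasra
              "\u064B", "\u064C", "\u064D",      # fathatan, dammatan, kasratan
              "\u0651", "\u0652"]                # shadda, sukun

combinations = [letter + mark for letter in letters for mark in diacritics]

# Group combinations into short "words" so text2image renders
# natural-looking lines instead of one huge token.
lines = [" ".join(combinations[i:i + 8]) for i in range(0, len(combinations), 8)]
print(len(combinations))  # 6 letters x 8 marks = 48 combinations
```

Such synthetic lines would only supplement, not replace, real running text, since the model also needs realistic word and line statistics.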
2. Ground truth:
- I used text2image to generate 1k text lines multiplied by ~20 fonts, for a total of 20k images. Is that enough?
- Should I use the box files generated by text2image, or the WordStr format, since Arabic is a right-to-left language?
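For reference, in the WordStr variant each line image gets a single box entry carrying the whole line's text after `#`, followed by a tab entry marking the end of the line. A small sketch of how such entries could be built (the coordinates are illustrative: here the box simply spans the whole line image):

```python
def wordstr_box(text: str, width: int, height: int, page: int = 0) -> str:
    """Build a WordStr-format box entry for one text-line image.

    One 'WordStr' entry holds the full line text after '#', and a
    tab-prefixed entry marks the end of the line. Coordinates are
    illustrative (whole-image box), not measured glyph positions.
    """
    line = f"WordStr 0 0 {width} {height} {page} #{text}"
    eol = f"\t 0 0 {width} {height} {page}"
    return line + "\n" + eol + "\n"

print(wordstr_box("النص العربي هنا", 600, 80), end="")
```

The appeal for right-to-left scripts is that the transcription is stored in logical (reading) order for the whole line, so per-character box ordering issues disappear.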
3. Training:
I started from the existing model (START_MODEL=ara) and trained for 20k iterations. Is that enough?
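As a rough back-of-the-envelope check, assuming one training iteration consumes roughly one line image (an approximation of how lstmtraining iterates over the data):

```python
# Rough estimate of how many passes over the dataset 20k iterations gives,
# assuming ~1 line image consumed per iteration (an approximation).
num_line_images = 1_000 * 20   # 1k lines x ~20 fonts
iterations = 20_000

epochs = iterations / num_line_images
print(epochs)  # -> 1.0: about one pass over the data
```

A single pass is usually on the low side for fine-tuning; in practice people watch the reported character error rate and keep training until it plateaus rather than fixing the iteration count up front.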
If you have any other suggestions or remarks, please share.
Thanks!