How to train Tesseract 4.0 for Arabic

Ahmad Moawad

Apr 8, 2017, 4:52:25 AM

to tesseract-ocr

Hello All,

I want to train Tesseract 4.0 for Arabic. The results of this version are great but still need some tuning, so I got jTessBoxEditor 2.0 beta.
I tried to correct the misrecognized characters and build a new ara.traineddata. After copying that ara.traineddata to /usr/share/tesseract-ocr/4.00/tessdata, I got random characters when I ran tesseract on the image.
Any suggestions on how to train version 4.0? I already know that the cube engine from the last 3.0x version is not included in the 4.0 LSTM engine. Or should I wait until Ray releases an updated ara.traineddata?

Thanks.
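
One way to narrow down a "random characters" result is to look at what the rebuilt file actually contains. The following is only a sketch: it assumes the tessdata path given in the post, and it assumes (not confirmed in this thread) that a file built with the 3.0x training flow has no LSTM component for the 4.0 engine to use. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Sketch: inspect which engine components a traineddata file contains.
# Assumption: DRY_RUN=1 (the default here) only prints the commands.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
TESSDATA="/usr/share/tesseract-ocr/4.00/tessdata"   # path taken from the post

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run tesseract --version                              # confirm a 4.0 build is in use
run tesseract --list-langs                           # confirm "ara" is found
run combine_tessdata -d "$TESSDATA/ara.traineddata"  # list the file's components
```

If the component listing shows no lstm entry, the file can only be used by the legacy engine.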

Quan Nguyen

Apr 10, 2017, 11:27:30 PM

to tesseract-ocr

The jTessBoxEditor 2.0 beta versions bundle the latest Tesseract 4.00alpha training executables. The training process for 4.00, however, has not been integrated into the program. The 3.0x training process is still supported.

Check out the two videos that depict the 3.0x training process:

https://wn.com/training_tesseract_ocr_for_arabic_language_tutorial

bmwmine

May 1, 2017, 10:00:11 AM

to tesseract-ocr

I think jTessBoxEditor 2.0 has been updated to include Tesseract 4.00dev.

1- Could anybody confirm this? I am not getting better results for Arabic with it.

2- How much training data is sufficient to get the best results for a new font, e.g. how many TIFF pages?

Ibr

May 4, 2017, 5:28:13 AM

to tesseract-ocr

If you are referring to Tesseract 4.00alpha with Leptonica 1.74.1, and you compiled them correctly and got the binaries needed to create .lstmf files, then I recommend following the suggestion made by the Tesseract devs: once you have created an .lstmf file for a font that can be used for Arabic writing, get the official ara.traineddata file from GitHub, put it in the tessdata folder, put the .lstmf file in the tesseract folder, and run the command:

tesseract text_image result_text -l ara --oem 1

Which Arabic characters exactly are you trying to improve the accuracy for?
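
For reference, the recognition command from this post can be wrapped as below. This is a minimal sketch: page.png and result are placeholder names, and the script only prints the command unless DRY_RUN=0 is set. It runs recognition with the LSTM engine (--oem 1) and performs no training.

```shell
#!/usr/bin/env bash
# Sketch: run recognition with the LSTM engine; this does not train anything.
# Assumptions: page.png and result are placeholder names, and DRY_RUN=1
# (the default here) only prints the command.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
IMAGE="page.png"   # hypothetical input image
OUTBASE="result"   # tesseract would write result.txt

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run tesseract "$IMAGE" "$OUTBASE" -l ara --oem 1
```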

Ahmad Moawad

May 4, 2017, 5:49:01 AM

to tesseract-ocr

My scenario is training from images, not from text. I want to fine-tune recognition so that I get, for example:
لمجرد not ملجرد
and so on.

ShreeDevi Kumar

May 4, 2017, 5:52:42 AM

to tesser...@googlegroups.com

Ibr,

You are incorrect in your description of LSTM training.

What you are doing will use the ara.traineddata provided in the repo, so there will be no change in output.

Once the lstmf files are created, you have to run lstmtraining, which will run for days/weeks to give you a good result.

Please read about LSTM training on the wiki.
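
As a rough illustration of the lstmtraining step, a fine-tuning run that starts from the existing Arabic model could look like the sketch below. The flags follow the 4.0 training wiki as I understand it; the file locations, the output prefix and the iteration count are assumptions chosen for illustration. Fine-tuning needs a float ("best") model rather than an integer one. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: fine-tune the existing Arabic LSTM model on new .lstmf files.
# Assumptions: all paths below are hypothetical; DRY_RUN=1 only prints.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
BASE_MODEL="tessdata_best/ara.traineddata"  # assumed location of a float model
LIST_FILE="ara.training_files.txt"          # assumed: one .lstmf path per line
OUT="output/ara_tuned"                      # hypothetical checkpoint prefix

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run mkdir -p output

# 1. Extract the LSTM network from the existing model.
run combine_tessdata -e "$BASE_MODEL" ara.lstm

# 2. Continue training that network on the new .lstmf files.
run lstmtraining \
  --continue_from ara.lstm \
  --traineddata "$BASE_MODEL" \
  --train_listfile "$LIST_FILE" \
  --model_output "$OUT" \
  --max_iterations 400

# 3. Convert the last checkpoint into a usable traineddata file.
run lstmtraining --stop_training \
  --continue_from "${OUT}_checkpoint" \
  --traineddata "$BASE_MODEL" \
  --model_output "${OUT}.traineddata"
```

The resulting file would then be copied into the tessdata folder under its own language name so that it does not overwrite the official ara.traineddata.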

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1c842b1e-1dc1-418b-a5b7-368c11e7dfa5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ahmad Moawad

May 4, 2017, 5:52:52 AM

to tesseract-ocr

As for jTessBoxEditor 2.0, I tried it, but I didn't get any results!
As for your question about how much training data is sufficient to get the best results for a new font (e.g. how many TIFF pages),
I think this is mentioned in the wiki: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

Ibr

May 4, 2017, 7:06:22 AM

to tesseract-ocr

While I was creating lstmf files so that I could use them for recognizing text images, I found that some characters are recognized incorrectly. Some of them are not integrated in Tesseract, and some are due to the way certain Arabic fonts write them.

In the latter case Tesseract behaves correctly, but the Arabic font uses a different script style; in the former case Tesseract makes a mistake in detecting the characters.

Both cases are described in this issue that I opened a few days ago:

https://github.com/tesseract-ocr/tesseract/issues/840

Ahmad Moawad

May 4, 2017, 7:47:12 AM

to tesseract-ocr

Ibr, please give me your email address.

Message has been deleted

Ahmad Moawad

May 4, 2017, 8:06:34 AM

to tesseract-ocr

Check your email.

On Thursday, May 4, 2017 at 1:51:04 PM UTC+2, Ibr wrote:

ibr.h...@gmail.com

Ibr

May 4, 2017, 8:07:07 AM

to tesseract-ocr

Hi Shree,

Actually, I saw the section about lstmtraining, but what I described was the result of following Tesseract's own messages. From the beginning: I used to train .traineddata files for English, and that worked fine, but for Arabic it was failing. Then I saw the --oem argument of tesseract and used it, and Tesseract asked for the lstm file. After that I came across the article about Tesseract 4.00alpha, which includes Arabic.

Then I created the lstmf file, but again Tesseract failed to detect the text in the image. I suspected that the old .traineddata (created by Tesseract 3.03) might not be compatible with the lstmf file, searched for the cause of the problem, and found this issue. I got the official traineddata, and the recognition of the Arabic text image was correct except for the characters that I described in the issue I referred to earlier.

If I'm not mistaken, the lstmtraining section is about improving accuracy, correct?

It seems that if the لا case and the الم case are solved in ara.traineddata, the accuracy of Arabic recognition will be as good as that of English.


Ibr

May 4, 2017, 8:07:58 AM

to tesseract-ocr

ibr.h...@gmail.com

Ibr

May 4, 2017, 8:08:51 AM

to tesseract-ocr

ibr.h...@gmail.com

Ibr

May 4, 2017, 8:44:05 AM

to tesseract-ocr

replied to it

manu pranay

Jan 28, 2020, 1:38:47 AM

to tesseract-ocr

Shree,

Can you please help me with how to perform Arabic training on Tesseract 4?

Thank you.


Shree Devi Kumar

Jan 28, 2020, 3:45:39 AM

to tesseract-ocr

Please see https://github.com/Shreeshrii/tesstrain-ckb

This is for finetune training from script/Arabic, using text and fonts.

You would need to do steps similar to

https://github.com/Shreeshrii/tesstrain-ckb/blob/master/0-setup.sh

https://github.com/Shreeshrii/tesstrain-ckb/blob/master/2-txt2img.sh

https://github.com/Shreeshrii/tesstrain-ckb/blob/master/3-img2lstmf.sh

https://github.com/Shreeshrii/tesstrain-ckb/blob/master/4-train-layer.sh
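
To give an idea of what the text-to-image and image-to-lstmf steps involve, here is a minimal sketch. It is not the content of the linked scripts: the training text file, the font name (Amiri is used only as an example) and the fonts directory are assumptions. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: render training text with a font and turn it into an .lstmf file.
# Assumptions: file names, font and fonts directory are hypothetical;
# DRY_RUN=1 (the default here) only prints the commands.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
TEXT="ara.training_text"      # hypothetical UTF-8 file with Arabic text lines
FONT="Amiri"                  # example font name; use one installed locally
FONTS_DIR="/usr/share/fonts"  # assumed font directory
BASE="ara.${FONT}.exp0"

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Render the text into a TIFF image plus a matching box file.
run text2image --text="$TEXT" --outputbase="$BASE" \
  --font="$FONT" --fonts_dir="$FONTS_DIR"

# Turn the image/box pair into an .lstmf training file.
run tesseract "${BASE}.tif" "$BASE" --psm 6 lstm.train

# List the .lstmf files for lstmtraining's --train_listfile option.
if [ "$DRY_RUN" != "1" ]; then
  ls "${BASE}.lstmf" > ara.training_files.txt
fi
```

The resulting list file is what the training step reads its samples from.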
