How to make training for Arabic in Tesseract 4.0


Ahmad Moawad

Apr 8, 2017, 4:52:25 AM
to tesseract-ocr
Hello all,

I want to train Tesseract 4.0 for Arabic. The results from this version are great but still need some tuning, so I got jTessBoxEditor 2.0 beta.
I tried correcting the misrecognized characters and building a new ara.traineddata. After copying that ara.traineddata to /usr/share/tesseract-ocr/4.00/tessdata, I got random characters when I ran tesseract on the image.
Does anyone have suggestions on how to train for version 4.0? I already know that the cube engine from 3.0x is not included in the 4.0 LSTM version. Or should I wait until Ray releases an updated ara.traineddata?

Thanks.

Quan Nguyen

Apr 10, 2017, 11:27:30 PM
to tesseract-ocr
The jTessBoxEditor 2.0 beta versions bundle the latest Tesseract 4.00alpha training executables. The training process for 4.00, however, has not yet been integrated into the program. The 3.0x training process is still supported.

Check out the two videos that depict the 3.0x training process:

bmwmine

May 1, 2017, 10:00:11 AM
to tesseract-ocr
I think jTessBoxEditor 2.0 has been updated to include Tesseract 4.00dev.

1- Could anybody confirm this? I am not getting better results for Arabic with it.

2- How much training data is enough to get the best results for a new font, e.g. how many TIFF pages?


Ibr

May 4, 2017, 5:28:13 AM
to tesseract-ocr
If you are referring to Tesseract 4.00alpha with Leptonica 1.74.1, and you compiled them correctly and got the binaries needed to create .lstmf training files, then I recommend following the suggestion made by the Tesseract devs: once you create an .lstmf file for a certain font (one that can be used for Arabic writing), get the official ara.traineddata file from GitHub, put it in the tessdata folder and the .lstmf file in the tesseract folder, and run the command  tesseract text_image result_text -l ara --oem 1
Which Arabic characters exactly are you trying to improve the accuracy for?

Ahmad Moawad

May 4, 2017, 5:49:01 AM
to tesseract-ocr
My scenario is training from images, not from text. I want to fine-tune the recognition of words such as:
لمجرد not ملجرد
and so on.
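
For the image-based case, Tesseract 4 includes an lstm.train config that writes a .lstmf file from an image and its matching box file. A minimal sketch, assuming a hypothetical train/ directory that holds pairs such as page1.tif and page1.box in the line-level box format that LSTM training expects (character boxes made for 3.0x training may not be usable as-is). The --psm 6 setting and the list-file name are also assumptions. The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: turn image + box-file pairs into .lstmf training files.
# Dry run by default: commands are printed; set RUN=1 to execute them.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

DATA=train                            # hypothetical directory with *.tif and *.box pairs
LIST="$DATA/ara.training_files.txt"   # hypothetical list-file name

for img in "$DATA"/*.tif; do
  [ -e "$img" ] || continue           # nothing to do if no .tif files are present
  base="${img%.tif}"
  # The lstm.train config writes $base.lstmf from $base.tif and $base.box.
  run tesseract "$img" "$base" --psm 6 lstm.train
done

# Collect the generated files into the list that lstmtraining reads
# through its --train_listfile option.
if [ "${RUN:-0}" = "1" ]; then
  ls "$DATA"/*.lstmf > "$LIST"
else
  echo "+ ls $DATA/*.lstmf > $LIST"
fi
```

The resulting list file is what the lstmtraining step takes as input. Creating the .lstmf files alone does not change recognition results.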

ShreeDevi Kumar

May 4, 2017, 5:52:42 AM
to tesser...@googlegroups.com

Ibr,

You are incorrect in your description of LSTM training.

What you are doing will use the ara.traineddata provided in the repo, there will be no change in output.

Once the lstmf files are created, you have to run lstmtraining, which will run for days/weeks to give you a good result.

Please read about LSTM training on the wiki.
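
The workflow described above (create .lstmf files, then run lstmtraining) might look roughly like the sketch below when fine-tuning from the official Arabic model. It is only a sketch under assumptions: the paths, the list-file name, the output prefix and the iteration count are placeholders, and the options follow the TrainingTesseract-4.00 wiki page, so an early 4.00alpha build may expect different ones. The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: fine-tune an existing LSTM model on your own .lstmf files.
# Dry run by default: commands are printed; set RUN=1 to execute them.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

BASE=tessdata_best/ara.traineddata   # assumed location of the official LSTM model
LIST=train/ara.training_files.txt    # assumed text file listing your .lstmf files
OUT=out                              # assumed output directory

run mkdir -p "$OUT"

# 1. Extract the LSTM network from the official traineddata.
run combine_tessdata -e "$BASE" "$OUT/ara.lstm"

# 2. Continue training from that network on your own data.
#    The iteration count is a placeholder; real runs can take much longer.
run lstmtraining \
  --continue_from "$OUT/ara.lstm" \
  --traineddata "$BASE" \
  --train_listfile "$LIST" \
  --model_output "$OUT/ara_tuned" \
  --max_iterations 400

# 3. Convert the latest checkpoint into a traineddata file usable with -l.
run lstmtraining --stop_training \
  --continue_from "$OUT/ara_tuned_checkpoint" \
  --traineddata "$BASE" \
  --model_output "$OUT/ara_tuned.traineddata"
```

Only the file produced in step 3 changes recognition output. Running tesseract with the unmodified ara.traineddata gives the same results as before, whatever .lstmf files exist.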


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1c842b1e-1dc1-418b-a5b7-368c11e7dfa5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ahmad Moawad

May 4, 2017, 5:52:52 AM
to tesseract-ocr
I tried jTessBoxEditor 2.0, but I didn't get any results!
As for your question about how much training data is enough to get the best results for a new font (e.g. how many TIFF pages),
I think this is mentioned in the wiki: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

Ibr

May 4, 2017, 7:06:22 AM
to tesseract-ocr
While I was creating lstmf files so that I could use them for recognizing text images, I found that some characters are recognized incorrectly. Some of the errors are because the characters are not supported by Tesseract, and some are due to the way certain Arabic fonts are written.

In the latter case Tesseract behaves correctly but the Arabic font uses a different style of script; in the other case, Tesseract makes a mistake in detecting the characters.
Both cases are described in this issue that I opened a few days ago.

Ahmad Moawad

May 4, 2017, 7:47:12 AM
to tesseract-ocr
Ibr, give me your email!

Ahmad Moawad

May 4, 2017, 8:06:34 AM
to tesseract-ocr
Check your email.

Ibr

May 4, 2017, 8:07:07 AM
to tesseract-ocr
Hi Shree,
Actually, I saw the section about lstmtraining, but what I described was the result of following Tesseract's messages. What happened from the beginning is this: I used to train .traineddata files for English and they worked fine, but for Arabic it was failing. So I noticed the oem argument in tesseract and used it, and then tesseract asked for the lstm file. Then I came across the article about Tesseract 4.00alpha, which includes Arabic.
Then I created the lstm file, but tesseract again failed to detect the text in the image. I suspected that the old .traineddata (created by Tesseract 3.03) was not compatible with the lstmf file, so I searched for the cause of the problem and found this issue. I got the official traineddata, and the accuracy for detecting Arabic text in images was correct except for the characters that I described in the issue I referred to earlier.

If I'm not mistaken, the lstmtraining section is about improving the accuracy, correct?
It seems that if the لا case and the الم case are solved in ara.traineddata, the accuracy of Arabic detection will be as good as English detection.




Ibr

May 4, 2017, 8:44:05 AM
to tesseract-ocr
Replied to it.

manu pranay

Jan 28, 2020, 1:38:47 AM
to tesseract-ocr
Shree,
can you please help me with how to perform Arabic training on Tesseract 4?

Thank you



Shree Devi Kumar

Jan 28, 2020, 3:45:39 AM
to tesseract-ocr



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com