Input in Arabic Eastern Numbers and Output in Arabic Western Numbers

Mobeen Ali

Oct 14, 2019, 2:46:21 AM
to tesseract-ocr
Hello everyone! I'm stuck on creating a traineddata file that reads Eastern Arabic numerals and outputs Western (English) numerals.
  • Input = AEN, Arabic Eastern Numbers {٠١٢٣٤٥٦٧٨٩}
  • Output = AWN, Arabic Western Numbers {0123456789}
I have successfully created a traineddata file with very good accuracy, but it takes Arabic numerals as input and gives Arabic numerals as output.

What I want is for it to take Arabic numerals as input and give English numerals as output.

If anyone knows how to do this, please help!

Shree Devi Kumar

Oct 14, 2019, 4:05:01 AM
to tesseract-ocr
Replace the AEN characters in your box files with AWN and rerun training, using the original tif files.
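
A minimal Python sketch of that replacement, not taken from the thread: it assumes the box files are UTF-8 text and that AEN means the ten code points U+0660 to U+0669. The names `aen_to_awn` and `convert_box_file` are hypothetical helpers. Only the Eastern digits are mapped, so the ASCII bounding-box coordinates on each line pass through unchanged.

```python
from pathlib import Path

# Eastern Arabic digits (assumed to be U+0660..U+0669) and their ASCII equivalents.
# Built from code points to avoid copy/paste problems with right-to-left text.
AEN = "".join(chr(0x0660 + i) for i in range(10))
AWN = "0123456789"
AEN_TO_AWN = str.maketrans(AEN, AWN)


def aen_to_awn(text: str) -> str:
    """Return text with every Eastern Arabic digit replaced by its Western digit."""
    return text.translate(AEN_TO_AWN)


def convert_box_file(src: str, dst: str) -> None:
    """Hypothetical helper: rewrite one box file with AEN replaced by AWN."""
    content = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(aen_to_awn(content), encoding="utf-8")
```

Calling `convert_box_file("ara.Arial.box", "ara.Arial.box")` would overwrite the file in place; keeping a backup copy first is probably wise.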

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2edb580d-c16e-4b0a-a704-15929982a372%40googlegroups.com.

Mobeen Ali

Dec 1, 2019, 6:41:46 AM
to tesseract-ocr
So, here's what I did:
  1. I ran text2image with my training_text file:
    text2image --text /home/mobeen/customtrain/langdata/ara/ara.training_text \
    --outputbase /home/mobeen/customtrain/tiff-box/ara.Arial \
    --fonts_dir /home/mobeen/Documents/fonts \
    --font 'Arial'
    This produced tiff and box files. I deleted the box file created by text2image because it is not in LSTM format.
  2. Then I ran:
    tesseract /home/mobeen/customtrain/tiff-box/ara.Arial.tif /home/mobeen/customtrain/tiff-box/ara.Arial -l ara-new lstmbox
    This gave me the LSTM-format box file.
  3. Next, I opened this box file, replaced all AEN with AWN, and saved the file.
  4. Then I ran tesstrain with the --my_boxtiff_dir argument, as follows:
    src/training/tesstrain.sh \
    --fonts_dir /home/mobeen/Documents/fonts \
    --lang ara --linedata_only --noextract_font_properties \
    --langdata_dir ../langdata \
    --tessdata_dir ./tessdata \
    --output_dir ~/customtrain/aratrain \
    --fontlist 'Arial' \
    --my_boxtiff_dir /home/mobeen/customtrain/tiff-box
    This generated the lstmf file and gave me a starter traineddata file.
  5. Next I ran:
    training/lstmtraining --debug_interval -1 \
    --traineddata ~/customtrain/aratrain/ara/ara.traineddata \
    --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
    --model_output ~/customtrain/araoutput/base --learning_rate 20e-4 \
    --train_listfile ~/customtrain/aratrain/ara.training_files.txt \
    --eval_listfile ~/customtrain/araeval/ara.training_files.txt \
    --max_iterations 3600 &>~/customtrain/araoutput/basetrain.log
    In another terminal window I ran:
    tail -f ~/customtrain/araoutput/basetrain.log
    which displayed this:
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 3 :
    Mean rms=0.585%, delta=0.957%, train=2.68%(4.53%), skip ratio=0%
    Iteration 3588: GROUND  TRUTH : يف نأ ةفاضإ ١ مالفا و امك خيرات ٢ ةيسيئرلا ٣ مقر ٤ برعلا
    Iteration 3588: BEST OCR TEXT : يف نأ ةفشإ ١ مالا و امك خيراا ٢ ةيسيئرلا ٣ مقر ٤ برملا
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 4 :
    Mean rms=0.588%, delta=0.963%, train=2.691%(4.558%), skip ratio=0%
    Iteration 3589: GROUND  TRUTH : ىدتنم ٨ نآلا دق ٥ مسق ٧ ةفاضإ _ ٦ عيقوتلا ٩ ةيبرعلا ىدتنم
    Iteration 3589: BEST OCR TEXT : ىدتنم ٥ نآلا هق ٥ مسا ٧ ةفاضإ _ ٦ عيقوتلا ٢ ةيبرعلا ىدتنم
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 5 :
    Mean rms=0.59%, delta=0.968%, train=2.705%(4.587%), skip ratio=0%
    Iteration 3590: GROUND  TRUTH : ةيزمرلا ٦ ىلإ ٩ جماربلا ٨ ذنم ٥ ١ ىدتنملا ٧ نع ىدتنم
    Iteration 3590: BEST OCR TEXT : ةيزمرلا ١ ىلإ ٩ جماربلا ٨ انم ٥ ١ ىدتنسلا ٧ نع ىدتنم
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 6 :
    Mean rms=0.592%, delta=0.971%, train=2.717%(4.61%), skip ratio=0%
    Iteration 3591: GROUND  TRUTH : هيف ٧ دمحأ ٩ ةيزمرلا ٣ دوك ٥ رورملا ١ حب هل ٦ ةفاك ٨ ماعلا ٣ يلع
    Iteration 3591: BEST OCR TEXT : هيف ٧ دمحأ ٣ ةيزمرلا ٣ دوك ٥ رورملا ٠ نب هل ٦ ةفا ٥ مسقا ٣ يلع
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 7 :
    Mean rms=0.594%, delta=0.976%, train=2.738%(4.643%), skip ratio=0%
    Iteration 3592: GROUND  TRUTH : ىلعو ٧ نب ٦ ةكراشملا ٥ خيرات ٨ عيطتست ٩ ىلعألا
    Iteration 3592: BEST OCR TEXT : ىلاو ٧ نب ٩ ةكراشملا ٥ خيرقت ٨ عيقطتست ٩ ىلعأل
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 8 :
    Mean rms=0.596%, delta=0.979%, train=2.751%(4.689%), skip ratio=0%
    Iteration 3593: GROUND  TRUTH : هيلع ٨ دئاصق ٦ لئاسرلا ٧ برغملا ٥ نيطسلف ١ يه ٣ ماظنلا ٩ تاكراشم
    Iteration 3593: BEST OCR TEXT : هيلع ٨ دئاضق ٩ لئاسرلا ٧ برتملا ٥ نيطسلفا ٢ يه ٣ ماظنلا ٩ تاكراشم
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 9 :
    Mean rms=0.599%, delta=0.984%, train=2.765%(4.722%), skip ratio=0%
    Iteration 3594: GROUND  TRUTH : / ٩ ةديدج ٦ يذلا نإ ال ٧ سلجم ٩ هب ٠ ىلوألا ٥ روصلا ٨ لا راوزلا
    Iteration 3594: BEST OCR TEXT : /
    ٩ ةديدج ٦ يذلا نإ ال ٧ سدجم ٩ هب ٠ ىلوألا ٨ روصلا ٨ لا راولا
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 10 :
    Mean rms=0.601%, delta=0.987%, train=2.773%(4.739%), skip ratio=0%
    Iteration 3595: GROUND  TRUTH : عيضاوم ٨ تاكراشم ٥ انب ٣ تانب ٧ رابخأ ٠ ىلع ٦ ريغ اذه دقو لكشب ٩
    Iteration 3595: BEST OCR TEXT : عيضاوم ٨ تاكراشم ٥ انب ٣ تانب ٧ رايخأ ٠ ىلع ٦ ريغ اذه دقو لكشب ٩
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 11 :
    Mean rms=0.602%, delta=0.988%, train=2.777%(4.744%), skip ratio=0%
    Iteration 3596: GROUND  TRUTH : خيشلا ٩ ثحبلا ٨ رييغت ٦ نيب ١ مسا ءزجلا ٧ يف لالخ ٥ عوضوملا
    Iteration 3596: BEST OCR TEXT : خيللا ٩ ثحبلا ٨ ريغت ٦ نيب ١ مسا ءزجلا ٧ يف لالخ ٥ عوضوملا
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 12 :
    Mean rms=0.603%, delta=0.99%, train=2.782%(4.758%), skip ratio=0%
    Iteration 3597: GROUND  TRUTH : موي ٦ نوكي نم ٨ ةيزم١ رلا ٥ىتح ٩ جمارب ٣ زكرم ٧ نأ ٠ عقوملا ريغ
    Iteration 3597: BEST OCR TEXT : موي ٦ نوكج نم ٨ ةيزم١ رلا ٥وغح ٦ جمارب ٣ زكرم ٧ نأ ٠ عقوملا ريغ
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 13 :
    Mean rms=0.605%, delta=0.993%, train=2.794%(4.775%), skip ratio=0%
    Iteration 3598: GROUND  TRUTH : نم غلبي ٢ نودجاوتملا ٣ ةدهاشم ١ ظفح ٤ تاكراشملا ٠ ةطساوب
    Iteration 3598: BEST OCR TEXT : ني علبيب ٣ نوضجاوتملا ٣ ةداضشم ١ ثفنح ٤ تاكراشملا ٠ ةطساوب
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 14 :
    Mean rms=0.608%, delta=1%, train=2.819%(4.825%), skip ratio=0%
    Iteration 3599: GROUND  TRUTH : يصخشلا ٨ دمحم ٥ ءاوح ١ جمارب هل ٦ ةروصلا و ٧ ماظن ٩ ماع ناكو
    Iteration 3599: BEST OCR TEXT : يصخشلا ٨ دمحم ٥ ءاوح ١ جمارب هل ١ ةروصلا و ٧ ماظنن ٩ ماع نقكر
    File /home/mobeen/customtrain/aratrain/ara.Arial.exp0.lstmf line 15 :
    Mean rms=0.61%, delta=1.002%, train=2.831%(4.844%), skip ratio=0%
    At iteration 2182/3600/3600, Mean rms=0.61%, delta=1.002%, char train=2.831%, word train=4.844%, skip ratio=0%,  New worst char error = 2.831 wrote checkpoint.

    Finished! Error rate = 0.064
    As you can see, it still reads AEN as AEN, not AWN.
Am I doing something wrong? What should I do?
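
One way to narrow this down is to check the GROUND TRUTH lines themselves, since those come from the transcriptions stored in the .lstmf files and not from the model. The Python sketch below is an illustration under assumptions, not part of the original post: it assumes the log format shown above and that AEN covers U+0660 to U+0669, and `truth_lines_with_aen` is a hypothetical helper name. If it reports any lines, that would suggest the .lstmf files were not built from the edited box files.

```python
import re

# Eastern Arabic digits, assumed to be the code points U+0660..U+0669.
AEN_DIGIT = re.compile("[\u0660-\u0669]")
# Matches the ground-truth lines of the lstmtraining log shown above,
# tolerating one or more spaces between the two words.
TRUTH_LINE = re.compile(r"GROUND\s+TRUTH\s*:")


def truth_lines_with_aen(log_text: str) -> list[str]:
    """Return the ground-truth log lines that still contain Eastern Arabic digits."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if TRUTH_LINE.search(line) and AEN_DIGIT.search(line)
    ]
```

For example, passing the contents of basetrain.log to `truth_lines_with_aen` and getting a non-empty list back would mean the training targets are still AEN.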


