Training for Kurdish in Arabic script


Shree Devi Kumar

Jan 28, 2020, 2:06:52 AM
to tesseract-ocr
Please see https://github.com/Shreeshrii/tesstrain-ckb. It uses a modified training text based on what you sent and on earlier text that I had from Pewan and other corpora.

Currently the training data includes:
* AWN 0-9
* AEN - Arabic numbers
* No Persian numbers, since some of their shapes are similar to Arabic numbers

The fonts do not include any that convert 0-9 to either Arabic or Persian numbers.
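
To check which digit sets a training text really contains, something like the following Python sketch may help. It assumes that "AWN" means ASCII 0-9 (U+0030 to U+0039), "AEN" means the Arabic-Indic digits (U+0660 to U+0669), and "Persian numbers" means the Extended Arabic-Indic digits (U+06F0 to U+06F9). That mapping is my reading of the labels above, not something stated in the thread, and the function name is made up for illustration.

```python
from collections import Counter

# Assumed mapping of the digit sets discussed above to Unicode ranges.
DIGIT_RANGES = {
    "ascii": (0x0030, 0x0039),         # 0-9
    "arabic_indic": (0x0660, 0x0669),  # Arabic-Indic digits
    "persian": (0x06F0, 0x06F9),       # Extended Arabic-Indic (Persian) digits
}


def count_digit_sets(text: str) -> Counter:
    """Count how many characters of each digit set occur in text."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in DIGIT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    return counts


# "2020" in ASCII, "2020" in Arabic-Indic digits, and one Persian digit four.
sample = "2020 \u0662\u0660\u0662\u0660 \u06f4"
print(dict(count_digit_sets(sample)))
```

A non-zero "persian" count on the training text would mean Persian digits slipped in despite the intent described above.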

The replace layer training is still ongoing. The eval results look much better than those of the official ara or script/Arabic models; however, I do not have any real-world images for testing.

Model                         Metric            Arial   Arial Bold   Tahoma   Tahoma Bold
tessdata_fast/ara             Accuracy          62.74   63.49        61.56    61.71
tessdata_fast/ara             Basic Arabic      95.68   95.22        95.76    94.10
tessdata_fast/ara             Arabic Extended    0.31    1.13         0.41     1.32
tessdata_fast/script/Arabic   Accuracy          80.99   80.83        83.02    77.17
tessdata_fast/script/Arabic   Basic Arabic      96.68   96.34        96.05    93.87
tessdata_fast/script/Arabic   Arabic Extended   57.20   58.23        63.76    54.72
ckbLayer_1.661_152089_296500
ckbLayer_fast                 Accuracy          98.20   97.78        98.06    96.13
ckbLayer_fast                 Basic Arabic      99.10   99.15        98.54    98.44
ckbLayer_fast                 Arabic Extended   98.30   98.70        99.10    96.27
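
The thread does not say which tool produced the accuracy figures above. As a rough illustration only, character accuracy as a percentage is often computed from the edit distance between the ground-truth text and the OCR output. The Python sketch below assumes that definition; the function names are hypothetical and the numbers it gives will not necessarily match the table.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Assumed definition: 100 * (n - errors) / n, n = ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return 100.0 * (len(ground_truth) - errors) / len(ground_truth)


print(char_accuracy("kitten", "sitting"))
```

With this definition the value can go below zero when the OCR output contains many spurious characters, so it is best read as an error-based score rather than a strict percentage of correct characters.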


On Mon, Jan 13, 2020 at 7:17 PM Ayub Rauf wrote:
Hi,
I have attached the full training text, with forbidden_characters in it.
Both number types really are in use, and I see both written in books, but the Kurdish institute has verified that Arabic numbers will be used from now on. Persian numbers are written by Iranian Kurds and Arabic numbers are used by Iraqi Kurds. As I said, numbers in ckb should be written in the Arabic type, but the OCR has to recognize both types.
It is just like the two forms "ك" and "ک", which both appear in books, although now we only use "ک".
I think these similarities won't be a problem, since afterwards we can correct the letters in a spell checker.
As I said before, Arial and Tahoma are the fonts most often used in books.
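
The correction step mentioned above could start with a simple character mapping before any real spell checking. This is only a sketch in Python under my own assumptions: that "ك" is U+0643 and should become "ک" (U+06A9), and that Persian digits (U+06F0 to U+06F9) should become the matching Arabic-Indic digits (U+0660 to U+0669). The table and function names are hypothetical.

```python
# Hypothetical post-OCR normalization table for ckb text.
PERSIAN_TO_ARABIC_DIGITS = {0x06F0 + i: 0x0660 + i for i in range(10)}
KAF_TO_KEHEH = {0x0643: 0x06A9}  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
NORMALIZE = {**PERSIAN_TO_ARABIC_DIGITS, **KAF_TO_KEHEH}


def normalize_ckb(text: str) -> str:
    """Map the variant letter and digit forms to the preferred ones."""
    return text.translate(NORMALIZE)


# Kaf followed by Persian four becomes keheh followed by Arabic-Indic four.
raw = "\u0643\u06f4"
print([hex(ord(ch)) for ch in normalize_ckb(raw)])
```

Because the OCR still has to recognize both forms, a mapping like this would run on the recognized text, not on the training data.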


manu pranay

Feb 1, 2020, 1:03:44 AM
to tesser...@googlegroups.com
Thank you so much for your help, Shree.
The links you provided were very helpful.

Now I am trying to do LSTM training by retraining the top layer.
Can you please provide me with the commands for retraining the top layer?

Thank you very much.
 

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWm%3DXQaxBergf5-OUE-C8jB3u12dSOPUPchRZT4w21Z-g%40mail.gmail.com.

Shree Devi Kumar

Feb 1, 2020, 1:53:55 AM
to tesseract-ocr



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Feb 1, 2020, 2:03:38 AM
to tesseract-ocr
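# Replace-top-layer training: --append_index 5 cuts the existing network above
# layer index 5, and --net_spec gives the new layers that are trained in its place.
# The data/modi and data/mar paths are an example; substitute your own.
lstmtraining \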
  --debug_interval -1 \
  --traineddata data/modi/modi.traineddata \
  --append_index 5 --net_spec "[Lfx128 O1c1]" \
  --continue_from data/mar/modi.lstm \
  --model_output data/modi/checkpoints/modiLayer \
  --train_listfile data/modi/list.train \
  --eval_listfile data/modi/list.eval \
  --max_iterations 999999


manu pranay

Feb 1, 2020, 6:01:42 AM
to tesser...@googlegroups.com
Thank you, Shree.
I have finished retraining the top layer, with a good accuracy rate.
But I wanted to know: how can I find the accuracy as a percentage?
And can you please help me with how to train on handwritten PDFs?
Thank you very much for your help.


Shree Devi Kumar

Feb 1, 2020, 6:43:10 AM
to tesseract-ocr

mit

Jun 14, 2020, 1:50:02 PM
to tesseract-ocr
Hi Shree,

Can we train tesseract for handwritten date?

TIA

Shree Devi Kumar

Jun 14, 2020, 9:16:56 PM
to tesseract-ocr
See https://github.com/tesseract-ocr/tesstrain/wiki for links regarding tesseract training for handwriting


mit

Jun 15, 2020, 12:30:16 AM
to tesseract-ocr