Create boxfile and unicharset for RTL language

Ava Nimaee

Aug 31, 2017, 5:09:58 AM
to tesseract-ocr
Hi, I need your help.
I need to create a box file and a unicharset for the Persian language. I used the same syntax that I used for Latin, but the results come out reversed. Could you please tell me how to do this?
Thanks

ShreeDevi Kumar

Aug 31, 2017, 9:02:59 AM
to tesser...@googlegroups.com
Use tesstrain.sh for training.

It should apply the appropriate RTL flags for the Persian language.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/42bf0393-8b56-43c2-b88d-af68b4967c71%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
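
For reference, a tesstrain.sh run for Persian might be set up roughly as in the sketch below. It only builds and prints the command line, it does not run anything. The language code `fas` and the option names follow the Tesseract 4 LSTM training tutorial; every directory path is a placeholder, and the helper name `build_tesstrain_command` is made up for this example.

```python
import shlex


def build_tesstrain_command(lang, langdata_dir, tessdata_dir, fonts_dir, output_dir):
    """Return the argument list for a tesstrain.sh run (hypothetical helper)."""
    return [
        "tesstrain.sh",
        "--lang", lang,
        "--linedata_only",               # LSTM training: only produce line data
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--fonts_dir", fonts_dir,
        "--output_dir", output_dir,
    ]


# All paths below are placeholders; adjust them to the local checkout.
cmd = build_tesstrain_command(
    lang="fas",
    langdata_dir="./langdata",
    tessdata_dir="./tessdata",
    fonts_dir="/usr/share/fonts",
    output_dir="./fas_train",
)

# Print a copy-pasteable shell command instead of executing it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

No RTL option appears on the command line here: tesstrain.sh appears to pick up per-language settings, including the RTL handling for `fas`, from its own language-specific script.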

Ava Nimaee

Sep 1, 2017, 1:59:58 PM
to tesseract-ocr
As I understand it, the only difference between an RTL language and an LTR one is in the unicharset.
I created the unicharset with its tool, but how can I create the xheight for Persian?
Here is my unicharset after converting it to RTL:
36
NULL 0 NULL 0
Joined 7 0,69,188,255,486,1218,0,30,486,1188 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,69,186,255,892,2138,0,80,892,2058 Common 2 10 2 |Broken|0|1 # Broken
س‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 3 13 3 س‍ # س‍ [633 200d ]x
‍ل‍ 1 0,255,0,255,0,0,0,0,0,0 Inherited 4 18 4 ‍ل‍ # ‍ل‍ [200d 644 200d ]x
‍ا 1 0,255,0,255,0,0,0,0,0,0 Inherited 5 18 5 ‍ا # ‍ا [200d 627 ]x
م 1 0,64,134,241,51,272,0,46,56,313 Arabic 6 13 6 م # م [645 ]x
ع‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 7 13 7 ع‍ # ع‍ [639 200d ]x
‍ی‍ 1 0,255,0,255,0,0,0,0,0,0 Inherited 8 18 8 ‍ی‍ # ‍ی‍ [200d 6cc 200d ]x
‍ک‍ 1 0,255,0,255,0,0,0,0,0,0 Inherited 9 18 9 ‍ک‍ # ‍ک‍ [200d 6a9 200d ]x
‍م 1 0,255,0,255,0,0,0,0,0,0 Inherited 10 18 10 ‍م # ‍م [200d 645 ]x
م‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 11 13 11 م‍ # م‍ [645 200d ]x
‍ه‍ 1 0,255,0,255,0,0,0,0,0,0 Inherited 12 18 12 ‍ه‍ # ‍ه‍ [200d 647 200d ]x
‍ذ 1 0,255,0,255,0,0,0,0,0,0 Inherited 13 18 13 ‍ذ # ‍ذ [200d 630 ]x
ا 1 26,117,200,255,11,181,7,82,33,222 Arabic 14 13 14 ا # ا [627 ]x
ک‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 15 13 15 ک‍ # ک‍ [6a9 200d ]x
‍ج‍ 1 0,255,0,255,0,0,0,0,0,0 Inherited 16 18 16 ‍ج‍ # ‍ج‍ [200d 62c 200d ]x
ی‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 17 13 17 ی‍ # ی‍ [6cc 200d ]x
‍ی 1 0,255,0,255,0,0,0,0,0,0 Inherited 18 18 18 ‍ی # ‍ی [200d 6cc ]x
ش‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 19 13 19 ش‍ # ش‍ [634 200d ]x
‍م‍ 1 0,255,0,255,0,0,0,0,0,0 Inherited 20 18 20 ‍م‍ # ‍م‍ [200d 645 200d ]x
ل‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 21 13 21 ل‍ # ل‍ [644 200d ]x
‍ن 1 0,255,0,255,0,0,0,0,0,0 Inherited 22 18 22 ‍ن # ‍ن [200d 646 ]x
‍ب‍ 1 0,255,0,255,0,0,0,0,0,0 Inherited 23 18 23 ‍ب‍ # ‍ب‍ [200d 628 200d ]x
‍ز 1 0,255,0,255,0,0,0,0,0,0 Inherited 24 18 24 ‍ز # ‍ز [200d 632 ]x
‍ت 1 0,255,0,255,0,0,0,0,0,0 Inherited 25 18 25 ‍ت # ‍ت [200d 62a ]x
. 10 12,108,64,140,18,52,9,77,52,193 Common 26 6 26 . # . [2e ]p
و 1 0,68,137,238,65,290,0,27,62,256 Arabic 27 13 27 و # و [648 ]x
ن‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 28 13 28 ن‍ # ن‍ [646 200d ]x
‍س‍ 1 0,255,0,255,0,0,0,0,0,0 Inherited 29 18 29 ‍س‍ # ‍س‍ [200d 633 200d ]x
ن 1 0,88,163,255,68,321,0,52,76,354 Arabic 30 13 30 ن # ن [646 ]x
ب‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 31 13 31 ب‍ # ب‍ [628 200d ]x
‍و 1 0,255,0,255,0,0,0,0,0,0 Inherited 32 18 32 ‍و # ‍و [200d 648 ]x
پ‍ 1 0,255,0,255,0,0,0,0,0,0 Arabic 33 13 33 پ‍ # پ‍ [67e 200d ]x
‍ر 1 0,255,0,255,0,0,0,0,0,0 Inherited 34 18 34 ‍ر # ‍ر [200d 631 ]x
ی 1 0,71,148,225,95,253,0,45,103,279 Arabic 35 13 35 ی # ی [6cc ]x
But "Inherited" doesn't have any unicharset in langdata, and without it the training is not very good.
For example, I fine-tuned for "لا", which is part of "Inherited".
Could you please tell me how I can create the xheight for a Persian font, and also explain "Inherited" and the appropriate RTL flags for the Persian language?
Thanks
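
On the "Inherited" rows in the unicharset above: each of those entries starts with U+200D (zero width joiner), and the script and direction columns appear to be derived from the first code point of the entry. U+200D has script Inherited and bidi class BN, which would explain `Inherited ... 18`, while entries that start with an Arabic letter get `Arabic ... 13`. The sketch below splits one unicharset line into its columns and checks the direction against Python's `unicodedata`. The field names and the partial direction table are assumptions based on the unicharset file documentation and ICU's `UCharDirection` numbering, not something stated in this thread.

```python
import unicodedata
from typing import NamedTuple, Optional

# Partial map from ICU UCharDirection numbers to bidi class names (assumed numbering).
ICU_DIRECTION = {0: "L", 1: "R", 6: "CS", 10: "ON", 13: "AL", 17: "NSM", 18: "BN"}


class UnicharEntry(NamedTuple):
    unichar: str
    properties: int   # bit mask, written as hex in the file
    metrics: tuple    # ten glyph metric bounds
    script: str
    other_case: int
    direction: int
    mirror: int
    normed: str


def parse_unicharset_line(line: str) -> Optional[UnicharEntry]:
    """Parse one full-format unicharset line; return None for short lines."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) < 8:   # e.g. the count line or "NULL 0 NULL 0"
        return None
    return UnicharEntry(
        unichar=fields[0],
        properties=int(fields[1], 16),
        metrics=tuple(int(v) for v in fields[2].split(",")),
        script=fields[3],
        other_case=int(fields[4]),
        direction=int(fields[5]),
        mirror=int(fields[6]),
        normed=fields[7],
    )


# Two rows from the unicharset above, written with escapes so the joiner is visible.
samples = [
    "\u0633\u200d 1 0,255,0,255,0,0,0,0,0,0 Arabic 3 13 3 \u0633\u200d # \u0633\u200d [633 200d ]x",
    "\u200d\u0644\u200d 1 0,255,0,255,0,0,0,0,0,0 Inherited 4 18 4 \u200d\u0644\u200d # \u200d\u0644\u200d [200d 644 200d ]x",
]

for line in samples:
    entry = parse_unicharset_line(line)
    first = entry.unichar[0]
    print(
        f"U+{ord(first):04X}",
        entry.script,
        entry.direction,
        ICU_DIRECTION.get(entry.direction, "?"),
        unicodedata.bidirectional(first),
    )
```

The rows whose metrics read `0,255,0,255,0,0,0,0,0,0` look like they carry default placeholder values rather than measured font metrics, which may be related to the xheight question.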