Train Tesseract 4.0 for Urdu Nastaleeq fonts


gupta....@tftus.com

Oct 24, 2018, 2:59:28 AM
to tesseract-ocr
I am trying to train Tesseract for Urdu Nastaleeq fonts. I took 10 text files (1 MB in total) and gave them to the jTesseract editor to create box files and then a traineddata file. But it gives an error: "Error: unichar ╪¿╪¼╪º in normproto file is not in unichar set". The output is also very inaccurate. Can somebody help me with this?

Zdenko Podobny

Oct 24, 2018, 3:01:09 AM
to tesser...@googlegroups.com



gupta....@tftus.com

Oct 24, 2018, 3:04:21 AM
to tesseract-ocr
Yes, I did. But it's not working out for me.

Zdenko Podobny

Oct 24, 2018, 3:12:34 AM
to tesser...@googlegroups.com
What did not work for you?

Zdenko



Shubham Gupta

Oct 24, 2018, 3:20:59 AM
to tesser...@googlegroups.com
I am using an automated way of generating the files (the .unicharset file, .normproto file, inttemp file, etc.) that ultimately produce the traineddata file. I am using the JTesseract Editor utility.
I gave it my text files and it generated the traineddata and the other files for me. But when I give it a Nastaleeq input file, it gives "Error: unichar ╪¿┌⌐█î in normproto file is not in unichar set" and generates garbage output.
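
A note on the garbled characters in the error messages: they look like UTF-8 text displayed by a console using code page 437. That is an assumption, not something stated in this thread. A minimal Python sketch under that assumption (the helper name `undo_cp437_mojibake` is hypothetical) recovers the underlying code points:

```python
# Assumption: the tool wrote the unichar as UTF-8 bytes, and the console
# rendered those bytes with code page 437 (a common Windows default).
def undo_cp437_mojibake(garbled: str) -> str:
    """Map the garbled text back to CP437 bytes, then decode them as UTF-8."""
    return garbled.encode("cp437").decode("utf-8")


# The two strings quoted in the error messages in this thread.
first = undo_cp437_mojibake("╪¿╪¼╪º")
second = undo_cp437_mojibake("╪¿┌⌐█î")

for text in (first, second):
    # Print code points rather than the text itself, so the output stays
    # readable on consoles that cannot render Arabic script.
    print([f"U+{ord(ch):04X}" for ch in text])
```

If the assumption holds, each reported "unichar" is a sequence of three Arabic-script letters rather than a single character, which may help when comparing the box files against the unicharset.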

Zdenko Podobny

Oct 24, 2018, 3:27:15 AM
to tesser...@googlegroups.com
If you want to train for version 4.0, you should follow the training instructions for version 4.00.
If you decide to go your own way, that is fine, but please do not claim that the official instructions do not work for you or that the output is very inaccurate.

Zdenko

