Train Tesseract 4.0 for Urdu Nastaleeq fonts

gupta....@tftus.com

Oct 24, 2018, 2:59:28 AM

to tesseract-ocr

I am trying to train Tesseract for Urdu Nastaleeq fonts. I used 10 text files totaling 1 MB and gave them to the jTesseract editor to create box files and then a traineddata file, but it gives an error: Error: unichar ╪¿╪¼╪º in normproto file is not in unichar set. The resulting OCR output is also very inaccurate. Can somebody help me with this?

Zdenko Podobny

Oct 24, 2018, 3:01:09 AM

to tesser...@googlegroups.com

Did you read https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00?

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3106acc3-fb3f-4816-9a07-a3a31b79c66a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

gupta....@tftus.com

Oct 24, 2018, 3:04:21 AM

to tesseract-ocr

Yes, I did, but it's not working out for me.

Zdenko Podobny

Oct 24, 2018, 3:12:34 AM

to tesser...@googlegroups.com

What did not work for you?

Zdenko

Shubham Gupta

Oct 24, 2018, 3:20:59 AM

to tesser...@googlegroups.com

I am using an automated way of generating the files (.unicharset, .normproto, inttemp, etc.) that ultimately produce the traineddata file. I am using the JTesseract Editor utility.

I gave it my text files and it generated the traineddata and the other files for me. But when I give it a Nastaleeq input file, it reports Error: unichar ╪¿┌⌐█î in normproto file is not in unichar set and generates garbage output.
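
The odd characters in that error look like UTF-8 bytes displayed through a legacy console code page rather than a corrupted file. A minimal sketch for recovering the real characters, assuming the console used code page 437 (an assumption; the thread does not say which code page was active) and using a hypothetical helper name:

```python
def repair_console_mojibake(garbled: str, console_codepage: str = "cp437") -> str:
    """Undo a UTF-8 string that was displayed through a single-byte code page.

    Assumption: the console rendered each UTF-8 byte as one cp437 character.
    """
    # Map the displayed characters back to the raw bytes, then decode as UTF-8.
    return garbled.encode(console_codepage).decode("utf-8")


# The two strings quoted in the error messages in this thread.
for garbled in ("╪¿╪¼╪º", "╪¿┌⌐█î"):
    fixed = repair_console_mojibake(garbled)
    # Print code points only, so the output survives a non-UTF-8 console.
    print(" ".join(f"U+{ord(ch):04X}" for ch in fixed))
```

Under that assumption the first string decodes to U+0628 U+062C U+0627 and the second to U+0628 U+06A9 U+06CC, so each reported "unichar" is a run of three Arabic-script letters.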

Zdenko Podobny

Oct 24, 2018, 3:27:15 AM

to tesser...@googlegroups.com

If you want to train for version 4.0, you should follow the training instructions for version 4.00.

If you decide to go your own way, that is fine, but please do not claim that the official instructions do not work for you or that the output is very inaccurate.
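
As a rough illustration of what following the 4.00 instructions involves, the wiki's synthetic-data route starts from tesstrain.sh and renders the training text with fonts, instead of building box files in an editor. The sketch below only assembles that command line and does not run it; the font name and all directory paths are placeholders, the helper name is hypothetical, and tesstrain.sh is assumed to be on PATH.

```python
import shlex
from typing import List


def build_tesstrain_command(lang: str, fontlist: List[str], fonts_dir: str,
                            langdata_dir: str, tessdata_dir: str,
                            output_dir: str) -> List[str]:
    """Assemble (but do not run) a tesstrain.sh call that emits LSTM line data."""
    return [
        "tesstrain.sh",
        "--lang", lang,
        "--linedata_only",               # produce LSTM training data only
        "--noextract_font_properties",
        "--fonts_dir", fonts_dir,
        "--fontlist", *fontlist,
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
    ]


command = build_tesstrain_command(
    lang="urd",
    fontlist=["Example Nastaleeq Font"],  # placeholder: use an installed font
    fonts_dir="/usr/share/fonts",         # placeholder path
    langdata_dir="./langdata",            # placeholder path
    tessdata_dir="./tessdata",            # placeholder path
    output_dir="./urdtrain",              # placeholder path
)
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be passed to subprocess.run once the placeholders point at real fonts and directories; the subsequent lstmtraining steps are covered in the wiki page linked earlier in the thread.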

Zdenko

