Need Help To Train Tesseract for Urdu Language

Qurat-ul-Ain Akram

Nov 3, 2008, 2:23:04 AM

to tesser...@googlegroups.com

Hi all

I am working on Urdu OCR and came across Tesseract, which I tried to train on Urdu characters. The training instructions say that it does not support right-to-left writing. I nevertheless tried training on the basic Urdu alphabet as follows:

1. I created a text file of the characters, UrduCharacters.txt, in UTF-8 encoding.

2. From it I produced a TIF image, saved as UrduCharacters.tif.

3. I ran the tesseract command to make the box file, in two variants:

Command 1: tesseract UrduCharacters.tif UrduCharacters batch.nochop makebox

Command 2: tesseract UrduCharacters.tif UrduCharacters -l urd batch.nochop makebox

I have tried both commands. With the second one, an error occurs with the message "Unable to locate Urdunichaset file".

With the first one, a box file is generated, but it contains only four characters: ~, 7, 7 and !. If anyone has any idea about this, please let me know.
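For anyone checking a similar result, here is a minimal Python sketch for inspecting a generated box file. It assumes the plain box-file layout of one "character left bottom right top" entry per line, and it treats anything outside the Unicode Arabic block (U+0600 to U+06FF) as suspicious. The helper names and the sample coordinates are hypothetical and only illustrate the four characters reported above; this is not part of Tesseract itself.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file text, assuming 'char left bottom right top' per line."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


def is_arabic_script(label: str) -> bool:
    """True if every code point lies in the Arabic block U+0600..U+06FF."""
    return all(0x0600 <= ord(c) <= 0x06FF for c in label)


def non_urdu_boxes(boxes: List[Box]) -> List[Box]:
    """Return the boxes whose label is not written in Arabic script."""
    return [b for b in boxes if not is_arabic_script(b.char)]


# Hypothetical sample: the four characters reported above, made-up coordinates.
sample = "~ 10 20 30 40\n7 35 20 50 40\n7 55 20 70 40\n! 75 20 80 40\n"
boxes = parse_box_lines(sample)
suspicious = non_urdu_boxes(boxes)
print([b.char for b in suspicious])
```

If every box turns up as suspicious, as in this sample, the box labels have to be corrected by hand before the later training steps. Note that Urdu text can also use code points outside this one block, so the range check is only a rough filter.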

Regards

Ainie

74yrs old

Nov 3, 2008, 6:25:38 AM

to tesser...@googlegroups.com

Eight data files have to be generated. Please visit the Tesseract wiki, where generating the data files is explained in detail. At present Tesseract supports only left-to-right text. If you succeed in generating the data files, you will have to read the output in the opposite direction, i.e. left to right.
cheers
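One way to read the suggestion about direction is to reverse the character order of each output line after recognition. The sketch below is only an illustration of that idea in Python. The function name is hypothetical, and the naive reversal is an assumption that holds only for lines made up entirely of right-to-left base characters: it would also flip embedded digits or Latin text and misplace combining marks.

```python
def reverse_line_order(ocr_text: str) -> str:
    """Reverse the characters of each line, keeping the line order intact."""
    return "\n".join(line[::-1] for line in ocr_text.split("\n"))


# Placeholder ASCII input, used only to make the reversal easy to see.
print(reverse_line_order("cba\nfed"))
```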

Qurat-ul-Ain Akram

Nov 3, 2008, 6:40:53 AM

to tesser...@googlegroups.com

Thanks for your immediate reply.

I followed the instructions given on the wiki, but failed at the very first step, generating the box files. This is the main problem and the reason I cannot proceed further. I need a developer's help to tell me whether there is a problem in my procedure, or where I have to change the code so that Tesseract can generate the box file with the Urdu character set.

74yrs old

Nov 3, 2008, 7:32:40 AM

to tesser...@googlegroups.com

Why not try the bbt tool?

Ray Smith

Nov 5, 2008, 12:15:30 PM

to tesser...@googlegroups.com

It looks like it didn't like your image. Can you upload or attach it?

Command line 1 is correct.

Ray.

Anonymous Khan

Nov 12, 2016, 2:05:27 PM

to tesseract-ocr

Ainie, if you have finished your Urdu OCR, could you please send me your work at k13...@nu.edu.pk? I need it as part of my Final Year Project. I would be very grateful to you.
