Need Help To Train Tesseract for Urdu Language


Qurat-ul-Ain Akram

Nov 3, 2008, 2:23:04 AM
to tesser...@googlegroups.com
Hi all
 
I am working on Urdu OCR and came to know about Tesseract, so I tried to train it on Urdu characters. The instructions for the training procedure say that it does not support right-to-left writing. I tried training on the simple Urdu alphabet as follows:
 
1. I created a text file of the characters, UrduCharacters.txt, with UTF-8 encoding.
2. From it I produced a TIF image and saved it as UrduCharacters.tif.
3. I ran the tesseract command to make the box file, in two variants:
      1. tesseract UrduCharacters.tif UrduCharacters batch.nochop makebox
      2. tesseract UrduCharacters.tif UrduCharacters -l urd batch.nochop makebox
I have tried both commands. With the second one, an error occurs with the message "Unable to locate Urdunichaset file".
With the first one, a box file is generated, but it contains only four characters: ~, 7, 7, !. If anyone has any idea about this, please let me know.
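
For anyone checking a similar result: as far as I know, the box file written by makebox is plain text with one line per symbol, giving the symbol followed by its left, bottom, right and top pixel coordinates (later Tesseract versions add a page number). A minimal Python sketch, assuming that layout, that compares the symbols in a box file with the characters that were expected. The function names, the coordinates and the four Urdu letters are made up for illustration and are not part of Tesseract:

```python
def parse_box_lines(text):
    """Return the symbols listed in box-file text, one symbol per line."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        # Assumed layout: symbol left bottom right top [page]
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        symbols.append(parts[0])
    return symbols


def compare_symbols(expected, found):
    """Return (missing, unexpected) symbols, each as a sorted list."""
    missing = sorted(set(expected) - set(found))
    unexpected = sorted(set(found) - set(expected))
    return missing, unexpected


# Hypothetical box file holding the four symbols reported above;
# the coordinates are invented for the example.
sample_box = (
    "~ 10 20 30 40\n"
    "7 35 20 50 40\n"
    "7 55 20 70 40\n"
    "! 75 20 80 40\n"
)
# Hypothetical expected characters: alef, beh, peh, teh.
expected_chars = ["\u0627", "\u0628", "\u067e", "\u062a"]

found = parse_box_lines(sample_box)
missing, unexpected = compare_symbols(expected_chars, found)
print("found:", found)
print("missing:", missing)
print("unexpected:", unexpected)
```

If every expected character shows up as missing, as it does here, the box file was probably produced by recognition with a different language's data, so the symbols would need to be corrected by hand before training.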
 
 
Regards
Ainie

74yrs old

Nov 3, 2008, 6:25:38 AM
to tesser...@googlegroups.com
Eight data files have to be generated. Please visit the Tesseract wiki, where generating the data files is explained in detail. At present Tesseract supports left-to-right text only. If you succeed in generating the data files, you will have to read the output in the opposite direction, i.e. left to right.
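
A rough Python sketch of the "read in the opposite direction" idea, assuming the recognized text comes out one character at a time in left-to-right visual order, so that each line has to be reversed to get the right-to-left reading order. The function name is hypothetical, and the sketch ignores combining marks, digits and mixed-direction text, which would need more care:

```python
def reverse_visual_lines(text):
    """Reverse the characters of each line, keeping the line order."""
    return "\n".join(line[::-1] for line in text.split("\n"))


# The word "Urdu" with its letters in visual left-to-right order
# (waw, dal, reh, alef), written as escapes to avoid display confusion.
visual = "\u0648\u062f\u0631\u0627"
logical = reverse_visual_lines(visual)
print(logical)
```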
cheers

Qurat-ul-Ain Akram

Nov 3, 2008, 6:40:53 AM
to tesser...@googlegroups.com
Thanks for your immediate reply.
I followed the instructions given on the wiki, but they fail at the very first step, generating the box files. This is the main problem and the reason I cannot proceed further. I need a developer's help to tell me whether there is a problem in my procedure, or where I have to change the code so that Tesseract can generate the box file with the Urdu character set.

74yrs old

Nov 3, 2008, 7:32:40 AM
to tesser...@googlegroups.com
Why not try the bbt tool?

Ray Smith

Nov 5, 2008, 12:15:30 PM
to tesser...@googlegroups.com
It looks like it didn't like your image. Can you upload or attach it?
Command line 1 is correct.
Ray.

Anonymous Khan

Nov 12, 2016, 2:05:27 PM
to tesseract-ocr
Ainie, if you have finished your Urdu OCR, could you please send me your work at k13...@nu.edu.pk? I need it as part of my Final Year Project. I would be very grateful to you.