Following up on the question I asked at the bottom of
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
I tried to train Tesseract for Arabic, using only one training image.
The output was gibberish, but it did contain the first two characters
of the first word, بس.
One problem I found is that the box file I made does not cover all the
words in the picture. Out of everything in the image/file, it only has:
بسم الله الرحمن الرحيم
نعيب الزمان والعيب فينا وما
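To pin down exactly what the box file is missing, something like the rough sketch below could compare it against the training text. It assumes each box line has the form "<char> <left> <bottom> <right> <top>" and that spaces are not boxed; the function names are made up for illustration and are not part of Tesseract.

```python
from collections import Counter


def box_chars(box_text):
    """Count the characters labelled in a box file.

    Assumes each line looks like '<char> <left> <bottom> <right> <top>';
    lines with fewer than five fields are skipped.
    """
    counts = Counter()
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) >= 5:
            # A label may hold more than one code point; count each one.
            counts.update(parts[0])
    return counts


def text_chars(training_text):
    """Count the non-whitespace characters in the training text."""
    return Counter(ch for ch in training_text if not ch.isspace())


def missing_chars(training_text, box_text):
    """Characters (with counts) in the training text that have no box."""
    # Counter subtraction keeps only the positive differences.
    return text_chars(training_text) - box_chars(box_text)


# Tiny made-up example: the box file covers only two of three characters.
sample_text = "بسم"
sample_box = "ب 10 20 30 40\nس 35 20 60 40\n"
print(missing_chars(sample_text, sample_box))
```

With the real files, the two strings would come from reading train1.txt and fontfile.box as UTF-8.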
Another thing: two characters were each split across three lines in the
box file. I merged them the same way we merge two lines. Is that correct?
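To make clear what I mean by merging, here is a minimal sketch of what I did, assuming the box line format is "<char> <left> <bottom> <right> <top>" and that the merged box should simply be the smallest rectangle enclosing all the pieces. The coordinates are invented for the example, and the helper names are my own, not Tesseract's.

```python
def parse_box_line(line):
    """Split '<char> <left> <bottom> <right> <top>' into (char, box).

    Any extra trailing fields are ignored.
    """
    parts = line.split()
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return parts[0], (left, bottom, right, top)


def merge_boxes(char, boxes):
    """Return one box enclosing every (left, bottom, right, top) in boxes."""
    lefts, bottoms, rights, tops = zip(*boxes)
    return char, (min(lefts), min(bottoms), max(rights), max(tops))


def format_box_line(char, box):
    """Turn (char, box) back into a box-file line."""
    return "%s %d %d %d %d" % ((char,) + tuple(box))


# Three fragments of one character (made-up coordinates).
fragments = [
    "ب 10 20 30 40",
    "ب 12 5 18 15",
    "ب 25 42 28 50",
]
boxes = [parse_box_line(line)[1] for line in fragments]
char, merged = merge_boxes("ب", boxes)
print(format_box_line(char, merged))  # ب 10 5 30 50
```

This works the same for two pieces or three, which is why I assumed merging three lines is fine.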
If it isn't entirely hopeless, I will try again.
Thanks.
Mohamed
Files:
http://delicieux.info/fulllog.txt <- log
http://delicieux.info/train1.txt <- training text
http://delicieux.info/fontfile.tif <- training image
http://delicieux.info/fontfile.box <- box file
http://delicieux.info/output.txt <- output of: tesseract fontfile.tif output -l ara