Hi,
I'm doing handwritten character recognition using Tesseract. I
tried to train the Tesseract exe for my data set on Windows.
I followed the guide on the wiki,
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract , but I could
not get it to work.
These are the steps I followed:
- Downloaded Tesseract 2.04.
- Created a folder named tessdata in that folder.
- Created the following files in the tessdata folder:
  - tessdata/eng.freq-dawg
  - tessdata/eng.word-dawg
  - tessdata/eng.user-words
  - tessdata/eng.inttemp
  - tessdata/eng.normproto
  - tessdata/eng.pffmtable
  - tessdata/eng.unicharset
  - tessdata/eng.DangAmbigs
- Put a TIFF image containing the English letter "a" in the root folder.
- Entered the following command:
  tesseract a.tif fontfile batch.nochop makebox
But this gives the following error:
  read_variables_file:Can't open ./tessdata/configs/makebox
  Unable to load unicharset file ./tessdata/eng.unicharset
Can someone please help me fix this issue?
Thanks in advance.
Regards,
Thilanka.
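The error text names two paths, ./tessdata/configs/makebox and
./tessdata/eng.unicharset, so a first check is whether those files exist
and have any content. Below is a minimal sketch of such a check, assuming
it is run from the folder that contains tessdata. The file list is taken
from the post and the error message; the function name check_tessdata is
hypothetical, and treating an empty file as a problem is an assumption,
not something the Tesseract documentation is quoted on here.

```python
from pathlib import Path

# Paths relative to the tessdata folder. The eng.* names come from the post;
# configs/makebox comes from the error message. Nothing else is assumed.
REQUIRED = [
    "configs/makebox",
    "eng.freq-dawg",
    "eng.word-dawg",
    "eng.user-words",
    "eng.inttemp",
    "eng.normproto",
    "eng.pffmtable",
    "eng.unicharset",
    "eng.DangAmbigs",
]


def check_tessdata(root="."):
    """Return (path, problem) pairs for files that are missing or empty."""
    tessdata = Path(root) / "tessdata"
    problems = []
    for name in REQUIRED:
        path = tessdata / name
        if not path.is_file():
            problems.append((str(path), "missing"))
        elif path.stat().st_size == 0:
            # Assumption: a hand-created empty file cannot be loaded.
            problems.append((str(path), "empty"))
    return problems


if __name__ == "__main__":
    for path, problem in check_tessdata("."):
        print(f"{problem}: {path}")
```

If this prints "missing" for configs/makebox or "empty" for
eng.unicharset, that would at least match the two lines of the error.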
No.
The issues with Chinese and with handwriting are completely different.
With Chinese, the issue is the large character set; with handwriting
(that is, handwritten printed characters, not cursive) it's the
wide range of variation. Write the same sentence 10 times, then look
at the page: no two characters will be exactly alike (think of this
as training on multiple examples from the same font, where you have to
learn the variations). On top of that, handwriting is 'unique'; each
person's handwriting should be thought of as a different font,
and there's no way to train for that.
You may have some luck, but don't be surprised if the results are
dramatically less accurate than for printed text.
Cursive writing has its own set of issues, in particular character
segmentation of joined letters. Tesseract has no support for this type
of segmentation; it already has problems when training from regular
printed pages if there is not enough space between the characters.
(Sriranga, you have encountered this limitation a number of times, if
the issue tracker is anything to go by.)
In summary:
For a single person, with printed characters: you might be lucky.
For multiple people, with printed characters: don't have high expectations.
For cursive: expect close to nothing.
--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.