tesseract for handwriting recognition

CMOS

Jan 26, 2008, 2:32:21 AM
to tesseract-ocr
I've heard that Tesseract was originally developed to support
handwriting recognition, but that it was optimized only for OCR. I'm
interested in using it for handwriting recognition as well, so I'd be
glad to learn how to enable ICR in Tesseract. I hope someone can help
me out.
Thanks

Wenjing Jia

Jan 28, 2008, 7:15:33 PM
to tesseract-ocr
I'm not sure about "how to enable ICR in Tesseract". But this might be
an alternative.

I recently retrained Tesseract successfully for my number-plate
recognition project, where there are only 26 English letters, 10
Arabic digits and one dot. The overall correct recognition rate on
test samples (none of which were used for training) improved from
less than 80% to nearly 95% through retraining.

I treated my number-plate characters as "hand-written" characters,
since all of them have suffered shape distortion to some extent, even
though the font is supposed to be standard. I used thousands of
characters captured from number-plate images to retrain Tesseract.
This improved its performance for my case.

Maybe it's worth a try for your application?

CMOS

Jan 28, 2008, 9:01:41 PM
to tesseract-ocr
Thanks for the info.
Without any retraining I tried Tesseract for ICR, and it seems to
perform OK, but it tends to misinterpret some letters. I'm not exactly
sure how you train it, so if you have some information please let me
know.

Wenjing Jia

Jan 29, 2008, 12:16:05 AM
to tesseract-ocr
I simply followed the training procedure in "Training Tesseract"
(http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract). It
turns out it is not that hard, and it has really improved the
performance for my case.

I think the most critical part of successful, useful training is
generating the training images. I manually cut lots of character areas
from number-plate images taken in the real world, resized each to a
similar size as a number plate, thresholded them, and put all of these
number-plate images (containing characters only) into a single-page
image, making sure there was enough inter-line space (the image
eventually becomes very large). The resulting image is stored as a TIFF
and used as the training image. The other steps are just as described
in the instructions. You will need to create your dictionary data by
whatever means you have.
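
The layout step described above (resize, threshold, pack onto one page with generous line spacing) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact procedure used: images are plain lists of rows of grayscale values, and the function names, the 40-pixel character height, and the gap sizes are all hypothetical. Writing the finished page out as a TIFF is left to an imaging library of your choice.

```python
# Sketch only: images are lists of rows of grayscale values (0 = black, 255 = white).
# All names and default sizes below are illustrative assumptions.

def threshold(img, cutoff=128):
    """Binarize: pixels darker than cutoff become 0 (ink), the rest 255 (paper)."""
    return [[0 if p < cutoff else 255 for p in row] for row in img]

def resize_nearest(img, new_h):
    """Scale to a fixed height with nearest-neighbour sampling, keeping aspect ratio."""
    h, w = len(img), len(img[0])
    new_w = max(1, round(w * new_h / h))
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]

def compose_page(crops, page_w, char_h=40, gap=10, line_gap=30, margin=20):
    """Place crops left to right, wrapping to a new text line when a line is full.

    Returns the page and the (x, y) top-left position of every glyph.
    """
    glyphs = [threshold(resize_nearest(c, char_h)) for c in crops]
    positions = []
    x, y = margin, margin
    for g in glyphs:
        w = len(g[0])
        if w > page_w - 2 * margin:
            raise ValueError("glyph is wider than the usable page width")
        if x + w > page_w - margin:
            # Start a new line, leaving line_gap of empty space below the last one.
            x = margin
            y += char_h + line_gap
        positions.append((x, y))
        x += w + gap
    page_h = y + char_h + margin
    page = [[255] * page_w for _ in range(page_h)]
    for g, (gx, gy) in zip(glyphs, positions):
        for r, row in enumerate(g):
            page[gy + r][gx:gx + len(row)] = row
    return page, positions
```

The recorded positions are kept because the character bounding boxes are needed later in training, and knowing where each glyph was placed makes them easier to check.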

For your case, I think your training image would contain lots of (how
many is enough?) preprocessed handwritten characters. The other steps
are just the same.

Ray Smith

Jan 29, 2008, 6:43:28 PM
to tesser...@googlegroups.com
Tesseract was never designed for handwriting, but people have been successful to a limited extent in retraining it for handwriting.

De-italicizing (slant) normalization is a preprocessing technique that might be useful for most handwriting problems.
Anyone care to add one?
Ray.
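
The post above does not say how de-italicizing should be done. One common approach, shown here only as a sketch, estimates the slant from second-order moments of the ink pixels and undoes it with a horizontal shear. It is not taken from Tesseract; the function names are hypothetical, and the image is assumed to be binarized already, with 0 as ink and 255 as background.

```python
# Sketch only: moment-based slant removal on a binary image
# (list of rows, 0 = ink, 255 = background). Names are illustrative.

def slant_factor(img):
    """Estimate horizontal drift per row (in pixels) from central moments of ink."""
    ink = [(x, y) for y, row in enumerate(img) for x, p in enumerate(row) if p == 0]
    if not ink:
        return 0.0
    n = len(ink)
    mx = sum(x for x, _ in ink) / n
    my = sum(y for _, y in ink) / n
    mu_xy = sum((x - mx) * (y - my) for x, y in ink)
    mu_yy = sum((y - my) ** 2 for _, y in ink)
    return mu_xy / mu_yy if mu_yy else 0.0

def deslant(img):
    """Shear each row horizontally so the estimated slant becomes vertical."""
    h, w = len(img), len(img[0])
    k = slant_factor(img)
    cy = (h - 1) / 2  # shear about the middle row
    out = [[255] * w for _ in range(h)]
    for y, row in enumerate(img):
        shift = round(k * (y - cy))
        for x, p in enumerate(row):
            nx = x - shift
            # Ink that would be sheared outside the image is dropped.
            if p == 0 and 0 <= nx < w:
                out[y][nx] = 0
    return out
```

Because ink sheared past the image border is discarded, it helps to pad the character with some background on the left and right before calling `deslant`.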