Hello all,
I am training Tesseract to recognize text in specific images taken with a cell phone camera. I plan to create a new "language" and two new fonts for this training. In theory this should be simple, but in practice I get lower accuracy with my new .traineddata than with the standard eng.traineddata, and the more images I use for training, the lower the accuracy becomes.
The text in the images varies in boldness and noise level. I have tried cleaning the images with ImageMagick (300 DPI density, black and white).
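For reference, here is a minimal Python/Pillow sketch of the kind of cleanup I do (it only approximates the ImageMagick step; the file names and the fixed threshold of 128 are placeholders, not my exact settings):

    from PIL import Image

    def binarize(in_path, out_path, threshold=128):
        # Convert a phone photo to grayscale, then to pure black and white.
        gray = Image.open(in_path).convert("L")
        bw = gray.point(lambda p: 255 if p > threshold else 0, mode="1")
        # Tag the output with 300 DPI metadata for Tesseract.
        bw.save(out_path, dpi=(300, 300))

    binarize("photo_02.jpg", "photo_02_bw.png")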
[Three sample images were attached here.]
Notice that the image in the middle (no. 2) has bolder letters than the others; its white area was cleared out because of the noise.
Here's what I've done:
1. Added a word-dawg file containing the common words that appear in the images (a build sketch is shown after this list).
2. Added a unicharambigs file covering common confusions such as VV for W (an example is shown after this list).
3. Selected only the good letter samples for the model; noisy letters were not included in the training.
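For point 1, this is roughly how I build the word dawg (a Python sketch that calls Tesseract's wordlist2dawg tool; "mylang" and the file names are placeholders for my actual language code):

    import subprocess

    # Build mylang.word-dawg from a plain word list, using the unicharset
    # produced during training.
    subprocess.run(
        ["wordlist2dawg", "mylang.wordlist", "mylang.word-dawg", "mylang.unicharset"],
        check=True,
    )

The resulting word-dawg then gets packed into mylang.traineddata with combine_tessdata.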
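For point 2, this is the kind of entry I put in the unicharambigs file, generated from Python (assuming I have the v1 format right: tab-separated fields with the length of the matched string, its space-separated unichars, the length of the replacement, the replacement unichars, and a type flag, 0 = optional):

    # Write a small v1 unicharambigs file with the VV -> W rule.
    # "mylang.unicharambigs" is a placeholder file name.
    rules = [
        (2, "V V", 1, "W", 0),   # "VV" in the output may really be a "W"
    ]
    with open("mylang.unicharambigs", "w", encoding="utf-8") as f:
        f.write("v1\n")  # format version header
        for n_src, src, n_rep, rep, kind in rules:
            f.write(f"{n_src}\t{src}\t{n_rep}\t{rep}\t{kind}\n")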
Please suggest what else I can do to get higher accuracy. Thanks in advance.
Regards,
Victoria