Hello all,
I am training Tesseract to recognize text in specific images taken with a cell phone camera. I plan to create a new "language" and two new fonts for this training. In theory this should be simple, but in practice I get lower accuracy with my new .traineddata than with the standard eng.traineddata, and the more images I use for training, the lower the accuracy becomes.
The text in the images varies in boldness and noise level. I have tried cleaning the images with ImageMagick (300 DPI density, black and white).
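For reference, here is a minimal Python/Pillow sketch of the kind of cleanup I do (it only approximates the ImageMagick step; the file names and the fixed threshold of 128 are placeholders, not my exact settings):

    from PIL import Image

    def binarize(in_path, out_path, threshold=128):
        # Convert a phone photo to grayscale, then to pure black and white.
        gray = Image.open(in_path).convert("L")
        bw = gray.point(lambda p: 255 if p > threshold else 0, mode="1")
        # Tag the output with 300 DPI metadata for Tesseract.
        bw.save(out_path, dpi=(300, 300))

    binarize("photo_02.jpg", "photo_02_bw.png")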
[Three sample images were attached here.]
Notice that the image in the middle (no. 2) has bolder letters than the others; its white area was cleared out because of the noise.
Here's what I've done:
1. Added a word-dawg file containing the common words that appear in the images (a build sketch is shown after this list).
2. Added a unicharambigs file covering common confusions such as VV for W (an example is shown after this list).
3. Selected only the good letter samples for the model; noisy letters were not included in the training.
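For point 1, this is roughly how I build the word dawg (a Python sketch that calls Tesseract's wordlist2dawg tool; "mylang" and the file names are placeholders for my actual language code):

    import subprocess

    # Build mylang.word-dawg from a plain word list, using the unicharset
    # produced during training.
    subprocess.run(
        ["wordlist2dawg", "mylang.wordlist", "mylang.word-dawg", "mylang.unicharset"],
        check=True,
    )

The resulting word-dawg then gets packed into mylang.traineddata with combine_tessdata.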
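For point 2, this is the kind of entry I put in the unicharambigs file, generated from Python (assuming I have the v1 format right: tab-separated fields with the length of the matched string, its space-separated unichars, the length of the replacement, the replacement unichars, and a type flag, 0 = optional):

    # Write a small v1 unicharambigs file with the VV -> W rule.
    # "mylang.unicharambigs" is a placeholder file name.
    rules = [
        (2, "V V", 1, "W", 0),   # "VV" in the output may really be a "W"
    ]
    with open("mylang.unicharambigs", "w", encoding="utf-8") as f:
        f.write("v1\n")  # format version header
        for n_src, src, n_rep, rep, kind in rules:
            f.write(f"{n_src}\t{src}\t{n_rep}\t{rep}\t{kind}\n")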
Please suggest what else I can do to get higher accuracy. Thanks in advance.
Regards,
Victoria