Hi
I am trying to train my own Tesseract model (v4, by replacing the top layer as described in the tutorial). Besides some unexplained OCR problems (see
https://github.com/tesseract-ocr/tesseract/issues/734#issuecomment-299132760), I observe quite big differences when I compare the output of my model with that of one of the standard models.
I trained a model down to the 0.005 convergence level (below the default value of 0.01), and then evaluated it on the small data set it was trained on. The confidence values produced by my model are between 40 and 55 (even for very frequent and unambiguous words), whereas a standard model achieves 80 to 95, with 50 to 70 for visually ambiguous words.
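In case it helps to reproduce the comparison, here is a minimal sketch of how per-word confidences can be summarized per model. It assumes the confidences are taken from Tesseract's TSV output (for example `tesseract image out -l <model> tsv`), which has a `conf` column and a `text` column. The helper names (`word_confidences`, `summarize`, `sample_tsv`) and the sample words and numbers are made up for illustration; they are not my actual results.

```python
import csv
import io
from statistics import mean

# Column names of Tesseract's TSV output.
TSV_COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
               "word_num", "left", "top", "width", "height", "conf", "text"]


def word_confidences(tsv_text):
    """Return (word, confidence) pairs from Tesseract TSV text.

    Rows without text, or with conf < 0 (layout rows), are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf_raw = row.get("conf")
        if not text or conf_raw is None:
            continue
        conf = float(conf_raw)
        if conf >= 0:
            pairs.append((text, conf))
    return pairs


def summarize(pairs):
    """Return word count and min/mean/max confidence for the given pairs."""
    confs = [conf for _, conf in pairs]
    return {"words": len(confs), "min": min(confs),
            "mean": mean(confs), "max": max(confs)}


def sample_tsv(words):
    """Build a tiny TSV string for the demo (hypothetical data, not real output)."""
    lines = ["\t".join(TSV_COLUMNS)]
    # One line-level row: conf is -1 and text is empty, so it must be skipped.
    lines.append("\t".join(["4", "1", "1", "1", "1", "0",
                            "0", "0", "100", "20", "-1", ""]))
    for i, (word, conf) in enumerate(words, start=1):
        lines.append("\t".join(["5", "1", "1", "1", "1", str(i),
                                "0", "0", "40", "20", str(conf), word]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical confidences, only to show the shape of the comparison.
    outputs = {
        "my model": sample_tsv([("the", 48), ("model", 52)]),
        "standard model": sample_tsv([("the", 93), ("model", 87)]),
    }
    for name, tsv_text in outputs.items():
        print(name, summarize(word_confidences(tsv_text)))
```

Running both models on the same images and comparing the two summaries gives the kind of ranges quoted above.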
I was wondering whether you achieve confidence levels close to those of the tessdata models. If so, how did you achieve this? Are the standard Tesseract models overfitted? (Try to OCR a common but misspelled word ;)
Cheers,
Alex