Training Tesseract 4.0 with own data (fine-tuning): convergence and accuracy problems


Alex

Feb 21, 2017, 11:59:27 AM
to tesseract-ocr

Hi,


I have been trying to train Tesseract 4.0 with my own data in order to extract text that is a mix of natural-language words and domain-specific (non-natural-language) words (acronyms, identifiers, abbreviations). The standard Tesseract model has trouble recognizing the domain-specific words: “visual” words from the source are either dropped entirely or recognized with parts missing. So I decided to train my own model.


I went through the tutorials and set up a number of experiments, but so far with no real success. I could fix the problem of entirely dropped words by lowering the hard-coded confidence threshold, and I had partial success in recognizing domain-specific words, but the accuracy on natural-language words went down.


I have made two observations so far in the following experiments:


  1. In Experiment 1 I used the available data as-is for training (~1 M tokens, ~150 fonts). I then generated an evaluation dataset of another ~200 k tokens in the ~15 most relevant fonts. I trained the model by replacing the top layer of the existing Tesseract traineddata, as described at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer. Training converged a couple of days later, and I evaluated the model on a held-out dataset with a gold standard (tiff – plain txt). The accuracy was lower than with the standard Tesseract model: the new model could recognize some (not all) domain-specific words, but performance on natural-language words went down (where the standard model had worked fine). So I analyzed the errors and designed another experiment to address them; in my opinion they were caused by data skewness, i.e. confusions between characters in rare and complex contexts.
  2. In Experiment 2 I used the entire dataset I have (~120 M tokens) and extracted word and character-bigram statistics. I took all words with frequencies over a certain threshold as the base of the final training set. In addition, I boosted the statistics of words containing low-frequency character bigrams (which had caused me trouble in the previous experiment) and appended them to the final training set, ending up with ~600 k unique training words. These were rendered into tiffs with ~150 fonts; the evaluation set remained natural-language text of ~200 k tokens in the ~15 most relevant fonts. It turns out that training converges too slowly: it has been running for over a week now, with the best model at a ~0.17% error rate. Evaluating pairs of subsequent model snapshots on the held-out dataset showed no consistent improvement of one over another, only random fluctuation between better accuracy on natural-language words and better accuracy on domain-specific words, and vice versa. More interestingly, models with a lower char error rate (< 0.5%) perform worse (especially on natural-language words) than models with a higher char error rate (~0.5%). I also noticed that the model captures “language-modeling features”, which makes recognizing misspelled words, non-natural-language unique identifiers, and acronyms difficult. Moreover, unique identifiers, rare words, etc. remain a big problem: they are recognized in chunks, but not as whole words. Typical trouble cases are “like-this”, “like/this” or “this-or-like-this”.
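The data-selection scheme of Experiment 2 can be sketched roughly as follows. This is only an illustration of the idea, not my actual pipeline; the function name and the two thresholds are made up for the example:

```python
from collections import Counter

def select_training_words(tokens, freq_threshold, rare_bigram_threshold):
    """Keep frequent words, then additionally include ("boost") any word
    that contains a low-frequency character bigram."""
    word_freq = Counter(tokens)

    # character-bigram statistics, weighted by word frequency in the corpus
    bigram_freq = Counter()
    for word, f in word_freq.items():
        for i in range(len(word) - 1):
            bigram_freq[word[i:i + 2]] += f

    # base set: all words above the word-frequency threshold
    base = {w for w, f in word_freq.items() if f >= freq_threshold}

    # boosted set: words containing any rare character bigram
    rare = {bg for bg, f in bigram_freq.items() if f <= rare_bigram_threshold}
    boosted = {w for w in word_freq
               if any(w[i:i + 2] in rare for i in range(len(w) - 1))}

    return base | boosted
```

The resulting word list would then be rendered into training tiffs (e.g. with text2image) across the chosen fonts.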



At this point I doubt that the way I am training Tesseract is correct, so I would like to ask the community the following questions:

  • Should I use natural-language text or a dictionary of words for the training and evaluation datasets?
  • How important is the effect of token redundancy? (Are the errors on natural-language words caused by those words occurring only once in the training data?)
  • How can I get Tesseract to recognize freely generated tokens that do not appear in the training dataset?
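For reference, the char error rates I quote above are computed on the held-out set as edit distance between the gold-standard text and the OCR output, normalized by the length of the gold text. A minimal pure-Python sketch of that metric (function names are mine):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance, O(len(a) * len(b))
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_error_rate(reference, hypothesis):
    # edit distance normalized by gold-text length
    return levenshtein(reference, hypothesis) / max(1, len(reference))
```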


Thanks,

Alex


akmalkady

Apr 1, 2021, 3:42:38 PM
to tesseract-ocr
I wonder what your latest observations are? I am looking for answers to your questions as well.