Hi,
I have been trying to train Tesseract 4.0 with my own data in order to extract text that is a mix of natural-language words and domain-specific (non-natural-language) words such as acronyms, identifiers, and abbreviations. The standard Tesseract model has trouble recognizing the domain-specific words: words that are clearly visible in the source image are either dropped entirely or recognized with parts missing. So I decided to train my own model.
I went through the tutorials and set up a number of experiments, but so far without real success. I could fix the problem of entirely dropped words by lowering the hard-coded confidence threshold, and I had partial success in recognizing domain-specific words, but the accuracy on natural-language words went down.
Here are two observations I have made so far in the following experiments:
- In Experiment 1 I used the available data as-is for training (~1 M tokens, ~150 fonts) and generated an evaluation data set of another ~200 k tokens rendered in the ~15 most relevant fonts. I then trained the model by replacing the top layer of the existing Tesseract traineddata, as described in the Replace Top Layer wiki page (https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Replace-Top-Layer). Training converged a couple of days later, and I evaluated the model on a held-out dataset with a gold standard (tiff / plain txt pairs); a simplified sketch of the per-class evaluation follows below, after this list. The accuracy I got was lower than with the standard Tesseract model: the new model recognizes some (not all) domain-specific words, but its performance on natural-language words went down (where the standard model worked fine). So I analyzed the errors and designed another experiment to address them; in my opinion they were caused by data skewness, i.e. confusions between characters in rare and complex contexts.
- In Experiment 2 I used the entire data set I have (~120 M tokens) and extracted word and character-bigram statistics. I took all words with frequencies above a certain threshold into the final training data set. In addition, I boosted the counts of words containing low-frequency character bigrams (which had caused me trouble in the previous experiment) and appended them to the final training data set (a simplified sketch of this selection step also follows below). In the end this gave a training data set of ~600 k unique words, which I rendered into tiffs with ~150 fonts; the evaluation data set remained natural-language text of ~200 k tokens in the ~15 most relevant fonts. It turned out that training converges too slowly: it has been running for over a week now, with the best model at a ~0.17% error rate. Evaluating pairs of subsequent model snapshots on the held-out dataset showed no general improvement of one over the other, only random fluctuations between more accurate natural-language words vs. more accurate domain-specific words. More interestingly, models with a lower character error rate (below 0.5%) perform worse (especially on natural-language words) than models with a higher character error rate (around 0.5%). I also noticed that the model captures "language modeling features", which makes the recognition of misspelled words, non-natural-language unique identifiers, and acronyms difficult. Moreover, unique identifiers, rare words, etc. remain a big problem: they can already be recognized in chunks, but not as whole words. Typical trouble cases are "like-this", "like/this" or "this-or-like-this".
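To make concrete what I mean by comparing natural-language and domain-specific words, here is a simplified sketch of the per-class word-accuracy check. The file names, the regex used to flag domain-specific tokens, and the bag-of-words matching are placeholder simplifications for illustration, not my actual evaluation script:

import re
from collections import Counter

# Placeholder heuristic: treat tokens with digits, separators, or
# all-caps runs as "domain-specific"; everything else as natural language.
DOMAIN_TOKEN = re.compile(r"[A-Z]{2,}|\d|[-/_]")

def load_tokens(path):
    with open(path, encoding="utf-8") as f:
        return f.read().split()

def word_accuracy(gold_tokens, ocr_tokens):
    # Rough bag-of-words recall; a real evaluation would align lines/pages first.
    ocr_counts = Counter(ocr_tokens)
    hits = 0
    for tok in gold_tokens:
        if ocr_counts[tok] > 0:
            ocr_counts[tok] -= 1
            hits += 1
    return hits / max(len(gold_tokens), 1)

gold = load_tokens("heldout_gold.txt")  # placeholder file names
ocr = load_tokens("heldout_ocr.txt")

natural = [t for t in gold if not DOMAIN_TOKEN.search(t)]
domain = [t for t in gold if DOMAIN_TOKEN.search(t)]

print("natural-language word accuracy:", word_accuracy(natural, ocr))
print("domain-specific word accuracy:", word_accuracy(domain, ocr))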
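And here, in similarly simplified form, is the Experiment 2 selection step (keep frequent words, over-sample words that contain rare character bigrams). The thresholds, boost factor, and file names are placeholders, not the values I actually used:

from collections import Counter

WORD_FREQ_THRESHOLD = 50     # placeholder: keep words seen at least this often
RARE_BIGRAM_THRESHOLD = 100  # placeholder: bigrams rarer than this are "low frequency"
BOOST_FACTOR = 20            # placeholder: how strongly to over-sample rare-bigram words

with open("corpus.txt", encoding="utf-8") as f:  # the ~120 M token corpus
    tokens = f.read().split()

word_freq = Counter(tokens)
bigram_freq = Counter(b for w in tokens for b in zip(w, w[1:]))

# Words frequent enough to go into the training set directly.
frequent_words = [w for w, f in word_freq.items() if f >= WORD_FREQ_THRESHOLD]

# Words containing at least one rare character bigram are appended
# multiple times so their rare contexts are seen often enough in training.
rare_bigram_words = [
    w for w in word_freq
    if any(bigram_freq[b] < RARE_BIGRAM_THRESHOLD for b in zip(w, w[1:]))
]

training_words = frequent_words + rare_bigram_words * BOOST_FACTOR

with open("training_words.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(training_words))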
At this point I doubt that the way I am training Tesseract is correct, so I would like to ask the community the following questions:
- Should I use natural-language text or a dictionary of words for the training and evaluation data sets?
- How important is the effect of token redundancy? (Are the errors in recognizing natural-language words caused by those words occurring only once in the training data?)
- How can I get Tesseract to recognize freely generated tokens that do not occur in the training data?
Thanks,
Alex