Hi,
We are interested in improving the performance of Tesseract, and we have prepared a large set of over 11k pages annotated manually with text-line bounding boxes and the transcribed text. We have been evaluating fine-tuning Tesseract with this set, and we observed a slight decrease in performance; we would like to identify the issue and run the fine-tuning again. We have some questions about the process, and we would be grateful if you could help us understand the fine-tuning process for Tesseract.
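For context, below is a minimal Python sketch of the kind of fine-tuning invocation we are running, wrapping the documented lstmtraining flags; the paths, list files, and iteration count are placeholders rather than our exact configuration:

import subprocess

# Sketch of a fine-tuning run using the documented lstmtraining flags.
# All paths and numeric values are placeholders.
subprocess.run(
    [
        "lstmtraining",
        "--continue_from", "eng.lstm",          # extracted via: combine_tessdata -e eng.traineddata eng.lstm
        "--old_traineddata", "eng.traineddata", # base model we start from
        "--traineddata", "data/eng/eng.traineddata",  # starter traineddata with the unicharset
        "--train_listfile", "train_files.txt",  # .lstmf files built from our annotated pages
        "--eval_listfile", "eval_files.txt",
        "--model_output", "output/finetuned",   # checkpoint prefix
        "--max_iterations", "10000",
    ],
    check=True,
)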
We have run several tests to fine-tune Tesseract using this set, with mixed results. We evaluate performance against an existing benchmark that we call the mini-holistic set. The metrics we consider are Levenshtein distance and the percentage of missing words (computed over unique words). Using our manually annotated set we obtain a similar Levenshtein distance (probably not statistically different), but we get a higher percentage of missing words, e.g. rising from 7% to over 9.6%.
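To make the metrics concrete, here is a simplified sketch of how we compute them; it assumes character-level Levenshtein distance and that "% of missing words" counts unique ground-truth words absent from the OCR output, so please flag it if you compute these differently:

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def missing_word_pct(ground_truth: str, ocr_output: str) -> float:
    """Percentage of unique ground-truth words never appearing in the OCR output."""
    gt_words = set(ground_truth.split())
    ocr_words = set(ocr_output.split())
    if not gt_words:
        return 0.0
    return 100.0 * len(gt_words - ocr_words) / len(gt_words)

# Example: levenshtein("kitten", "sitting") == 3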
Our results were mixed: we saw significant improvement on some files while others got much worse. Documents with tables, and documents that do not appear to be scanned, improved on the evaluation metrics. On scanned documents, performance seemed to be worse with the fine-tuned model. This polarization effect was greater than when training with only high-quality data.
We find that this parameter has no impact: the BCER is similar to that of other experiments run without it.
We find that set (a) reaches a low BCER of 0.042 (i.e. 4.2%) during training, while set (b) stays at around 6% BCER; however, the Levenshtein distance and the percentage of missing words on the benchmark are similar to previous outputs for both (a) and (b).