Hi,
We are interested in improving the performance of Tesseract, and we have prepared a large set of over 11k pages annotated manually with text-line bounding boxes and the transcribed text. We have been evaluating fine-tuning Tesseract with this set, and we observed a slight decrease in performance; we would like to identify the issue and run the fine-tuning again. We have some questions about the process, and we would be grateful if you could help us understand the fine-tuning process for Tesseract.
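For context, below is a minimal Python sketch of the kind of fine-tuning invocation we are running, wrapping the documented lstmtraining flags; the paths, list files, and iteration count are placeholders rather than our exact configuration:

import subprocess

# Sketch of a fine-tuning run using the documented lstmtraining flags.
# All paths and numeric values are placeholders.
subprocess.run(
    [
        "lstmtraining",
        "--continue_from", "eng.lstm",          # extracted via: combine_tessdata -e eng.traineddata eng.lstm
        "--old_traineddata", "eng.traineddata", # base model we start from
        "--traineddata", "data/eng/eng.traineddata",  # starter traineddata with the unicharset
        "--train_listfile", "train_files.txt",  # .lstmf files built from our annotated pages
        "--eval_listfile", "eval_files.txt",
        "--model_output", "output/finetuned",   # checkpoint prefix
        "--max_iterations", "10000",
    ],
    check=True,
)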
We have run several tests to fine-tune Tesseract using this set, with mixed results. We evaluate performance against an existing benchmark that we call the mini-holistic set. The metrics we consider are Levenshtein distance and the percentage of missing words (computed over unique words). Using our manually annotated set we obtain a similar Levenshtein distance (probably not statistically different), but we get a higher percentage of missing words, e.g. rising from 7% to over 9.6%.
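To make the metrics concrete, here is a simplified sketch of how we compute them; it assumes character-level Levenshtein distance and that "% of missing words" counts unique ground-truth words absent from the OCR output, so please flag it if you compute these differently:

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def missing_word_pct(ground_truth: str, ocr_output: str) -> float:
    """Percentage of unique ground-truth words never appearing in the OCR output."""
    gt_words = set(ground_truth.split())
    ocr_words = set(ocr_output.split())
    if not gt_words:
        return 0.0
    return 100.0 * len(gt_words - ocr_words) / len(gt_words)

# Example: levenshtein("kitten", "sitting") == 3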
Our results were mixed: we saw significant improvement on some files while others got much worse. Documents with tables, and documents that do not appear to be scanned, improved on the evaluation metrics. On scanned documents, performance seemed to be worse with the fine-tuned model. This polarization effect was greater than when training with only high-quality data.
We find that this parameter has no impact: the BCER is similar to that of other experiments run without it.
We find that set (a) reaches a low BCER of 0.042 (i.e. 4.2%) during training, while set (b) stays at around 6% BCER; however, the Levenshtein distance and the percentage of missing words on the benchmark are similar to previous outputs for both (a) and (b).