I'm working on a slide digitization project for a collection of 35mm slides, all similar to the one attached, and I'd like to improve my Tesseract output for them. My preprocessing steps are as follows:
1) Crop the scan into its top third and bottom third to remove the center image
2) Apply a Gaussian blur
3) Apply Otsu thresholding with OpenCV
I then run Tesseract on each chunk of the image with load_system_dawg and load_freq_dawg set to false to ignore the main Tesseract dictionary.
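The Tesseract call is roughly equivalent to the following sketch. It assumes the `pytesseract` wrapper is installed; the helper names `build_tesseract_config` and `ocr_band` are made up for this example, and the two `-c` options are just the dictionary settings mentioned above.

```python
def build_tesseract_config():
    """Build a config string that turns off both Tesseract word lists."""
    return "-c load_system_dawg=false -c load_freq_dawg=false"


def ocr_band(binary_band):
    """Run Tesseract on one preprocessed band and return the recognized text."""
    # Imported here so the config helper can be used without pytesseract installed.
    import pytesseract  # third-party wrapper; assumed to be available

    return pytesseract.image_to_string(binary_band, config=build_tesseract_config())
```

Each thresholded band is passed to `ocr_band` separately, and I collect the returned strings per slide.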
I've had mixed success with the slides. The biggest problem is that each slide is marked with a red circle that can overlap the text and ruin the thresholding.
The result I get on the attached image is:
RDUSSEAU . H P32. R7 52
WAR A
DETAIL: center w/ girl .
[IBQHI '
Pgris: Mueee g'Orsay
i; "v ‘ ..:M“W ‘1Pvt. Collection, Paris.
Varnedoe Photo.
What can I do to improve my preprocessing for Tesseract, and are there specific Tesseract parameters I can adjust to improve the output? How can I separate the text from the red circle overlays?
Thank you very much for any suggestions!