35mm Slide OCR


Lauren Arnett

Jan 30, 2018, 1:44:45 AM
to tesseract-ocr
I'm working on a digitization project for a collection of 35mm slides, all similar to the one attached, and I'd like to improve my Tesseract output for them. My preprocessing steps are as follows:

1) Crop out the top third and bottom third of the scan to exclude the center image
2) Apply a Gaussian blur
3) Apply Otsu thresholding with OpenCV

I then run Tesseract on each chunk of the image with load_system_dawg and load_freq_dawg set to false, so that Tesseract ignores its main and frequent-word dictionaries.

I've had mixed success with the slides. The main trouble is that each slide is marked with a red circle that can overlap the text and ruin the thresholding.

The result I get on the attached image is:

RDUSSEAU . H P32. R7 52
WAR A

DETAIL: center w/ girl .
[IBQHI '

Pgris: Mueee g'Orsay
i; "v ‘ ..:M“W ‘1Pvt. Collection, Paris.
Varnedoe Photo.


What can I do to improve my preprocessing for Tesseract, and are there specific Tesseract parameters I can adjust to improve the output? How can I separate the text from the red circle overlays?

Thank you very much for any suggestions!
67615.png

Tom Morris

Jan 30, 2018, 9:23:02 PM
to tesseract-ocr
Are you sure you attached the correct image? That looks more like a Rodin than a Rousseau.

The red circle and printing might be amenable to a color selection technique, after which you could desaturate them, lighten them, replace them with the background color, etc. Of course, if the red overlaps the text, that'll complicate things. The foxing is also going to cause problems due to its uneven nature, but on the plus side, you've got pretty good dark blacks to work with in the print.

The Rodin label looks like it's lifted on one side, warping the image. If that's common, you might want to consider a dewarping algorithm. Ditto for deskewing crooked labels.

Good luck! Looks like a fun project.

Tom