convert -density 350 input.pdf -type Grayscale -background white +matte -depth 32 input.tif
2. Clean the TIFF file using the text cleaner script [1]
textcleaner -t 25 -s 1 -g input.tif cleaned.tif
3. OCR the cleaned TIFF file.
tesseract cleaned.tif ./test-ocr
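For anyone who wants to try the same steps on several files, the three commands above could be wrapped in a small shell function. This is only a rough sketch: the names `run` and `ocr_pdf`, the `DRY_RUN` switch, and the output file naming are made up for illustration, and the options are just the ones quoted above, not tuned recommendations.

```shell
#!/bin/sh
# Sketch only: chains the convert / textcleaner / tesseract steps quoted above.
# The function names, DRY_RUN, and the file-naming scheme are illustrative assumptions.

run() {
    # Print the command; execute it only when DRY_RUN is not set to 1.
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

ocr_pdf() {
    pdf="$1"
    base="${pdf%.pdf}"   # strip the .pdf suffix to build the output names

    # 1. Rasterize the PDF to a grayscale TIFF (same options as in the post).
    run convert -density 350 "$pdf" -type Grayscale -background white \
        +matte -depth 32 "${base}.tif" || return 1

    # 2. Clean the TIFF with the textcleaner script.
    run textcleaner -t 25 -s 1 -g "${base}.tif" "${base}-cleaned.tif" || return 1

    # 3. OCR the cleaned TIFF; tesseract adds the .txt extension itself.
    run tesseract "${base}-cleaned.tif" "${base}-ocr" || return 1
}
```

After sourcing the file, `DRY_RUN=1 ocr_pdf input.pdf` prints the three commands without running them, which is a cheap way to check the file names before a long batch. Without DRY_RUN the commands are executed, and the function stops at the first step that fails.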
Any thoughts on ways to improve the accuracy will be gratefully received.
With thanks.
-Corey
$ tesseract filmdailyyearboo00film_4_0742.jp2 pg738 hocr
Many thanks to those who have replied to my question, both here on the group and privately. The replies have given us some avenues to explore in extracting and preserving this information.