Hi all,
I got a chance to put my project on GitHub. Summary:
Run Optical Character Recognition (OCR) on millions of images, using multiple machines, and save the results in a DB for analysis and other uses.
We had 25 million images, averaging 5 MB each, some of which contained text of varying legibility. We wanted to be able to search the images using the Solr search engine, so we needed the text in UTF-8. OCR using the excellent Tesseract took a few minutes per image, so a single machine would have needed years to finish, and we did not have years for the job.
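To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes "a few minutes" means 2 minutes per image and that the work splits evenly across identical workers. Both are my assumptions for illustration, not measurements from the project, and the function name is made up. Under those assumptions a single worker comes out at roughly 95 years.

```python
def wall_clock_days(num_images, minutes_per_image, num_workers):
    """Days needed if the images are split evenly across identical workers."""
    total_minutes = num_images * minutes_per_image
    return total_minutes / num_workers / (60 * 24)


NUM_IMAGES = 25_000_000
MINUTES_PER_IMAGE = 2  # assumption: "a few minutes" taken as 2

# Print the estimated duration for a few (hypothetical) cluster sizes.
for workers in (1, 10, 100, 1000):
    days = wall_clock_days(NUM_IMAGES, MINUTES_PER_IMAGE, workers)
    print(f"{workers:>5} workers: {days:>10.1f} days ({days / 365:.1f} years)")
```

Plug in your own per-image time and worker count. The estimate ignores scheduling overhead, failed jobs, and DB writes, so treat it as a lower bound.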
If you have a small number of images, you may still be interested in this project because the image preprocessing makes for better OCR results from Tesseract.