github: OCR using multiple machines


Rick Leir

Mar 17, 2018, 3:46:07 AM
to tesseract-ocr

Hi all,

I got a chance to put my project on GitHub. Summary:

Run Optical Character Recognition on millions of images, using multiple machines and saving the results in a DB for analysis and other uses.

We had 25 million images, averaging 5 MB each, some of which contained text of varying legibility. We wanted to be able to search the images using the Solr search engine, so we needed the text in UTF-8. OCR with the excellent Tesseract took a few minutes per image, and we did not have years for the job.

If you have a small number of images, you may still be interested in this project because the image preprocessing makes for better OCR results from Tesseract.

It has been a while since I wrote this, but it could be useful to an organization doing bulk OCRing. 
cheers -- Rick 

Rick Leir

Jan 20, 2019, 4:47:39 PM
to tesseract-ocr
The project was moved on GitHub, and the link is now
https://github.com/crkn-rcdr/tesseract-ocr-test

cheers -- Rick 