FOSS Project Proposal: tesseract-cloud

Rich Jones

Mar 21, 2017, 4:03:52 PM

to tesseract-ocr

Hello, all!

I'm currently talking with a group of MuckRock users about automatically OCR'ing a very large set (tens of millions) of CIA documents.

It looks like this will take many months to scan on a single machine, but I think it could happen in far less time if done in parallel on AWS Lambda (or similar) or on an elastic cluster.

It will take a little bit of work to design and build this architecture (two architectures, in fact: one optimized for speed and one optimized for cost), so I think it would be nice if we could build out this system in a way that would benefit the larger community. Therefore, I'd like to float the proposal that we start a new Free and Open Source Software project for tools, templates and guides to build queue-based elastic and serverless Tesseract systems capable of quickly and affordably scanning millions of documents in the cloud.
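To make the queue-based idea a bit more concrete, here is a rough, single-machine sketch in Python of the pattern I have in mind: a queue of document paths drained by a pool of workers. The names `run_ocr_queue` and `tesseract_cli` are made up for illustration, and the in-process queue and threads are only stand-ins for whatever managed queue and Lambda functions or cluster nodes a real deployment would use. It assumes the `tesseract` command-line tool is installed and on the PATH.

```python
import queue
import subprocess
import threading


def tesseract_cli(path):
    """OCR one image by shelling out to the tesseract CLI.

    Assumes `tesseract` is installed; `stdout` as the output base
    makes it print the recognized text instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_ocr_queue(paths, ocr_fn=tesseract_cli, workers=4):
    """Drain a queue of document paths with a pool of worker threads.

    Returns (results, errors): path -> text and path -> exception.
    Hypothetical helper; a cloud version would swap the local queue
    for a managed one and the threads for functions or nodes.
    """
    jobs = queue.Queue()
    for p in paths:
        jobs.put(p)

    results, errors = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                path = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            try:
                text = ocr_fn(path)
                with lock:
                    results[path] = text
            except Exception as exc:
                # Record the failure and keep going, so one bad
                # document does not stall the whole batch.
                with lock:
                    errors[path] = exc

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors
```

Calling `run_ocr_queue(["page1.png", "page2.png"], workers=8)` would OCR both files and return the text keyed by path, with any failures collected separately so they can be retried. Passing a different `ocr_fn` makes it easy to test the plumbing without Tesseract installed.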

Would anybody on this list be interested in working on something like this?

Even more specifically: since Google maintains ownership of the Tesseract project and also owns the Google Cloud Platform, would Google be willing to devote some resources to sponsoring the creation of this project if it were designed to run on Google Cloud (GCE/GCF) using Google technologies (k8s)? If not, does anybody know of any other organizations that would be interested in throwing some resources at this?

It's just an idea, but it's something I'd like to work on if the resources are available, and I think it would have a very large impact for a number of different communities.

Thanks for your consideration and feedback,
Rich Jones
https://github.com/Miserlou

Derek

Mar 22, 2017, 11:17:41 AM

to tesseract-ocr

That's a great idea -- I don't have spare time for new projects at the moment, but I wonder if something like OpenOCR might be useful as a starting point for an effort like this: https://github.com/tleyden/open-ocr
