Is there any way to speed up extraction with the Tesseract OCR engine when a TIFF file has 600-700 pages?

James Worldprogram

Apr 19, 2015, 8:23:20 AM
to tesser...@googlegroups.com

Processing TIFF files of 600-700 pages with the Tesseract OCR engine and the hocr option takes around 40-50 minutes per file.

That is a lot of time for processing large files.

Is there any way to speed up the process?

We are using the following command:

<Drive>:\Tesseract-OCR>tesseract.exe "Source_Tiff_File" "Destination_File" hocr

Tom Morris

Apr 19, 2015, 11:31:36 PM
to tesser...@googlegroups.com
On Sunday, April 19, 2015 at 8:23:20 AM UTC-4, James Worldprogram wrote:

Processing TIFF files of 600-700 pages with the Tesseract OCR engine and the hocr option takes around 40-50 minutes per file.

So, roughly 14-15 pages/minute or 4-5 seconds/page. What is your goal for performance? What kind of system are you running on? What is the resolution and size of your images? What version of Tesseract are you running? What ... Well, you get the idea. What are all the other details you left out?
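For anyone checking the arithmetic, here is a small sketch that works out the throughput at the corners of the reported ranges (600-700 pages, 40-50 minutes). The function name `throughput` is only illustrative.

```python
def throughput(pages, minutes):
    """Return (pages per minute, seconds per page) for one run."""
    pages_per_minute = pages / minutes
    seconds_per_page = minutes * 60 / pages
    return pages_per_minute, seconds_per_page


# Corners of the ranges given in the original post.
for pages, minutes in [(600, 40), (600, 50), (700, 40), (700, 50)]:
    ppm, spp = throughput(pages, minutes)
    print(f"{pages} pages in {minutes} min: {ppm:.1f} pages/min, {spp:.1f} s/page")
```

Depending on which corner applies, that is about 12 to 17.5 pages/minute, or about 3.4 to 5 seconds/page, so the figures above are a fair middle estimate.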

Allistair C

Apr 20, 2015, 4:08:47 AM
to tesser...@googlegroups.com
What Tom said.

However, let's assume all your variables are constant: the resolution has to be just what you have, the file format has to be TIFF, and so on. Then you can use a divide-and-conquer distributed computing pattern. That is, set up a machine that holds a queue of work and have that queue farm the work out to worker machines. You scale to as many worker machines as it takes to bring your problem within tolerance. A great and relatively inexpensive platform to do this on would be AWS. The same pattern is how render farms work, such as those used to render Pixar movies. In short, if nothing else can be optimised, stop trying to scale vertically and switch to horizontal scaling.
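Before going to multiple machines, the same idea can be tried on one multi-core box. The following is only a minimal sketch, assuming the big TIFF has already been split into single-page files by some other tool (the file names and the `out` directory are hypothetical) and that `tesseract` is on the PATH. It reuses the command form from the original post and runs several pages at once; the helper names are made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(page_tiff, out_dir):
    """Build the tesseract command for one single-page TIFF (same form as the original post)."""
    out_base = Path(out_dir) / Path(page_tiff).stem
    return ["tesseract", str(page_tiff), str(out_base), "hocr"]


def run_tesseract(cmd):
    """Run one tesseract command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def ocr_pages(page_tiffs, out_dir, workers=4, runner=run_tesseract):
    """Run one tesseract process per page, at most `workers` at a time.

    Threads are enough here because the heavy work happens in the
    tesseract child processes, not in Python.
    """
    cmds = [build_command(p, out_dir) for p in page_tiffs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, cmds))


if __name__ == "__main__":
    # Hypothetical layout: pages/page_0001.tif, pages/page_0002.tif, ...
    pages = sorted(Path("pages").glob("*.tif"))
    Path("out").mkdir(exist_ok=True)
    exit_codes = ocr_pages(pages, "out", workers=4)
    print(f"{exit_codes.count(0)} of {len(exit_codes)} pages succeeded")
```

A reasonable starting value for `workers` is the number of CPU cores. The per-page hOCR files would still need to be merged afterwards, which this sketch does not attempt. Moving to worker machines replaces the thread pool with a real job queue, but the per-page command stays the same.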

All that said, I would start by answering Tom's questions. Your resolution may be higher than you need for the recognition problem you have, your machine may be badly underpowered, or maybe you just want a 10% speed-up. You didn't really say much.