Tesseract speed

Priya

May 7, 2020, 1:29:56 PM
to tesseract-ocr
Tesseract 4 is too slow. It takes almost 3-4 seconds to process a single page. Can anyone suggest a way to speed it up?

himanshu chawla

May 7, 2020, 1:33:04 PM
to tesser...@googlegroups.com
Try setting OMP_THREAD_LIMIT=1 in your environment variables.
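For anyone calling Tesseract from a script, here is a minimal sketch of that suggestion. It assumes the `tesseract` binary is on your PATH and that your build uses OpenMP (otherwise the variable has no effect). The helper names `single_thread_env` and `ocr_to_text` are made up for illustration and are not part of Tesseract.

```python
import os
import subprocess


def single_thread_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def ocr_to_text(image_path, tesseract_cmd="tesseract"):
    """Run the tesseract CLI on one image and return the recognized text.

    Assumes `tesseract_cmd` resolves to an installed tesseract binary.
    """
    completed = subprocess.run(
        [tesseract_cmd, image_path, "stdout"],  # "stdout" prints text instead of writing a file
        env=single_thread_env(),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return completed.stdout
```

Whether this helps depends on your machine and workload, so it is worth timing a few pages with and without the variable set.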

farhad khalafi

May 7, 2020, 2:51:15 PM
to tesseract-ocr
3-4 seconds for a single page is probably not that slow, depending on the page content and layout.

We have a huge OCR project with approximately 16 million images to process. Our configuration has 4 virtual machines each with 8 cores and 16GB memory. They work against a single input queue and process 8 pages at a time per VM. The CPU utilization on each VM is kept close to 100% by design. 

We use a producer/consumer model with multiple threads and can process in excess of 10,000 pages per hour (we even do multiple OCR passes on certain pages where the orientation is uncertain). This setup did require a fair amount of development to create, but it is working fine and has processed more than 3 million pages already. The Tesseract and Leptonica engines have performed flawlessly in this rather demanding setup.

The point is that a single page on a single thread may feel slow, but a batch process with a large degree of parallelism performs rather well.
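To give a rough idea of the producer/consumer pattern, here is a simplified sketch in Python. It is not our production code: it assumes the `tesseract` CLI is on the PATH, runs one single-threaded Tesseract process per page, and uses hypothetical names (`ocr_page`, `run_pipeline`) chosen only for this example.

```python
import os
import queue
import subprocess
import threading

_STOP = object()  # sentinel that tells a consumer thread to exit


def ocr_page(image_path):
    """OCR one image with the tesseract CLI, limited to one OpenMP thread."""
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    done = subprocess.run(
        ["tesseract", image_path, "stdout"],
        env=env, capture_output=True, text=True, check=True,
    )
    return done.stdout


def run_pipeline(image_paths, worker_count=8, ocr=ocr_page):
    """Feed image paths through a bounded queue to `worker_count` consumers.

    Returns a dict mapping each path to its text, or to the exception
    raised while processing that page.
    """
    jobs = queue.Queue(maxsize=worker_count * 2)  # bounded so the producer cannot run far ahead
    results = {}
    lock = threading.Lock()

    def consumer():
        while True:
            path = jobs.get()
            if path is _STOP:
                return
            try:
                outcome = ocr(path)
            except Exception as exc:  # keep the worker alive if one page fails
                outcome = exc
            with lock:
                results[path] = outcome

    workers = [threading.Thread(target=consumer) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for path in image_paths:  # the producer side: enqueue the work
        jobs.put(path)
    for _ in workers:  # one stop sentinel per consumer
        jobs.put(_STOP)
    for worker in workers:
        worker.join()
    return results
```

Threads are enough here because each worker spends its time waiting on an external Tesseract process. Setting `worker_count` to roughly the number of cores, with one OpenMP thread per process, is a reasonable starting point to tune from.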