Guidance to improve speed.

Vijender Chaudhary

Aug 18, 2021, 6:21:50 AM
to tesseract-ocr

Hi,

I am working on a printed document in English. I need to extract all the text from its image, but with plain Tesseract this takes 4 seconds. Is it possible to fine-tune it for only the English alphabet and numbers? Please help.

Helmut Wollmersdorfer

Aug 18, 2021, 8:19:05 AM
to tesseract-ocr
The default language model of Tesseract is the one for English. It is the same one you get with the command-line option '-l eng'. This model uses a reasonably small character set of letters, punctuation, numbers, and some symbols.
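
To check this on your own pages, you could time a run with and without '-l eng'. The following is only a rough sketch: it assumes the `tesseract` binary is on your PATH, and the helper names `build_tesseract_cmd` and `timed_ocr` and the file name `page.png` are made up for illustration.

```python
import subprocess
import time


def build_tesseract_cmd(image_path, lang=None):
    """Build a Tesseract CLI call that prints the recognized text to stdout.

    With lang=None no '-l' option is passed, so Tesseract uses its default.
    """
    cmd = ["tesseract", image_path, "stdout"]
    if lang is not None:
        cmd += ["-l", lang]
    return cmd


def timed_ocr(image_path, lang=None):
    """Run Tesseract once and return (text, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout, time.perf_counter() - start


# Example call (assumes a file named page.png exists):
# for lang in (None, "eng"):
#     text, seconds = timed_ocr("page.png", lang)
#     print(f"lang={lang!r}: {len(text)} characters in {seconds:.2f} s")
```

If the default really is English, both runs should give the same text in about the same time.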

You can save a small amount of time by using a lower resolution, because there is not much difference in OCR quality between 150 and 300 dpi. But downscaling the images also takes time, maybe more than it saves. The largest factor in the time needed is the number of characters on the page.
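
If you want to try the lower resolution, a sketch like the one below may help. It assumes the scan is at 300 dpi and that the Pillow library is installed. The names `scaled_size` and `downscale` are hypothetical helpers, not part of Tesseract. Measure the resize step together with the OCR step, since the conversion itself costs time.

```python
import time


def scaled_size(width, height, src_dpi=300, dst_dpi=150):
    """Return the pixel size of the same page at a lower resolution."""
    factor = dst_dpi / src_dpi
    return max(1, round(width * factor)), max(1, round(height * factor))


def downscale(src_path, dst_path, src_dpi=300, dst_dpi=150):
    """Resize an image file with Pillow; return the elapsed seconds.

    Pillow is imported here so that scaled_size() works without it.
    """
    from PIL import Image

    start = time.perf_counter()
    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, src_dpi, dst_dpi)
        # Default resampling filter; record the new dpi in the output file.
        img.resize(new_size).save(dst_path, dpi=(dst_dpi, dst_dpi))
    return time.perf_counter() - start


# An A4 page scanned at 300 dpi is about 2480 x 3508 pixels.
print(scaled_size(2480, 3508))  # (1240, 1754)
```

Add the time returned by `downscale` to the OCR time of the smaller image before comparing it with the OCR time of the original.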
