Extracting pristine rasterized text

Patrick Ramsey

Mar 30, 2018, 8:49:57 AM

to tesseract-ocr

Hi!

So, I am running Tesseract 4 on clean, 1-bit images of rasterized text (not printed and scanned). I'm getting very accurate output, as expected, but tesseract is taking about 1 second to process a single page on a Core i7 CPU, and that seems a lot longer than I'd have expected.

I've been trying to enable debug output so that I can see what's taking the most time, to see if there is anything that I could get away with turning off to speed it up (since I don't need to account for e.g. dirt on the lens), but thus far I'm feeling pretty stupid. So:

A) Is there any straightforward way to get more information on what tesseract is actually doing? (I've built with --enable-debug, and it doesn't seem to have changed the output on the command line.)
B) Are there any control parameters you folks would suggest setting to speed up image processing or turn off unnecessary work, given the inputs I've described?

Many thanks,

PTR

ShreeDevi Kumar

Mar 30, 2018, 9:00:18 AM

to tesser...@googlegroups.com

Please check GitHub/issues for similar reports and suggestions.

Also specify,

Which version/commit of tesseract 4

Which traineddata file, from which repo

Which o/s

tesseract -v


ShreeDevi Kumar

Mar 30, 2018, 9:01:14 AM

to tesser...@googlegroups.com

Please also note that building with --enable-debug will by itself make it slower.

Robert Komar

Mar 30, 2018, 7:26:32 PM

to tesseract-ocr

If you're using Linux, then "man gprof" will tell you how
to get profile data that shows where the program is spending
its time. Enabling debugging will help you step through
the code as it runs, but that gives only a rough (and
maybe inaccurate) guess about what takes a long time to
compute.
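
For reference, the gprof workflow mentioned above might be scripted roughly as follows. This is only a sketch, assuming tesseract has been rebuilt with the compiler's -pg flag so that each run writes gmon.out to the current directory. The binary path, the image name, and the helper names gprof_workflow and run_workflow are placeholders, not anything from this thread.

```python
import subprocess


def gprof_workflow(binary, image, outbase="out"):
    """Return the two commands of a basic gprof session as argument lists.

    Assumes `binary` is a tesseract executable built with -pg; an ordinary
    build will not write gmon.out.
    """
    # Step 1: a normal OCR run; the instrumented binary writes gmon.out on exit.
    run_cmd = [binary, image, outbase]
    # Step 2: gprof reads the binary plus gmon.out; -b omits the explanatory text.
    report_cmd = ["gprof", "-b", binary, "gmon.out"]
    return run_cmd, report_cmd


def run_workflow(binary, image, outbase="out"):
    """Execute both steps and return gprof's report as text."""
    run_cmd, report_cmd = gprof_workflow(binary, image, outbase)
    subprocess.run(run_cmd, check=True)
    result = subprocess.run(report_cmd, check=True,
                            capture_output=True, text=True)
    return result.stdout
```

Something like run_workflow("./tesseract", "page.png") would then return the profile as text. One caveat: gprof generally does not attribute time spent inside shared libraries, so if most of the work lives in a shared libtesseract the report may look nearly empty, and a statically linked build is probably needed.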

If you don't want to rebuild tesseract with profiling
enabled, then the "oprofile" package on Linux can be used
to get profiling data. It's more complicated than gprof,
but also much more powerful.

Cheers,
Rob Komar

On Thu, 29 Mar 2018, Patrick Ramsey wrote:

> Hi!
>
> So, I am running tesseract4 on clean, 1-bit images of
> rasterized text (not printed and scanned). I'm getting very
> accurate output, as expected, but tesseract is taking
> about 1 second to process a single page on a core i7 cpu,
> and that seems a lot longer than I'd have expected.
>
> I've been trying to enable debug output so that I can see
> what's taking the most time, to see if there is anything
> that I could get away with turning off to speed it up
> (since I don't need to account for e.g. dirt on the lens),
> but thus far I'm feeling pretty stupid. So:

Patrick Ramsey

Apr 2, 2018, 9:43:13 PM

to tesseract-ocr

Answers below inline. And thank you very much for your help :)

|PTR

On Friday, March 30, 2018 at 2:00:18 AM UTC-7, shree wrote:

> Please check GitHub/issues for similar reports and suggestions.
>
> Also specify,
>
> Which version/commit of tesseract 4

commit hash: 40f43111e05b3dd2f2f8aeae3aba33016523c881
tag: 4.0.0-beta.1

> Which traineddata file, from which repo

eng.traineddata from https://github.com/tesseract-ocr/tessdata at commit 9b2e3f6642285b3e9a7a5852e5b10259e42d5510

> Which o/s

Ubuntu 17.10 on amd64

> tesseract -v

tesseract 4.0.0-beta.1
leptonica-1.74.4
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.2.0

Found AVX2
Found AVX
Found SSE

ShreeDevi Kumar

Apr 3, 2018, 1:17:00 AM

to tesser...@googlegroups.com

Thank you for the detailed info.

My suggestion is to try recognition with eng.traineddata from the tessdata_fast repository with --oem 1.
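
To compare this against the current setup, something like the following could time a run with the tessdata_fast model. It is only a sketch: the directory ./tessdata_fast (assumed to be a local checkout holding eng.traineddata), the image name page.png, and the helper names tesseract_cmd and time_cmd are all assumptions. --tessdata-dir, -l and --oem are the usual tesseract command-line options.

```python
import subprocess
import time


def tesseract_cmd(image, outbase, tessdata_dir=None, lang="eng", oem=1):
    """Build a tesseract command line; --oem 1 selects the LSTM engine."""
    cmd = ["tesseract", image, outbase]
    if tessdata_dir is not None:
        # Point tesseract at an alternative model directory, e.g. tessdata_fast.
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += ["-l", lang, "--oem", str(oem)]
    return cmd


def time_cmd(cmd, repeats=3):
    """Run cmd `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, time_cmd(tesseract_cmd("page.png", "out_fast", tessdata_dir="./tessdata_fast")) could be compared with the same call pointed at the existing tessdata directory. Each measurement includes process start-up and model loading, so it is an upper bound on the recognition time itself.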
