Tesseract performance (speed and accuracy)

viraf

Feb 14, 2016, 11:15:12 AM

to tesseract-ocr

I am new to tesseract and using it through Tess4J. I am trying to OCR faxes where pages are represented as TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1 bit - i.e. BW).

I have two sets of questions.

Speed

On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 pages per minute (PPM) using 1 thread. I am looking for suggestions on how to speed up page processing. I use parallelStream to process each page in a separate thread.

Training

I am trying to learn about training Tesseract for improved accuracy. Given that the fonts / box files used to generate eng.traineddata are not available, can one specify the fonts used for English?

Also, is there a description of the various training artifacts? I used "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to extract the wordlist, but I am looking for documentation to better understand the various training artifacts.

Thanks

- viraf

Tom Morris

Feb 15, 2016, 1:22:57 PM

to tesseract-ocr

On Sunday, February 14, 2016 at 11:15:12 AM UTC-5, viraf wrote:

Speed
On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 pages per minute (PPM) using 1 thread. I am looking for suggestions on how to speed up page processing. I use parallelStream to process each page in a separate thread.

You don't say what resolution or image format, what language(s), or what version of Tesseract you're using -- all of which are pretty critical when discussing performance. Having said that, I just ran a 110 page document in 272 seconds on a recent MacBook Pro. There were ~100 pages of mixed density text totalling 160k characters in CCITT G4 fax bitonal images of 2550x3300 pixels.

That's four times the speed you quote, so I suspect you're reinitializing Tesseract for every page or taking a big hit on image processing or something else unrelated to the core OCR engine.
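The "four times" comparison can be checked with a quick calculation. This is only a sketch of the arithmetic using the figures quoted in the thread (110 pages in 272 seconds versus roughly 6 PPM); the function name is made up for illustration.

```python
def pages_per_minute(pages: int, seconds: float) -> float:
    """Throughput in pages per minute for a run of `pages` pages."""
    return pages * 60.0 / seconds


# Figures quoted in the thread: a 110-page run in 272 seconds,
# compared with the roughly 6 PPM reported in the original question.
tom_ppm = pages_per_minute(110, 272)
viraf_ppm = 6.0

print(f"{tom_ppm:.1f} PPM, about {tom_ppm / viraf_ppm:.1f}x the quoted speed")
```

The result is a little over 24 PPM, which is where the "24 PPM" figure used later in the thread comes from.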

Training
I am trying to learn about training Tesseract for improved accuracy. Given that the fonts / box files used to generate eng.traineddata are not available, can one specify the fonts used for English?

The font list is included in the eng.inttemp file that you extracted. Given that it's something like 350 fonts, you'd have to be looking at a pretty exotic font to need to retrain for that reason.

Also, is there a description of the various training artifacts? I used "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to extract the wordlist, but I am looking for documentation to better understand the various training artifacts.

Have you reviewed the training documentation on the wiki?

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

Tom

viraf

Feb 15, 2016, 8:24:48 PM

to tesseract-ocr

Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1 bit - i.e. BW). The language is English. I am using Tess4J 3.0, which includes Tesseract 3.04. I am instantiating a new Tesseract object for each page; however, the cost was minimal (74ms) for the total run. I'll investigate further whether the Java APIs are calling init elsewhere.

When you state "taking a big hit on image processing" how would I be able to isolate the issue to image processing?

Thanks for your help.

- viraf

viraf

Feb 15, 2016, 9:23:09 PM

to tesseract-ocr

Also wanted to clarify that your 24PPM was obtained on a single thread, and did not leverage GPU. Thanks - viraf

Tom Morris

Feb 16, 2016, 1:53:40 AM

to tesser...@googlegroups.com

On Mon, Feb 15, 2016 at 8:24 PM, viraf <viraf.b...@gmail.com> wrote:

Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1 bit - i.e. BW). The language is English.

So, roughly the same resolution and format as I used, but only 1/4 the speed. My test machine calls itself a mid-2014 MBP with 2.5 GHz Intel Core i7 (and no, it's not using OpenCL, the GPU, or multiple threads).

I am using Tess4J 3.0, which includes Tesseract 3.04. I am instantiating a new Tesseract object for each page; however, the cost was minimal (74ms) for the total run.

I'm not familiar with the Tess4J wrapper, but that sounds pretty low for initialization cost. Are you sure you're measuring the true cost (i.e., you're not being fooled by lazy initialization)? What happens when you combine all the pages into a single multi-page TIFF and OCR it (so you can be sure you've amortized the initialization cost)?
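One way to look for hidden (lazy) initialization is to time the explicit init call and every page separately, then compare the first page with the rest. The sketch below is generic and does not use any Tess4J or Tesseract API: `init_engine` and `ocr_page` are hypothetical callables standing in for whatever the wrapper provides, and the factor of 2 is an arbitrary threshold.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_pages(init_engine: Callable[[], object],
               ocr_page: Callable[[object, str], object],
               pages: Iterable[str]) -> Tuple[float, List[float]]:
    """Return (seconds spent in init, list of seconds spent per page)."""
    start = time.perf_counter()
    engine = init_engine()  # hypothetical: builds the OCR engine
    init_seconds = time.perf_counter() - start

    per_page = []
    for page in pages:
        start = time.perf_counter()
        ocr_page(engine, page)  # hypothetical: OCRs a single page
        per_page.append(time.perf_counter() - start)
    return init_seconds, per_page


def lazy_init_suspected(per_page: List[float], factor: float = 2.0) -> bool:
    """True when the first page took far longer than the median of the rest."""
    if len(per_page) < 3:
        return False
    rest = sorted(per_page[1:])
    median = rest[len(rest) // 2]
    return per_page[0] > factor * median
```

If the explicit init looks cheap but the first page is much slower than the others, the real initialization is probably happening on first use.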

When you state "taking a big hit on image processing" how would I be able to isolate the issue to image processing?

I was mainly talking about operations like thresholding, format conversion, etc., to get to a usable image. That's obviously not applicable if you're working with bitonal images (which you hadn't disclosed when I wrote my reply).

viraf

Feb 16, 2016, 8:17:53 AM

to tesseract-ocr

Thanks for the clarification. I now know that 24 PPM on a single thread should be achievable. I'll update the post after trying a few options.

Thanks for your help.

- viraf

viraf

Feb 16, 2016, 9:11:06 AM

to tesseract-ocr

I ran a test with a multipage TIFF and am getting the same result of approximately 6 PPM.

I used the following command to create the multipage TIFF:

gs -o multipage-tiffg4.tif -sDEVICE=tiffg4 -r300x300 /media/sf_shared/00473706.PDF

and ran it under Windows and Linux. Here is the Linux output:

Tue Feb 16 08:55:14 EST 2016
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
OSD: Weak margin (4.51) for 95 blob text block, but using orientation anyway: 0
Page 14
Page 15
Page 16
Page 17
Page 18
Page 19
OSD: Weak margin (6.28) for 1715 blob text block, but using orientation anyway: 0
Page 20
OSD: Weak margin (2.15) for 1383 blob text block, but using orientation anyway: 0
Page 21
Page 22
Tue Feb 16 08:59:24 EST 2016

You had mentioned spending time on image processing, so I was wondering what the "OSD: Weak margin" messages mean. The script used to OCR is:

date

tesseract /media/sf_shared/multipage-tiffg4.tif out -l eng hocr

date
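The two `date` lines in the output above bracket the whole run, so the throughput follows directly from them. A small sketch of that calculation, assuming the timestamp layout shown in the log (the time zone token is dropped before parsing because both stamps share it):

```python
from datetime import datetime


def parse_date_line(line: str) -> datetime:
    """Parse a line such as 'Tue Feb 16 08:55:14 EST 2016' (zone ignored)."""
    parts = line.split()
    del parts[4]  # drop the zone name; both stamps are in the same zone
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


# Start and end stamps copied from the Linux output in this post.
start = parse_date_line("Tue Feb 16 08:55:14 EST 2016")
end = parse_date_line("Tue Feb 16 08:59:24 EST 2016")

elapsed = (end - start).total_seconds()
pages = 22  # last "Page N" line in the output
ppm = pages * 60.0 / elapsed

print(f"{elapsed:.0f} s for {pages} pages = {ppm:.2f} PPM")
```

That works out to 250 seconds for 22 pages, a little over 5 PPM, in line with the "approximately 6 PPM" reported.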

Any suggestions on where to investigate next would be appreciated.

Thanks

- viraf

Tom Morris

Feb 16, 2016, 10:31:13 AM

to tesser...@googlegroups.com

My pipeline for this kind of stuff uses:

pdfimages - to extract the images

fax2tiff - to convert CCITT to TIFF (using the parameters file generated by pdfimages)

tiffcp - to concatenate multiple TIFFs together into one big one

but the important thing is the resulting TIFF. You could try running tiffinfo on it to see if anything looks funny. One thing I wonder about is the 300x300 resolution. My images are the standard (for fax), 204x196 pixels/inch, so you've got double the pixels to start. That's likely one factor of 2 right there. Having Ghostscript do a full rendering at that resolution with the necessary image transforms can't be very fast. My pipeline takes 5 seconds for a 110 page document. Also, depending on what your starting resolution is, any image scaling is likely degrading the image quality.
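The "factor of 2" estimate can be sanity-checked with the numbers already given in the thread. This sketch compares the nominal pixel densities (300x300 versus 204x196 pixels/inch) and, separately, the actual page dimensions reported earlier (2509 x 3530 versus 2550x3300); the helper names are made up for illustration.

```python
def pixels(width: int, height: int) -> int:
    """Total pixel count of a page image."""
    return width * height


def density(x_res: int, y_res: int) -> int:
    """Pixels per square inch for a given horizontal/vertical resolution."""
    return x_res * y_res


# Nominal resolution: 300x300 dpi versus standard fax 204x196 pixels/inch.
density_ratio = density(300, 300) / density(204, 196)

# Actual page sizes quoted earlier in the thread.
size_ratio = pixels(2509, 3530) / pixels(2550, 3300)

print(f"density ratio: {density_ratio:.2f}")
print(f"pixel-count ratio of the actual pages: {size_ratio:.2f}")
```

The nominal densities differ by a factor of about 2.25, but the page dimensions quoted so far differ by only about 5%, so the resolution tag alone does not settle how many pixels Tesseract actually has to process.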

It seems unlikely that there have been huge performance changes in the last six months, but you could try building from source to see if it makes a difference. I'm using the latest 3.05 head sources from GitHub.

Tom

p.s. One caveat - I think fax2tiff, as distributed, is broken and I haven't had a chance to contribute my fixes back upstream yet.

viraf

Feb 16, 2016, 11:09:32 AM

to tesseract-ocr

My timings were just for Tesseract to process the image. I tried using standard fax settings, which improved processing time to about 8 PPM. I was using 300 dpi as per recommendations in many forum postings. Enclosed is the tiffinfo output:

TIFF Directory at offset 0x8 (8)
Subfile Type: multi-page document (2 = 0x2)
Image Width: 1728 Image Length: 2292
Resolution: 204, 196 pixels/inch
Bits/Sample: 1
Compression Scheme: CCITT Group 4
Photometric Interpretation: min-is-white
FillOrder: msb-to-lsb
Orientation: row 0 top, col 0 lhs
Samples/Pixel: 1
Rows/Strip: 4969
Planar Configuration: single image plane
Page Number: 0-0
Software: GPL Ghostscript 9.16
DateTime: 2016:02:16 10:43:39
Group 4 Options: (0 = 0x0)

I'll look at building a newer version, but that has its own challenges as it is not a release. Do you have any other suggestions for me to consider? Do you know if there are sample images that were used for testing, for which we have some speed metrics? This would help me isolate the problem to the images or to my build.

viraf

Feb 16, 2016, 11:56:15 AM

to tesseract-ocr

Tom, on the item of fonts, eng.inttemp is a binary file in 3.04. I did not see a command to extract its contents. Do you have suggestions on how to review this file? Thanks - viraf

Tom Morris

Feb 16, 2016, 1:13:01 PM

to tesser...@googlegroups.com

Actually, I think the resolution specified in my TIFFs is a red herring and wrong, because the image sizes are the same as your originals. I'm not aware of any standard images and test timings. There are two test images in the source repo, but they're too small to be useful for any type of performance work.

For the record, here's what my TIFF images look like:

TIFF Directory at offset 0xabd56a (11261290)
Image Width: 3400 Image Length: 4401
Resolution: 204, 196 pixels/inch
Bits/Sample: 1
Compression Scheme: CCITT Group 3
Photometric Interpretation: min-is-white
FillOrder: lsb-to-msb
Orientation: row 0 top, col 0 lhs
Samples/Pixel: 1
Rows/Strip: (infinite)
Planar Configuration: single image plane
Page Number: 1-0
Software: fax2tiff
Group 3 Options: (0 = 0x0)
Fax Data: clean (0 = 0x0)
Bad Fax Lines: 0
Consecutive Bad Fax Lines: 0

I don't think there's any significant difference in the images. Just for grins I reinstalled the 3.04.00 MacPorts version of tesseract and it took 3min21sec for the same file that takes 4min05sec with the current development build, so it doesn't look like there have been any recent performance improvements and perhaps even the opposite (hmmm).

I think I've exhausted my easy suggestions for remote control (free) performance analysis, but I'm interested in hearing what, if anything, you find out.

Tom

Tom Morris

Feb 16, 2016, 1:13:54 PM

to tesser...@googlegroups.com

On Tue, Feb 16, 2016 at 11:56 AM, viraf <viraf.b...@gmail.com> wrote:

Tom, on the item of fonts, eng.inttemp is a binary file in 3.04. I did not see a command to extract its contents. Do you have suggestions on how to review this file? Thanks - viraf

You can use the strings command or just open it in emacs and search for Courier or some other well-known font. All the fonts are listed together and will be obvious when you see them.
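Where the `strings` command is not available (for example on Windows), the same idea can be approximated in a few lines. This is a rough sketch that pulls runs of printable ASCII out of a binary file and filters them by a search term; it assumes nothing about the internal layout of eng.inttemp, and the path in the usage note is only an example.

```python
import re
from typing import List


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least `min_len` printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


def find_in_binary(path: str, needle: str = "Courier") -> List[str]:
    """List the printable strings in `path` that contain `needle`."""
    with open(path, "rb") as handle:
        data = handle.read()
    wanted = needle.lower()
    return [s for s in printable_strings(data) if wanted in s.lower()]
```

Calling `find_in_binary("eng.inttemp")` on the unpacked file should then show the strings that mention Courier, and the neighbouring entries from `printable_strings` can be inspected from there.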

Tom

viraf

Feb 16, 2016, 3:40:44 PM

to tesseract-ocr

Thanks - I appreciate your help. I ran the perf tool and noticed that 40% of the time is spent in IntegerMatcher::UpdateTablesForFeatures.

Can you check whether you get the same results on a non-Mac machine? Someone suggested that the Mac may automatically use the co-processor.

Thanks

- viraf

Tom Morris

Feb 18, 2016, 12:18:37 PM

to tesseract-ocr

On Tuesday, February 16, 2016 at 3:40:44 PM UTC-5, viraf wrote:

I ran the perf tool and noticed that 40% of the time is spent in IntegerMatcher::UpdateTablesForFeatures.

http://lmgtfy.com/?q=IntegerMatcher%3A%3AUpdateTablesForFeatures

https://groups.google.com/forum/#!topic/tesseract-dev/zR2Lv0_LF68

Quan Nguyen

Feb 18, 2016, 9:58:12 PM

to tesseract-ocr

If you can avoid initializing and disposing of Tesseract native instances on every run, you can achieve a significant performance increase.

https://sourceforge.net/p/tess4j/discussion/1202294/thread/d32bd579/

viraf

Feb 19, 2016, 7:50:09 AM

to tesseract-ocr

Thanks - I will investigate further. The initial test that I ran based on Tom's input showed about the same performance (I used a multi-page TIFF); however, the article you referenced indicated a speedup factor of 2x.

Is there a way to have Tesseract process the pages in parallel?
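Nobody in the thread answers this question directly. One common workaround, sketched here as an assumption rather than something the participants recommend, is to leave each Tesseract process single-threaded and run several processes at once, one per page image. The command has the same `tesseract <image> <outputbase> -l eng` shape used earlier in the thread; the worker count, output directory, and function names are illustrative.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Sequence


def build_command(image: str, out_dir: str, lang: str = "eng") -> List[str]:
    """Build one tesseract CLI call: tesseract <image> <outputbase> -l <lang>."""
    out_base = str(Path(out_dir) / Path(image).stem)
    return ["tesseract", image, out_base, "-l", lang]


def ocr_pages(images: Sequence[str], out_dir: str, workers: int = 4,
              run: Callable = subprocess.run) -> List[List[str]]:
    """Run one tesseract process per image, `workers` at a time.

    `out_dir` must already exist. Threads are enough here because each
    one only waits for its child process to finish.
    """
    commands = [build_command(image, out_dir) for image in images]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any failure.
        list(pool.map(lambda cmd: run(cmd, check=True), commands))
    return commands
```

For example, `ocr_pages(["page0001.tif", "page0002.tif"], "out", workers=2)` would start two independent processes. Each process pays its own initialization cost, so this trades some per-page overhead for better use of multiple cores.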

viraf

Feb 19, 2016, 11:36:16 AM

to tesseract-ocr

I created a large (1800 page) multi-page TIFF and am feeding it to Tesseract via the command line (on Ubuntu). This way I am testing Tesseract's own performance. I am still getting about 5-6 PPM. I will run the test on another machine to see if the performance is the same. Is this the performance that you are seeing for similar pages (details in the thread above)? This is about 25% of the performance of a commercial engine that I am evaluating (it gets about 24 PPM with 2 cores on my laptop), and its accuracy is significantly better.

- viraf

Tom Morris

Feb 19, 2016, 12:45:04 PM

to tesser...@googlegroups.com

On Fri, Feb 19, 2016 at 11:36 AM, viraf <viraf.b...@gmail.com> wrote:

I created a large (1800 page) multi-page tiff and am feeding it to Tesseract via command line (on Ubuntu). This way I am testing Tesseract performance.

Is that representative of the documents that you work with? The multi-page TIFF buffering in Tesseract is messed up. I just created this issue to describe the problem: https://github.com/tesseract-ocr/tesseract/issues/233

This is about 25% of the performance of a commercial engine that I am evaluating (it gets about 24 PPM with 2 cores on my laptop),

What's the price/performance ratio? :-)

Tom

viraf

Feb 19, 2016, 3:00:42 PM

to tesseract-ocr

Tom, I created a multi-page TIFF as per the earlier recommendation on this thread (to avoid multiple inits). Running it on Linux from the command line gave me a reference PPM that I could target with Tess4J. I had hoped to get 10+ PPM / core and shift focus to accuracy. I am at about 6 PPM and unclear where / how to improve performance (speed).

- viraf

Tom Morris

Feb 20, 2016, 11:55:43 AM

to tesseract-ocr

On Friday, February 19, 2016 at 3:00:42 PM UTC-5, viraf wrote:

Tom, I created a multi-page TIFF as per the earlier recommendation on this thread (to avoid multiple inits). Running it on Linux from the command line gave me a reference PPM that I could target with Tess4J. I had hoped to get 10+ PPM / core and shift focus to accuracy. I am at about 6 PPM and unclear where / how to improve performance (speed).

I take it the question about the representativeness of that size file was too sensitive/boring/trivial/... to answer.

Given the issues with multi-page TIFFs, one experiment worth running is to try a list of single page TIFFs instead of one ridiculously large file.

$ cat > filelist.txt

page0001.tif

page0002.tif

...

page1800.tif

$ tesseract filelist.txt
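Typing 1800 file names by hand is not practical, so the list file would normally be generated. A minimal sketch, assuming the pages follow the page0001.tif ... page1800.tif naming shown above (the template and function names are illustrative):

```python
from typing import List


def page_names(count: int, template: str = "page{:04d}.tif") -> List[str]:
    """Return page file names numbered from 1 to `count`."""
    return [template.format(number) for number in range(1, count + 1)]


def write_filelist(path: str, names: List[str]) -> None:
    """Write one file name per line, the layout shown in the example above."""
    with open(path, "w") as handle:
        handle.write("\n".join(names) + "\n")
```

Calling `write_filelist("filelist.txt", page_names(1800))` produces the list that is then passed to tesseract in place of a single image.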

Tom

viraf

Feb 21, 2016, 11:15:52 AM

to tesseract-ocr

1800 pages is on the larger side. Files can range from a few pages to > 1800 pages. Initial tests were done with a document of 22 pages. I ran the test you outlined on a 372-page file on a Linux guest VM using Tesseract 3.04, and the results were disappointing (approx. 3 PPM). I then ran my initial test application with Tess4J on the same 372 pages, and the results were approximately 9 PPM. The init does not appear to be as expensive as thought:

Pages	Time (ms)	PPM
372	2395903	9.315903
372	2293524	9.731749

The first run was with instantiating a new engine for each page and calling init/setTessVariables and disposing at the end. The second run was with allocation, init/setTessVariables and disposing moved out of the loop. I am calling ProcessPage specifying a text renderer (earlier test generated hocr and pdf file).
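The two timings can be turned into a per-page initialization cost with simple arithmetic. This sketch uses only the figures reported above (372 pages, 2395903 ms with a new engine per page, 2293524 ms with a single reused engine); the variable names are illustrative.

```python
PAGES = 372
PER_PAGE_INIT_MS = 2395903   # run 1: new engine for every page
SHARED_ENGINE_MS = 2293524   # run 2: init and dispose moved out of the loop


def ppm(pages: int, elapsed_ms: float) -> float:
    """Pages per minute for a run measured in milliseconds."""
    return pages * 60000.0 / elapsed_ms


saved_ms = PER_PAGE_INIT_MS - SHARED_ENGINE_MS
saved_per_page_ms = saved_ms / PAGES
saved_fraction = saved_ms / PER_PAGE_INIT_MS

print(f"run 1: {ppm(PAGES, PER_PAGE_INIT_MS):.2f} PPM")
print(f"run 2: {ppm(PAGES, SHARED_ENGINE_MS):.2f} PPM")
print(f"saved {saved_per_page_ms:.0f} ms per page ({saved_fraction:.1%} of run 1)")
```

Reusing the engine saves roughly 275 ms per page, about 4% of the total, which supports the observation that init is not the main cost here.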

So, I will deploy this code on the Linux guest VM and see if I get similar results. The speed difference could be related to Tesseract build options between Windows and Linux.

- viraf

Mike Lissner

Aug 20, 2016, 6:31:41 AM

to tesseract-ocr

Viraf, I'm bringing this thread back from the dead, but did you ever figure out how to squeeze out more performance from Tesseract?

Tomy Chacko

Jan 16, 2017, 12:32:52 AM

to tesseract-ocr

Hi All,

I have been following this thread with regard to the performance of Tesseract. We are processing large PDFs (hundreds of pages, with each page converted to BMP), and the pages are sent to Tesseract for processing one by one.

I am interested only in identifying the orientation of the text in the image and rotating the image based on the orientation identified.

I can see that each image takes nearly 3 seconds on average, so a hundred-page PDF takes around 275 - 300 seconds. Isn't this a bit too high?

I am currently using the .NET Tesseract wrapper 3.0.2. Is there a newer release available, and would it improve performance?

Again, my whole Tesseract functionality is implemented in a .NET assembly (DLL), which is then called from our Delphi client.

I understand that the Tesseract init process is a bit costly, but I am wondering how to init only once in the .NET assembly (DLL) and reuse it for all pages of the PDF, so that I save time when sending subsequent pages from Delphi to the .NET assembly for processing.

Ta

Tomy
