Tesseract performance (speed and accuracy)

viraf

Feb 14, 2016, 11:15:12 AM
to tesseract-ocr
I am new to Tesseract and am using it through Tess4J. I am trying to OCR faxes whose pages are TIFF (CCITT T.6) images: 2509 x 3530 @ 300 dpi, 1 bit (i.e. black and white).

I have two sets of questions.

Speed
On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 PPM using 1 thread. I am looking for suggestions on how to speed up page processing. I use parallelStream to process each page in a separate thread.

Training
I am trying to learn about training Tesseract for improved accuracy. Given that the fonts / box files used to generate eng.traineddata are not available, can anyone specify the fonts used for English?
Also, is there a description of the various training artifacts? I used "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to extract the word list, but I am looking for documentation to better understand the various training artifacts.

Thanks

- viraf

Tom Morris

Feb 15, 2016, 1:22:57 PM
to tesseract-ocr


On Sunday, February 14, 2016 at 11:15:12 AM UTC-5, viraf wrote:

Speed
On an Intel i7-4800MQ @ 2.7GHz I am getting approximately 6 PPM using 1 thread. I am looking for suggestions on how to speed up page processing. I use parallelStream to process each page in a separate thread.

You don't say what resolution or image format, what language(s), or what version of Tesseract you're using, all of which are pretty critical when discussing performance. Having said that, I just ran a 110 page document in 272 seconds on a recent MacBook Pro. There were ~100 pages of mixed density text totalling 160k characters in CCITT G4 fax bitonal images of 2550x3300 pixels.

That's four times the speed you quote, so I suspect you're reinitializing Tesseract for every page or taking a big hit on image processing or something else unrelated to the core OCR engine.
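
The "four times" figure follows directly from the numbers quoted above. A minimal sketch of the arithmetic (the function name `pages_per_minute` is only illustrative):

```python
def pages_per_minute(pages: int, seconds: float) -> float:
    """Convert a page count and a wall-clock time into pages per minute."""
    return pages * 60.0 / seconds


# 110 pages in 272 seconds, as reported above.
reference_ppm = pages_per_minute(110, 272)
# Ratio against the roughly 6 PPM reported in the question.
speedup = reference_ppm / 6.0
print(f"{reference_ppm:.1f} PPM, about {speedup:.1f}x the reported 6 PPM")
```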
 

Training
I am trying to learn about training Tesseract for improved accuracy. Given that the fonts / box files used to generate eng.traineddata are not available, can anyone specify the fonts used for English?

The font list is included in the eng.inttemp file that you extracted. Given that it's something like 350 fonts, you'd have to be looking at a pretty exotic font to need to retrain for that reason.
 
Also, is there a description of the various training artifacts? I used "combine_tessdata -u" to unpack eng.traineddata and "dawg2wordlist" to extract the word list, but I am looking for documentation to better understand the various training artifacts.

Have you reviewed the training documentation on the wiki?


Tom
 

viraf

Feb 15, 2016, 8:24:48 PM
to tesseract-ocr
Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1 bit - i.e. BW). The language is English. I am using Tess4J 3.0, which includes Tesseract 3.04. I am instantiating a new Tesseract object for each page; however, the cost was minimal (74 ms) for the total run. I'll investigate further whether the Java APIs are calling init elsewhere.

When you state "taking a big hit on image processing" how would I be able to isolate the issue to image processing?  

Thanks for your help.  

- viraf

viraf

Feb 15, 2016, 9:23:09 PM
to tesseract-ocr
I also wanted to confirm that your 24 PPM was obtained on a single thread and did not leverage the GPU. Thanks - viraf



Tom Morris

Feb 16, 2016, 1:53:40 AM
to tesser...@googlegroups.com
On Mon, Feb 15, 2016 at 8:24 PM, viraf <viraf.b...@gmail.com> wrote:
Tom, the images are TIFF (CCITT T.6) images - 2509 x 3530 @ 300 dpi (1 bit - i.e. BW). The language is English.

So, roughly the same resolution and format as I used, but only 1/4 the speed. My test machine calls itself a mid-2014 MBP with 2.5 GHz Intel Core i7 (and no, it's not using OpenCL, the GPU, or multiple threads).
 
I am using Tess4J 3.0, which includes Tesseract 3.04. I am instantiating a new Tesseract object for each page; however, the cost was minimal (74 ms) for the total run.

I'm not familiar with the Tess4J wrapper, but that sounds pretty low for initialization cost. Are you sure you're measuring the true cost (i.e. you're not being fooled by lazy initialization)? What happens when you combine all the pages into a single multi-page TIFF and OCR it (so you can be sure you've amortized the initialization cost)?
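
One way to check for lazy initialization is to time engine construction, the first page, and the remaining pages separately. The sketch below is generic and not tied to Tess4J: `make_engine` and `ocr_page` are hypothetical callables standing in for whatever the wrapper actually provides.

```python
import time
from typing import Any, Callable, Dict, Sequence


def time_init_and_pages(
    make_engine: Callable[[], Any],
    ocr_page: Callable[[Any, Any], Any],
    pages: Sequence[Any],
) -> Dict[str, float]:
    """Time engine construction, the first page, and the later pages separately.

    If initialization is lazy, its cost shows up in the first page rather
    than in the constructor, so the first page is reported on its own.
    """
    start = time.perf_counter()
    engine = make_engine()
    construct_s = time.perf_counter() - start

    page_times = []
    for page in pages:
        start = time.perf_counter()
        ocr_page(engine, page)
        page_times.append(time.perf_counter() - start)

    later = page_times[1:]
    return {
        "construct_s": construct_s,
        "first_page_s": page_times[0] if page_times else 0.0,
        "later_pages_avg_s": sum(later) / len(later) if later else 0.0,
    }
```

If the first page is much slower than the average of the later pages, the initialization cost is probably being paid there and not in the constructor.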

When you state "taking a big hit on image processing" how would I be able to isolate the issue to image processing?  

I was mainly talking about operations like thresholding, format conversion, etc. to get to a usable image. That's obviously not applicable if you're working with bitonal images (which you hadn't disclosed when I wrote my reply).

viraf

Feb 16, 2016, 8:17:53 AM
to tesseract-ocr
Thanks for the clarification.  I now know that 24 PPM on a single thread should be achievable.  I'll update the post after trying a few options.  
Thanks for your help.

- viraf

viraf

Feb 16, 2016, 9:11:06 AM
to tesseract-ocr
I ran a test with a multi-page TIFF and am getting the same result of approximately 6 PPM.
I used the following command to create the multi-page TIFF:
  gs -o multipage-tiffg4.tif -sDEVICE=tiffg4 -r300x300 /media/sf_shared/00473706.PDF

and ran it under Windows and Linux.  Here is the Linux output:

Tue Feb 16 08:55:14 EST 2016
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
OSD: Weak margin (4.51) for 95 blob text block, but using orientation anyway: 0
Page 14
Page 15
Page 16
Page 17
Page 18
Page 19
OSD: Weak margin (6.28) for 1715 blob text block, but using orientation anyway: 0
Page 20
OSD: Weak margin (2.15) for 1383 blob text block, but using orientation anyway: 0
Page 21
Page 22
Tue Feb 16 08:59:24 EST 2016

You had mentioned spending time on image processing, so was wondering what the "OSD Weak Margin" messages mean.  The script used to OCR is

date
tesseract /media/sf_shared/multipage-tiffg4.tif out -l eng hocr
date
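
The two `date` lines in the log bracket the whole run, so the throughput can be derived from them. A small sketch of that calculation, using the timestamps and page count from the output above:

```python
from datetime import datetime

# Wall-clock times printed by the two `date` calls in the log above.
start = datetime.strptime("08:55:14", "%H:%M:%S")
end = datetime.strptime("08:59:24", "%H:%M:%S")
pages = 22  # last "Page N" line in the log

elapsed_s = (end - start).total_seconds()
ppm = pages * 60.0 / elapsed_s
print(f"{elapsed_s:.0f} s for {pages} pages = {ppm:.2f} PPM")
```

That works out to 250 seconds for 22 pages, a little over 5 PPM, which matches the roughly 6 PPM seen through Tess4J.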

Any suggestions on where to investigate next would be appreciated.

Thanks

- viraf

Tom Morris

Feb 16, 2016, 10:31:13 AM
to tesser...@googlegroups.com
My pipeline for this kind of stuff uses:

    pdfimages - to extract the images
    fax2tiff - to convert CCITT to TIFF (using the parameters file generated by pdfimages)
    tiffcp - to concatenate multiple TIFFs together into one big one

but the important thing is the resulting TIFF. You could try running tiffinfo on it to see if anything looks funny.  One thing I wonder about is the 300x300 resolution.  My images are the standard (for fax), 204x196 pixels/inch, so you've got double the pixels to start.  That's likely one factor of 2 right there. Having Ghostscript do a full rendering at that resolution with the necessary image transforms can't be very fast. My pipeline takes 5 seconds for a 110 page document. Also, depending on what your starting resolution is, any image scaling is likely degrading the image quality.
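
The "double the pixels" estimate can be checked from the two resolutions alone. A quick sketch of the comparison (the helper name is illustrative):

```python
def pixels_per_square_inch(x_dpi: int, y_dpi: int) -> int:
    """Pixel density for a given horizontal and vertical resolution."""
    return x_dpi * y_dpi


rendered = pixels_per_square_inch(300, 300)  # Ghostscript output at -r300x300
fax = pixels_per_square_inch(204, 196)       # standard fax resolution quoted above
ratio = rendered / fax
print(f"300x300 carries about {ratio:.2f}x the pixels of 204x196")
```

This only compares pixel density for the same physical page size. It says nothing about how OCR time scales with pixel count, which would need to be measured.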

It seems unlikely that there have been huge performance changes in the last six months, but you could try building from source to see if it makes a difference. I'm using the latest 3.05 head sources from GitHub.

Tom

p.s. One caveat - I think fax2tiff, as distributed, is broken and I haven't had a chance to contribute my fixes back upstream yet.

viraf

Feb 16, 2016, 11:09:32 AM
to tesseract-ocr
My timings were just for Tesseract to process the image. I tried using standard fax settings, which improved processing time to about 8 PPM. I was using 300 dpi as per recommendations in many forum postings. Enclosed is the tiffinfo output:

TIFF Directory at offset 0x8 (8)
  Subfile Type: multi-page document (2 = 0x2)
  Image Width: 1728 Image Length: 2292
  Resolution: 204, 196 pixels/inch
  Bits/Sample: 1
  Compression Scheme: CCITT Group 4
  Photometric Interpretation: min-is-white
  FillOrder: msb-to-lsb
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: 4969
  Planar Configuration: single image plane
  Page Number: 0-0
  Software: GPL Ghostscript 9.16
  DateTime: 2016:02:16 10:43:39
  Group 4 Options: (0 = 0x0)

I'll look at building from the latest source, but that has its own challenges as it is not a release. Do you have any other suggestions for me to consider? Do you know if there are sample images that were used for testing, for which we have some metrics on speed? This would help me isolate the problem to the images or to my build.

viraf

Feb 16, 2016, 11:56:15 AM
to tesseract-ocr
Tom, on the item of fonts, eng.inttemp is a binary file in 3.04. I did not see a command to extract its contents. Do you have suggestions on how to review this file? Thanks - viraf



Tom Morris

Feb 16, 2016, 1:13:01 PM
to tesser...@googlegroups.com
Actually, I think the resolution specified in my TIFFs is a red herring and wrong, because the image sizes are the same as your originals. I'm not aware of any standard images and test timings.  There are two test images in the source repo, but they're too small to be useful for any type of performance work.

For the record, here's what my TIFF images look like:

TIFF Directory at offset 0xabd56a (11261290)
  Image Width: 3400 Image Length: 4401
  Resolution: 204, 196 pixels/inch
  Bits/Sample: 1
  Compression Scheme: CCITT Group 3
  Photometric Interpretation: min-is-white
  FillOrder: lsb-to-msb
  Orientation: row 0 top, col 0 lhs
  Samples/Pixel: 1
  Rows/Strip: (infinite)
  Planar Configuration: single image plane
  Page Number: 1-0
  Software: fax2tiff
  Group 3 Options: (0 = 0x0)
  Fax Data: clean (0 = 0x0)
  Bad Fax Lines: 0
  Consecutive Bad Fax Lines: 0

I don't think there's any significant difference in the images. Just for grins I reinstalled the 3.04.00 MacPorts version of tesseract and it took 3min21sec for the same file that takes 4min05sec with the current development build, so it doesn't look like there have been any recent performance improvements and perhaps even the opposite (hmmm).
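
To put a number on "perhaps even the opposite", the two timings can be converted to seconds and compared. A minimal sketch, where `parse_duration` is an illustrative helper for the "3min21sec" notation used above:

```python
import re


def parse_duration(text: str) -> int:
    """Parse a string like '3min21sec' into a number of seconds."""
    match = re.fullmatch(r"(\d+)min(\d+)sec", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


release_s = parse_duration("3min21sec")  # 3.04.00 MacPorts build
dev_s = parse_duration("4min05sec")      # current development build
slowdown_pct = (dev_s - release_s) / release_s * 100.0
print(f"{release_s} s vs {dev_s} s: development build is {slowdown_pct:.1f}% slower")
```

A single pair of runs is not a benchmark, so the percentage should be read as a rough indication only.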

I think I've exhausted my easy suggestions for remote control (free) performance analysis, but I'm interested in hearing what, if anything, you find out.

Tom

Tom Morris

Feb 16, 2016, 1:13:54 PM
to tesser...@googlegroups.com
On Tue, Feb 16, 2016 at 11:56 AM, viraf <viraf.b...@gmail.com> wrote:
Tom, on the item of fonts, eng.inttemp is a binary file in 3.04. I did not see a command to extract its contents. Do you have suggestions on how to review this file? Thanks - viraf

You can use the strings command or just open it in emacs and search for Courier or some other well-known font.  All the fonts are listed together and will be obvious when you see them.
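
Where the `strings` command is not available, the same idea can be approximated in a few lines. This is a rough sketch, not a replacement for `strings`: it pulls runs of printable ASCII out of a binary file and filters them by a search term. The minimum run length of 4 is an assumption, and how font names are actually laid out inside eng.inttemp is not described in this thread.

```python
import re
from typing import List


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII at least `min_len` bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


def strings_containing(path: str, needle: str) -> List[str]:
    """Read a binary file and return its printable runs that mention `needle`."""
    with open(path, "rb") as handle:
        data = handle.read()
    return [s for s in printable_strings(data) if needle.lower() in s.lower()]
```

For example, `strings_containing("eng.inttemp", "Courier")` would list any printable runs in the unpacked file that mention Courier.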

Tom 

viraf

Feb 16, 2016, 3:40:44 PM
to tesseract-ocr
Thanks - I appreciate your help.  I ran perf tool and noticed that 40% of the time is spent in IntegerMatcher::UpdateTablesForFeatures.  

Can you try to see if you get the same results on a non-Mac machine? Someone suggested that the Mac may automatically use the co-processor.

Thanks

- viraf

Tom Morris

Feb 18, 2016, 12:18:37 PM
to tesseract-ocr
On Tuesday, February 16, 2016 at 3:40:44 PM UTC-5, viraf wrote:
I ran perf tool and noticed that 40% of the time is spent in IntegerMatcher::UpdateTablesForFeatures.  

Quan Nguyen

Feb 18, 2016, 9:58:12 PM
to tesseract-ocr
If you can reduce or minimize initializing and disposing of Tesseract native instances for every run, you can achieve a significant performance increase.




viraf

Feb 19, 2016, 7:50:09 AM
to tesseract-ocr
Thanks - I will investigate further. The initial test that I ran based on Tom's input showed around the same performance (I used a multi-page TIFF); however, the article you referenced indicated a speedup factor of 2x.

Is there a way to have Tesseract process the pages in parallel?
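
The thread does not answer this question directly. One common workaround, sketched here as an assumption and not as something confirmed above, is to run several independent `tesseract` processes at once, one per single-page image. The command follows the `tesseract <image> <outputbase> -l eng` form used earlier in the thread; the worker count of 4 and the injectable `run` parameter are illustrative choices.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Sequence


def build_command(page: str) -> List[str]:
    """Build a tesseract command that writes output next to the input image."""
    out_base = str(Path(page).with_suffix(""))
    return ["tesseract", page, out_base, "-l", "eng"]


def ocr_pages_in_parallel(
    pages: Sequence[str],
    workers: int = 4,
    run: Callable = subprocess.run,
) -> list:
    """Run one tesseract process per page, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads only wait on child processes; the OCR runs in the children.
        return list(pool.map(lambda p: run(build_command(p), check=True), pages))
```

Each process pays its own initialization cost, so whether this helps depends on how that cost compares with the per-page OCR time.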

viraf

Feb 19, 2016, 11:36:16 AM
to tesseract-ocr
I created a large (1800 page) multi-page TIFF and am feeding it to Tesseract via the command line (on Ubuntu). This way I am testing Tesseract performance. I am still getting about 5-6 PPM. I will run the test on another machine to see if the performance is the same. Is this the performance that you are seeing for similar pages (details in thread above)? This is about 25% of the performance of a commercial engine that I am evaluating (it gets about 24 PPM with 2 cores on my laptop), and its accuracy is significantly better.

- viraf

Tom Morris

Feb 19, 2016, 12:45:04 PM
to tesser...@googlegroups.com
On Fri, Feb 19, 2016 at 11:36 AM, viraf <viraf.b...@gmail.com> wrote:
I created a large (1800 page) multi-page TIFF and am feeding it to Tesseract via the command line (on Ubuntu). This way I am testing Tesseract performance.

Is that representative of the documents that you work with? The multi-page TIFF buffering in Tesseract is messed up. I just created this issue to describe the problem: https://github.com/tesseract-ocr/tesseract/issues/233
 
This is about 25% of the performance of a commercial engine that I am evaluating (it gets about 24 PPM with 2 cores on my laptop),

What's the price/performance ratio? :-)

Tom 

viraf

Feb 19, 2016, 3:00:42 PM
to tesseract-ocr
Tom, I created a multi-page TIFF as per the earlier recommendation on this thread (avoid multiple inits). Running it on Linux from the command line gave me a reference PPM that I could target with Tess4J. I had hoped to get 10+ PPM / core and shift focus to accuracy. I am at about 6 PPM and unclear where / how to improve performance (speed).

- viraf

Tom Morris

Feb 20, 2016, 11:55:43 AM
to tesseract-ocr
On Friday, February 19, 2016 at 3:00:42 PM UTC-5, viraf wrote:
Tom, I created a multi-page TIFF as per the earlier recommendation on this thread (avoid multiple inits). Running it on Linux from the command line gave me a reference PPM that I could target with Tess4J. I had hoped to get 10+ PPM / core and shift focus to accuracy. I am at about 6 PPM and unclear where / how to improve performance (speed).

I take it the question about the representativeness of that size file was too sensitive/boring/trivial/... to answer. 

Given the issues with multi-page TIFFs, one experiment worth running is to try a list of single page TIFFs instead of one ridiculously large file.

$ cat > filelist.txt
page0001.tif
page0002.tif
...
page1800.tif

$ tesseract filelist.txt out
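
Typing 1800 file names by hand is impractical, so the list would normally be generated. A minimal sketch, assuming the pages sit in one directory and follow the zero-padded `pageNNNN.tif` naming shown above (the function name and the glob pattern are illustrative):

```python
from pathlib import Path
from typing import Union


def write_file_list(
    image_dir: Union[str, Path],
    list_path: Union[str, Path],
    pattern: str = "page*.tif",
) -> int:
    """Write one image path per line, sorted by name, and return the page count."""
    pages = sorted(Path(image_dir).glob(pattern))
    with open(list_path, "w") as handle:
        for page in pages:
            handle.write(f"{page}\n")
    return len(pages)
```

Sorting by name gives page order only because the numbers are zero-padded; names like `page2.tif` and `page10.tif` would need a numeric sort instead.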

Tom

viraf

Feb 21, 2016, 11:15:52 AM
to tesseract-ocr
1800 pages is on the larger side. Files can range from a few pages to > 1800 pages. Initial tests were done with a document of 22 pages. I ran the file list test you outlined on a 372 page file on a Linux guest VM using Tesseract 3.04 and the results were disappointing (approx 3 PPM). I then ran my initial test application with Tess4J on the 372 pages and the results were approximately 9 PPM. The init does not appear to be as expensive as thought:


Pages    Time (ms)    PPM
372      2395903      9.315903
372      2293524      9.731749

The first run was with instantiating a new engine for each page and calling init/setTessVariables and disposing at the end.  The second run was with allocation, init/setTessVariables and disposing moved out of the loop.  I am calling ProcessPage specifying a text renderer (earlier test generated hocr and pdf file).
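
The two timings above also give a rough estimate of what per-page initialization costs. A short sketch of that arithmetic, using only the figures reported in this post:

```python
PAGES = 372
PER_PAGE_INIT_MS = 2395903   # first run: engine created and disposed for every page
SINGLE_INIT_MS = 2293524     # second run: engine created once, outside the loop


def ppm(pages: int, total_ms: float) -> float:
    """Pages per minute from a page count and a total time in milliseconds."""
    return pages * 60000.0 / total_ms


saved_ms_per_page = (PER_PAGE_INIT_MS - SINGLE_INIT_MS) / PAGES
saved_pct = (PER_PAGE_INIT_MS - SINGLE_INIT_MS) / PER_PAGE_INIT_MS * 100.0
print(f"{ppm(PAGES, PER_PAGE_INIT_MS):.2f} PPM vs {ppm(PAGES, SINGLE_INIT_MS):.2f} PPM")
print(f"about {saved_ms_per_page:.0f} ms saved per page ({saved_pct:.1f}% of the run)")
```

On these numbers, moving initialization out of the loop saves roughly a quarter of a second per page, a small share of the total run time.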

So, I will deploy this code on the Linux guest VM and see if I get similar results. The speed difference could be related to Tesseract build options between Windows and Linux.

- viraf

Mike Lissner

Aug 20, 2016, 6:31:41 AM
to tesseract-ocr
Viraf, I'm bringing this thread back from the dead, but did you ever figure out how to squeeze out more performance from Tesseract?

Tomy Chacko

Jan 16, 2017, 12:32:52 AM
to tesseract-ocr
Hi All,

    I am following this thread with regard to the performance of Tesseract. We are processing large PDFs (100s of pages, with each page converted to BMP) and sending the pages to Tesseract one by one.
    I am only interested in identifying the orientation of the text in each image and rotating the image based on the orientation identified.

    Each image takes nearly 3 seconds on average, so a hundred page PDF takes around 275 - 300 seconds. Isn't this a bit too high?

    I am using the .NET Tesseract wrapper 3.0.2 now. Is a newer release available, and will it improve performance?

    Also, my whole Tesseract functionality is implemented in a .NET assembly (DLL), which is then called from our Delphi client.

    I understand that the Tesseract init process is a bit costly, but I am wondering how to init only once in the .NET assembly (DLL) and reuse it for all pages of the PDF, so that I can save time when sending
    subsequent pages from Delphi for processing by the .NET assembly.
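
The usual pattern for "init once, reuse for every call" is a lazily created, cached engine instance. The sketch below shows the idea in Python; it is not the .NET wrapper's API. `EngineHolder` and the `factory` callable are hypothetical names, and the same structure would need to be expressed with the wrapper's own engine type.

```python
import threading
from typing import Any, Callable, Optional


class EngineHolder:
    """Create an expensive engine on first use and hand back the same one afterwards."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._engine: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        # The lock keeps two callers from initializing the engine at the same time.
        with self._lock:
            if self._engine is None:
                self._engine = self._factory()
            return self._engine
```

Each per-page entry point would then call `holder.get()` and never construct its own engine, so the initialization cost is paid on the first page only. Whether a single engine instance may be shared safely across threads depends on the wrapper and is not covered here.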

Ta
Tomy