improving performance (and speed) of tesseract

Cameron Fen

Jan 25, 2017, 2:50:39 PM
to tesseract-ocr
Hi,

 I'm using tesseract to OCR patent data (historical, 18th-20th century).  First off, I'm using Python and pytesseract, and it is considerably slower than the benchmarks posted elsewhere on this site.  My guess is that pytesseract is limited for what we want to do and starts a new instance of tesseract for every TIFF file it sees (each patent page is its own TIFF file).  Does anyone have recommendations for more powerful Python wrappers (or should I call it from the command line)?  I would rather not call tesseract from another language, but I am willing to.

More importantly, though, we are having accuracy issues with tesseract.  Attached is the OCR file containing only the text that tesseract tried to capture.  I removed pages with only pictures (00000001.txt or whatever).  Also attached are the images for ...02.txt and ...03.txt.  I'm basically calling tesseract without any modifiers.  Which options should I be passing to tesseract?  Also, does anyone know of a more powerful wrapper than pytesseract, preferably in Python, that can use all of these options?  Thanks,

Cameron 
OCR-.txt
00000002.tif
00000003.tif

Art Rhyno.

Jan 27, 2017, 12:53:13 PM
to tesser...@googlegroups.com

I think pytesseract is a wrapper around the command-line program, so you would probably see the same results by calling it directly. The python-tesseract [1] project used SWIG for a deeper level of integration, though I tried the same approach a few years ago and didn’t really notice much difference in throughput. It’s possible you could use a segmentation tool like Olena [2] to carve up the image into individual paragraphs and fire off multiple instances of tesseract (see the discussion here [3]). Olena might also give you a way to extract illustrations and such.
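
For reference, going to the command line directly from Python could look roughly like the sketch below. It assumes a `tesseract` binary is on the PATH and uses only the basic `tesseract imagename outputbase -l lang` form; `build_command` and `ocr_file` are hypothetical helper names, not part of pytesseract or tesseract.

```python
import subprocess
from pathlib import Path


def build_command(image, out_base, lang="eng"):
    """Return the argument list for one tesseract run.

    tesseract writes its result to out_base + ".txt".
    """
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_file(image, out_dir, lang="eng"):
    """OCR one image with the tesseract CLI and return the recognized text."""
    image = Path(image)
    out_base = Path(out_dir) / image.stem  # tesseract appends ".txt" itself
    subprocess.run(build_command(image, out_base, lang), check=True)
    return Path(str(out_base) + ".txt").read_text(encoding="utf-8")
```

This still starts one tesseract process per page, so I would expect throughput to be about the same as with pytesseract.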

 

As for accuracy, I tried a Gaussian blur via OpenCV on the image and got slightly better results, though comparing OCR output can be dicey past a certain level (the blur recovered a few more words but also messed up a number). There are lots of tips in the mailing list on improving accuracy by image manipulation, but I’d start with the wiki page on this topic [4].
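
A blur step of this kind might look like the sketch below. It assumes the `opencv-python` package is installed; the kernel size of 3 is only an illustrative guess, not the value used in the test above, and `odd_kernel_size` and `blur_for_ocr` are hypothetical helper names.

```python
def odd_kernel_size(size):
    """Return an odd, positive kernel width, which cv2.GaussianBlur requires."""
    size = max(1, int(size))
    return size if size % 2 == 1 else size + 1


def blur_for_ocr(src_path, dst_path, kernel=3):
    """Write a lightly Gaussian-blurred grayscale copy of src_path to dst_path."""
    import cv2  # imported here so odd_kernel_size works without OpenCV installed

    image = cv2.imread(src_path, cv2.IMREAD_GRAYSCALE)
    if image is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(f"could not read image: {src_path}")
    k = odd_kernel_size(kernel)
    blurred = cv2.GaussianBlur(image, (k, k), 0)  # sigma 0: derived from kernel size
    if not cv2.imwrite(dst_path, blurred):
        raise OSError(f"could not write image: {dst_path}")
```

The blurred copy can then be passed to tesseract in place of the original page; whether it helps will depend on the scan, so it is worth comparing both outputs on a few pages first.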

 

art

---

1. https://code.google.com/archive/p/python-tesseract

2. http://olena.lrde.epita.fr

3. http://stackoverflow.com/questions/4962978/is-tesseract-3-00-multi-threaded

4. https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

Cameron Fen

Jan 27, 2017, 1:34:23 PM
to tesseract-ocr
Hi Art,

Thanks!  I am already using multiprocessing on a cluster, as I have at least 100k images to OCR.  Correct me if I'm wrong, but if every core is already fully loaded with its own image, using Olena to carve each image into paragraphs will not improve speed.  On another note, we don't need the illustrations, but if we can remove all the extraneous lines and pictures with Olena, will that help accuracy?  I will take a look at the wiki.  Also, does anyone know whether Tesseract 4.0 is worth trying, given the headache of installing it?  Thanks,

Cameron

Art Rhyno.

Jan 27, 2017, 3:32:12 PM
to tesser...@googlegroups.com

Hi Cameron,

 

I am guessing your cluster software already assigns an instance of tesseract to each core. Olena might be useful on a single machine with multiple cores working on a single image, but it sounds like you are way beyond that scenario. I don’t know if it would be worth using Olena to remove illustrations, since it has some overhead itself in figuring out the segmentation. Not a lot, but I wonder if it would eat any savings you might see. Sorry, I don’t know anything about Tesseract 4.0. Good luck with your processing.
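
On a single multi-core machine, the one-instance-per-core arrangement could be sketched as below. This is only an illustration under stated assumptions: a `tesseract` binary on the PATH, one image file per page, and hypothetical helper names `run_tesseract` and `ocr_all`. Threads are enough here because each worker just waits on an external tesseract process.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_tesseract(image):
    """OCR one image; tesseract writes a .txt file next to the image."""
    image = Path(image)
    out_base = image.with_suffix("")  # e.g. 00000002.tif -> 00000002
    subprocess.run(["tesseract", str(image), str(out_base)], check=True)
    return image.with_suffix(".txt")


def ocr_all(images, workers=None, ocr=run_tesseract):
    """Apply `ocr` to every image, with at most `workers` running at once.

    Results come back in the same order as `images`.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr, images))
```

Once every core is busy with its own page, splitting pages into paragraphs should not add throughput, which matches the reasoning above.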

 

art
