I think pytesseract is a wrapper around the tesseract command-line program, so you would probably see the same results by calling tesseract directly. The python-tesseract [1] project used SWIG for a deeper level of integration, though I tried the same approach a few years ago and didn’t really notice much difference in throughput. It’s possible you could use a segmentation tool like Olena [2] to carve up the image into individual paragraphs and fire off multiple instances of tesseract, one per paragraph (see discussion here [3]). Olena might also give you a way to extract illustrations and such.
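To make the "multiple instances" idea concrete, here is a rough sketch. It assumes the page has already been cut into one image file per paragraph (by Olena or anything else; that step is not shown) and that the tesseract binary is on your PATH. The names build_command and ocr_regions are made up for illustration, and the runner argument is only there so the function can be exercised without tesseract installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(image_path, output_base):
    # Basic CLI form: tesseract <image> <output base>; tesseract appends ".txt".
    return ["tesseract", str(image_path), str(output_base)]


def ocr_regions(region_paths, out_dir, max_workers=4, runner=subprocess.run):
    """Run one tesseract process per pre-cropped region image, in parallel.

    region_paths: image files, one per paragraph (hypothetical input).
    Returns the paths of the text files tesseract is expected to write.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = [build_command(p, out_dir / Path(p).stem) for p in region_paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The threads only wait on child processes, so the work itself
        # runs in separate tesseract processes, one per region.
        list(pool.map(lambda cmd: runner(cmd, check=True), commands))
    return [out_dir / (Path(p).stem + ".txt") for p in region_paths]
```

Called as ocr_regions(["para1.png", "para2.png"], "out"), this would leave out/para1.txt and out/para2.txt behind. Whether it is actually faster than one tesseract run on the whole page depends on how long the segmentation step takes, so it is worth timing both.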
As for accuracy, I tried a Gaussian blur via OpenCV on the image and got slightly better results, though comparing OCR output can be dicey past a certain level (the blur gave a few more words but also messed up a number). There are lots of tips on the mailing list about improving accuracy by image manipulation, but I’d start with the wiki page on this topic [4].
art
---
1. https://code.google.com/archive/p/python-tesseract
3. http://stackoverflow.com/questions/4962978/is-tesseract-3-00-multi-threaded
4. https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
Hi Cameron,
I am guessing your cluster software already assigns an instance of tesseract to each core. Olena might be useful on a single machine with multiple cores working on a single image, but it sounds like you are way beyond that scenario. I don’t know if it would be worth using Olena to remove illustrations, since it has some overhead itself in figuring out the segmentation. Not a lot, but I wonder if it would eat any savings you might see. Sorry, I don’t know anything about Tesseract 4.0. Good luck with your processing.
art