Using different images for OCR and display


Andrew M.

Apr 22, 2022, 11:19:42 PM
to tesseract-ocr

I'm using the latest version of Tesseract (5.0), and I'm trying to determine whether or not I can insert some preprocessing steps that will -not- affect the form of the final image.

For example, I might start out with an image such as this.

There are different levels of shadow/brightness, so I might use adaptive Gaussian thresholding to avoid shadows during binarization.
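
For reference, a minimal pure-Python sketch of what adaptive Gaussian thresholding does, assuming the image is already grayscale and held as a list of rows of 0..255 values. The function names and the defaults (`block_size=5`, `c=10`) are illustrative choices, not anything from Tesseract; in practice an image library such as OpenCV would do this far faster.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a normalised 1-D Gaussian kernel of odd length `size`."""
    half = size // 2
    weights = [math.exp(-((i - half) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def adaptive_gaussian_threshold(gray, block_size=5, c=10):
    """Binarise a grayscale image given as a list of rows of 0..255 ints.

    A pixel becomes white (255) when it is brighter than the Gaussian-weighted
    mean of its block_size x block_size neighbourhood minus c, else black (0).
    Borders are handled by clamping coordinates (edge replication).
    """
    height, width = len(gray), len(gray[0])
    half = block_size // 2
    # Sigma derived from the block size, following the rule OpenCV documents
    # for Gaussian kernels when no sigma is given (an assumption of this sketch).
    sigma = 0.3 * ((block_size - 1) * 0.5 - 1) + 0.8
    kernel = gaussian_kernel(block_size, sigma)
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            local_mean = 0.0
            for dy in range(-half, half + 1):
                yy = min(max(y + dy, 0), height - 1)
                for dx in range(-half, half + 1):
                    xx = min(max(x + dx, 0), width - 1)
                    local_mean += kernel[dy + half] * kernel[dx + half] * gray[yy][xx]
            # Compare against the local mean, not a single global cutoff.
            row.append(255 if gray[y][x] > local_mean - c else 0)
        out.append(row)
    return out
```

On a page whose background darkens from 200 to 90 across its width, a fixed global cutoff of 128 would blacken the shadowed side, while the local comparison keeps the background white and marks only pixels clearly darker than their surroundings.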

I will now run this through tesseract, with the hope of creating an OCR'd PDF in the end. However, I want the image that the end user (and I) see to be the full-color, original image, with the text from the transformed image underlaid.

Is there a way to manage this? Or am I completely missing the point here?

Ger Hobbelt

Apr 23, 2022, 1:26:19 AM
to tesseract-ocr
The (or a?) way to manage this is to produce page-image-based PDF files which incorporate a text overlay layer.

That way, when a user opens such a PDF for viewing, they'll get to see the original scans, color, deformations, noise and all, while they will, at the same time, be able to select and copy/paste the same content as text, thanks to the embedded text overlay. The same happens for screen readers: the invisible text overlay will be used and read aloud by your computer (if you have the software with these capabilities installed and set up).

So the good news is: you're on the right track.

The other bit is: this PDF processing stuff, or whatever other similar process you may find useful (e.g. closed captions for video-based inputs), is entirely outside the remit of tesseract.

Tesseract is the component meant to hand you raw OCR results, i.e. the raw machine-produced text, given a suitably **preprocessed** input image (like the Gaussian-thresholded one you showed as the second image). You are then supposed to take that output (raw text, plus possibly the pixel coordinates where tesseract located each text blurb in the image) and apply whichever postprocessing you deem fit. That may include text cleanup through spell checking or other, more sophisticated means, and maybe feeding it to a tool which will combine this data with the original scans to produce such a PDF.
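
To make "raw text plus pixel coordinates" concrete: one of tesseract's output formats is hOCR, where each recognised word is a span whose `title` attribute carries a `bbox` in image pixels. Below is a hedged sketch using only Python's standard library. The sample markup is hand-written for illustration, `WordBoxParser` and `scale_box` are hypothetical helper names, and the parser assumes word spans contain no nested spans.

```python
import re
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration (not real tesseract output).
SAMPLE_HOCR = """
<span class='ocr_line' id='line_1_1' title="bbox 36 92 580 122">
 <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>Hello</span>
 <span class='ocrx_word' id='word_1_2' title='bbox 109 92 201 116; x_wconf 93'>world</span>
</span>
"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first closing span ends the current word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def scale_box(box, factor):
    """Map a box from the OCR image onto a display image `factor` times its size."""
    return tuple(round(v * factor) for v in box)


parser = WordBoxParser()
parser.feed(SAMPLE_HOCR)
for text, box in parser.words:
    print(text, box)
```

`scale_box` only matters when the preprocessed image was resized relative to the image that will be displayed; if both have the same pixel dimensions, the boxes can be used unchanged.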

Tesseract has some features to produce a PDF or similar output, but don't get confused by this: tesseract's "core business", so to speak, is transforming **preprocessed** image inputs to raw text + pixel coordinates.
Tesseract only offers *some* input image preprocessing, image thresholding and various output file format options to give some users a basic departure point for ease of use, but these options are, in my mind at least, only there as a "minimal viable product demo" so you'll be able to get at something reasonably believable quickly, before you go and do the rest of the tough stuff. ;-)

Regrettably, I don't have a clearer, more directly usable answer for you: while there are (open source) solutions for this type of process already out there, none work to my satisfaction, so consider this an "ongoing research effort", a.k.a. you'll have more to do and find out yourself.

One option *may* be combining tesseract with MuPDF (from the makers of Ghostscript). While it's not satisfactory yet for **my purposes**, it may bring you that much closer to your goal: bleeding-edge MuPDF (think git repository master branch, not software releases) incorporates code to load a PDF (e.g. one produced by your book scanner hardware and software, hence only carrying page images), feed those page images to a linked-in tesseract library, and place the resulting text in the output PDF file.
While it is slow going here due to circumstances, that's the workflow I hope to adjust and augment to suit my needs, and maybe it can already serve yours. (A limitation being non-user-scriptable image preprocessing; YMMV.)

There are other toolchains already doing this out there (some Python-based tools, for example), so be sure to check around.


The key takeaway: think of tesseract as one tool in a whole chain of tools necessary to get what you want. Then keep in mind that the core capability of tesseract is taking b&w thresholded images (BLACK TEXT on WHITE BACKGROUND, mind you! ;-) ) and producing raw text from that; everything else you need before and after that step should be custom tailored to your specific needs and quality levels by using additional tooling and processes.

HTH.




Andrew M.

May 19, 2022, 2:01:48 PM
to tesseract-ocr
I've been meaning to come back and say -thank you- for this response. 

There seem to be unlimited exceptions to any rule I've tried to develop, but things are coming along. One pipeline involves taking the hOCR output from preprocessed images and smooshing it together with the original images using hocr-pdf from the OCRopus suite. It's... sketchy, but it works. My job is "general IT person", so while this falls inside my realm of responsibility, it is 100% outside of my narrow area of expertise (so narrow it is seemingly invisible...).
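
A rough sketch of that pipeline in Python, for illustration only. It assumes hocr-pdf expects a directory of same-named .jpg and .hocr files and writes the PDF to standard output; that usage should be checked against the installed version. `stage_page` and `combine_command` are hypothetical names, and the commands are only built here, not run.

```python
import shutil
from pathlib import Path


def stage_page(original_jpg, preprocessed_img, workdir, stem):
    """Copy the original into workdir and return the tesseract command for the page.

    Assumes the preprocessed image has the same pixel dimensions as the
    original, so the hOCR coordinates line up with the image in the PDF.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # The full-color original is what the reader will see in the PDF.
    shutil.copyfile(original_jpg, workdir / f"{stem}.jpg")
    # `tesseract <image> <outputbase> hocr` writes <outputbase>.hocr,
    # here recognised from the preprocessed (thresholded) image.
    return ["tesseract", str(preprocessed_img), str(workdir / stem), "hocr"]


def combine_command(workdir):
    """Return the hocr-pdf command for a staged directory (assumed usage)."""
    return ["hocr-pdf", str(workdir)]
```

Each command list can be passed to `subprocess.run(cmd, check=True)`; for hocr-pdf, redirect standard output to the target PDF file.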

Anyway, thank you-- it made my life a lot easier (and more interesting) for a few days.