Textbook-like format. Correcting improperly recognized text


Misti Hamon

Apr 29, 2024, 2:05:43 PM
to tesseract-ocr
Forgive me, I have lots of questions and will be trying to separate out one question per conversation (so that those searching later may more easily find the answers).

I'm working with scanned images of a textbook-like layout: occasional drop caps, and text in 2 or occasionally 3 columns that flows around images (sometimes an actual square or rectangle; in others the image had its background removed and the text flows around the subject). There is jargon too: most of the book is English, but there is topic-specific jargon, abbreviations of the jargon, and, even worse, acronyms and symbols of said jargon. Where fractions are used, they are in the form of smart fractions (so something like 1/4" uses the space of 2 characters, not 4). Also, the lighting during the scan was uneven, and the original images were taken at approx. 250 dpi. There is also tabular data (worst case, I'm fine with the tabular stuff not being included in the OCR results).

I've preprocessed the images, including binarization and upscaling to get 300 dpi for Tesseract to work with, but the uneven lighting couldn't be entirely fixed (that would need a rescan, which is not an option right now, unless someone knows of a way to fix it in GIMP), which made binarization of some blocks on some pages less successful than others.
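For readers facing the same uneven-lighting problem, one commonly used alternative to a single global threshold is a local-mean (Bradley-style) adaptive threshold, which compares each pixel with the average brightness of its neighbourhood. The following is only a minimal pure-Python sketch of that idea, not the preprocessing actually used here; the function name, the `window` size, and the `t` margin are illustrative assumptions, and in practice an image library routine (for example OpenCV's `cv2.adaptiveThreshold`) would be the usual choice.

```python
def adaptive_threshold(gray, window=15, t=0.15):
    """Binarize a grayscale image given as rows of 0-255 ints.

    A pixel becomes ink (0) when it is more than a fraction t darker than
    the mean of the window around it; otherwise it becomes paper (255).
    The window and t defaults are illustrative, not tuned values.
    """
    h, w = len(gray), len(gray[0])
    # Integral image with a zero border, so any window sum costs O(1).
    integ = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum
    half = window // 2
    out = []
    for y in range(h):
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            area = (y1 - y0) * (x1 - x0)
            total = integ[y1][x1] - integ[y0][x1] - integ[y1][x0] + integ[y0][x0]
            # Same test as gray < local_mean * (1 - t), without the division.
            row.append(0 if gray[y][x] * area < total * (1 - t) else 255)
        out.append(row)
    return out
```

Because the comparison is relative to the local background, a page that gets darker toward one edge can still be separated into ink and paper where a fixed cut-off would turn the dark side solid black.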

That's the background, may need to refer back to it with other questions.

So far (I've tried OEM 0 and 1) results are "ok" but there are errors: both high-confidence words that are wrong and low-confidence words that are actually correct, as well as difficulty with the fractions and orphans from the drop caps. Some of the jargon-related stuff is iffy too (when lighting and binarization are clear, LSTM runs pick most of it up pretty well, though). Using an hOCR viewer (ScribeOCR, which I found out about on list) isn't going so well. The physical book these images were taken from is approximately US Letter sized, and ScribeOCR is "stuck" on showing me the whole page, which makes the text too small to actually read (and since I have wrong high-confidence and correct low-confidence words, I can't depend on the color coding). If I could read it, I could correct there. So how, exactly, does one go about correcting hOCR results?

Jeremiah

Apr 29, 2024, 2:35:59 PM
to tesseract-ocr
Regarding proofreading with Scribe OCR, it is definitely possible to zoom in.  The controls are virtually identical to popular document viewer programs like Acrobat.  You can zoom in on the current location of the mouse using Control + Mouse Wheel, scroll using the mouse wheel, and pan in all directions using the middle mouse button.

Regarding confidence metrics, the confidence values reported by Tesseract are unfortunately extremely unreliable at the level of individual words. This is not fixable, and is not even unique to Tesseract. I benchmarked Abbyy (paid/commercial OCR program) at one point and found that the vast majority of low-confidence words were correct, and the vast majority of incorrect words were high-confidence. Metrics from OCR engines can be useful on a less granular level (a page with average confidence 0.95 will be significantly higher-quality than a page with average confidence of 0.80); however, I don't think accurate metrics are possible on the word level. None of these programs have any robust way to evaluate themselves, so the confidence metrics are built using some internal metrics from the recognition process.
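As a rough illustration of the page-level use described above, the word confidences in an hOCR file can be averaged per page. This is a minimal sketch that assumes one page per hOCR string and relies on the `x_wconf` property that Tesseract writes (on a 0-100 scale) into each word's `title` attribute; the function name is hypothetical.

```python
import re

# Matches the per-word confidence that Tesseract stores in hOCR title attributes.
WCONF = re.compile(r"x_wconf\s+(\d+(?:\.\d+)?)")

def page_mean_confidence(hocr_text):
    """Return the mean word confidence of one hOCR page on a 0-1 scale."""
    values = [float(v) for v in WCONF.findall(hocr_text)]
    if not values:
        return None  # no recognized words on this page
    return sum(values) / len(values) / 100.0
```

Sorting pages by this value is one way to decide which pages to proofread first, while still treating the individual word scores with suspicion.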

If having more accurate confidence metrics is important, one option is to use the built-in "Recognize Text" feature of Scribe OCR rather than uploading data from Tesseract.  This feature runs Tesseract Legacy and Tesseract LSTM, compares the results, and marks words that agree across versions as "high confidence" and words that disagree across versions as "low confidence."  This method is significantly more robust than using the confidence metrics from Tesseract, and generally flags >90% of incorrect text as low confidence.  Note that Scribe OCR uses (by default) a forked version of Tesseract, so recognition results may differ.
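The agree/disagree idea can be approximated outside Scribe OCR by aligning the word sequences from two separate Tesseract runs (for example `--oem 0` and `--oem 1`). The sketch below is not Scribe OCR's implementation; it is a simplified illustration using Python's standard `difflib`, and the function name is hypothetical.

```python
from difflib import SequenceMatcher

def flag_disagreements(legacy_words, lstm_words):
    """Pair each LSTM word with True if the legacy run agrees, else False."""
    matcher = SequenceMatcher(a=legacy_words, b=lstm_words, autojunk=False)
    flagged = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "equal" spans are words both engines produced in the same order;
        # "replace" and "insert" spans are words only the LSTM run produced.
        for word in lstm_words[j1:j2]:
            flagged.append((word, tag == "equal"))
    return flagged

legacy = "Cut the board to 1/4 inch.".split()
lstm = "Cut the hoard to 1/4 inch.".split()
result = flag_disagreements(legacy, lstm)
```

Here only `hoard` is marked as disagreeing. Note that this compares plain text, so words that the legacy run found but the LSTM run skipped entirely are not reported.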

Answering questions specific to your document would require providing some of the image(s) at issue. 

Misti Hamon

Apr 29, 2024, 4:03:16 PM
to tesser...@googlegroups.com
"Regarding proofreading with Scribe OCR, it is definitely possible to zoom in. The controls are virtually identical to popular document viewer programs like Acrobat. You can zoom in on the current location of the mouse using Control + Mouse Wheel, scroll using the mouse wheel, and pan in all directions using the middle mouse button."

This was helpful, sort of. I'm on a laptop with a gesture-capable touchpad and a gesture-capable touch screen, and zooming using gestures did not work (it looks more like an OS settings problem that I'll have to investigate), but I did pull out an actual mouse and was able to get zoom working that way, so thank you. A request? If possible, could a "fit width" and "fit page" button be added, instead of being dependent on a real mouse to get at least some zoom? (Scroll and pan work fine via touchpad and touch screen.)

I'll go through all my images and see if I can find a single page that has most of the issues so I'm not sending several; it might take a few days. The main crux of my question, though, is this: is there a way to post-process and "fix" things like missed characters, drop-cap-related orphans, commas that are read as periods regardless of how good your input images are, "smart fractions", and any other problems that can't be fixed by tweaking the command used to invoke Tesseract? (Neither legacy nor LSTM does well with the drop caps or smart fractions, so running ScribeOCR's recognize wouldn't help those anyway, even if it fixes everything else.) I do have questions about tweaking the command as well; I just haven't asked them yet.


Jeremiah

May 2, 2024, 1:52:07 AM
to tesseract-ocr
I updated the desktop version of Scribe OCR to have zoom buttons, and removed the behavior where changing pages resets the zoom.  Therefore, it should be possible to edit all pages after zooming in once, which should make this less of an issue.  

I am not aware of any post-processing program that would fix the issues described without using OCR or manual review. 

Misti Hamon

May 4, 2024, 12:56:57 AM
to tesseract-ocr
Thank you for the zoom buttons, they make it a lot easier to work with! Please forgive the delay in responding (had a bit of a computer issue, was able to save all my data, but took a couple days longer than expected to get everything all set up again). This will be a photo heavy post, easier to show what I'm working with, and the problems I'm having.

I have 3 main types of pages that are getting OCR'd: new chapter pages, "normal" pages with images, and pages with images that include drop caps. Except for new chapter pages (only because their text size is larger overall), some pages also have lighting issues, so the thresholding is not always able to get a good "picture" of every glyph. I am including here images showing each page type. Note that these aren't the images that are actually being fed to Tesseract; they are JPG conversions with all processing applied except for mixed raster splitting and thresholding. (Especially for the new chapters, I want to maintain the exact appearance, and I'm having issues with splitting not handling the background correctly and with Tesseract missing more characters than what my examples show.)

First, a new chapter page:
0006-a.jpg

Next, a drop cap:
0004-b.jpg

And a "normal" page:
0005-a.jpg

Two special cases. This one has drop caps as a label for a step:
0010-b.jpg

And this one (which, honestly, could be left as just an image with no text layer, but it shows most of the recognition problems I'm having):
0003-b.jpg

As you can see from this page, the lighting is a little less than even, and I cannot rescan right now (the ideal solution for at least some of the problems). Running just the foreground/text layer (attached, not embedded, if you want to take a look at the results yourself) with OEM 1, default PSM, and English (this is the only page that has any language other than English or jargon, and I'm trying to do the Tesseract runs in a script), I get the following (loaded into ScribeOCR for easier comparison). I did not do this run with hocr_char_boxes set; setting it does also help *some*, at least in OEM 0.
0003-b-proof.png
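For anyone scripting comparable runs, the settings described above map onto the Tesseract command line roughly as follows. This is a sketch under assumptions: the helper name and file names are hypothetical, and the exact command used for the run shown here was not posted. It only builds the argument list; the PSM is left at its default by omitting `--psm`.

```python
def tesseract_hocr_command(image, output_base, oem=1, lang="eng", char_boxes=False):
    """Build an argument list that asks Tesseract for hOCR output."""
    cmd = ["tesseract", image, output_base, "--oem", str(oem), "-l", lang]
    if char_boxes:
        # Ask for character-level boxes in the hOCR output as well.
        cmd += ["-c", "hocr_char_boxes=1"]
    cmd.append("hocr")  # the "hocr" config selects hOCR output
    return cmd

# Hypothetical file names; the run described above used OEM 1 without char boxes.
cmd = tesseract_hocr_command("0003-b.tif", "0003-b")
```

The list can then be passed to `subprocess.run(cmd, check=True)` inside a loop over pages, varying `oem` or `char_boxes` to compare results.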

Zoomed in on the chart to show the primary concerns I have:
0003-b-zoomed.png

As you can see, the fractions get misidentified or skipped altogether (the first 6 entries in the first column are 1/8, 1/4, 3/8, 1/2, 5/8, 3/4). So, I've got the best image I can right now, have run OCR, and am manually reviewing. Other than maybe trying a different PSM, I now need to fix things manually. Characters that can be typed directly are easy to either add to the hOCR file directly or bring into Scribe (or some other review-assist software) and fix with the edit tools there. Special (Unicode) characters are a bit harder (I've been using special-character insert functions and copy and paste, at least until I can remember how to type the character codes directly), but they do get in, at least where some character was already identified. When a "word" is completely ignored, or with the drop caps, I haven't yet figured out how to get the missing character(s) in.
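One possible way to add a word that OCR skipped is to edit the hOCR markup programmatically: hOCR words are `span` elements of class `ocrx_word` inside an `ocr_line` span, each carrying a `bbox` in its `title`. The sketch below uses Python's standard `xml.etree.ElementTree` on a small hypothetical line fragment. The helper name, the bounding boxes, and the choice of `x_wconf 100` for a hand-entered word are all assumptions, and a full Tesseract hOCR file also declares an XHTML namespace that this fragment leaves out.

```python
import xml.etree.ElementTree as ET

# Unicode fraction characters, written as escapes so they are easy to type.
FRACTIONS = {
    "1/4": "\u00bc", "1/2": "\u00bd", "3/4": "\u00be",
    "1/8": "\u215b", "3/8": "\u215c", "5/8": "\u215d", "7/8": "\u215e",
}

def insert_word(line, index, text, bbox):
    """Insert a new ocrx_word span into an ocr_line element at position index."""
    x0, y0, x1, y1 = bbox
    word = ET.Element("span", {
        "class": "ocrx_word",
        # Confidence 100 marks the word as manually verified (an assumption).
        "title": f"bbox {x0} {y0} {x1} {y1}; x_wconf 100",
    })
    word.text = FRACTIONS.get(text, text)  # swap in the single-glyph fraction
    word.tail = " "
    line.insert(index, word)
    return word

# Hypothetical line in which the leading fraction was skipped by OCR.
line = ET.fromstring(
    "<span class='ocr_line' title='bbox 10 10 200 40'>"
    "<span class='ocrx_word' title='bbox 60 10 120 40; x_wconf 91'>inch</span>"
    "</span>"
)
insert_word(line, 0, "1/4", (10, 10, 50, 40))
fixed = ET.tostring(line, encoding="unicode")
```

A drop-cap orphan is the simpler case: the word already exists, so only its text needs changing (for example setting `.text` from a hypothetical `nce` to `Once`) and, ideally, its `bbox` widened to cover the drop cap.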
0003-b.tif