Manual review and correction for characters outside of the Latin-1 character set


Misti Hamon

May 19, 2024, 11:02:24 PM
to tesseract-ocr
I've asked a couple different times, and each time I get just a little bit more information, but still not enough to work with.

I've got a mostly English language set of scans (image quality is good but not great, but best I can do without a better scanner, I'm working on that to re-scan but there are some problems that still wouldn't be fixed). These scans include characters that are not in the Latin-1 block, which I read somewhere and now can't find is the limit for the English data. Example characters not being recognized include fractions ( ⅔ instead of 1/8 or 2/3), the TM ( ) or C ( © ) symbols (latter is actually in Latin 1, but isn't directly typeable and, from what I've been able to tell, the circled part comes out so faint on the input image, tesseract thinks it is noise) and "smart" or curly quotes - all characters that require using alt+ codes, insert special character dialogs or letting your wordprocessor/DTP handle converting for you. Which seems to mean they require some level of manual review and correction to be able to get it into the text output. BUT, once you see you need to input manually, how do you handle the positioning data (when working in hocr format)? I considered, briefly, using character whitelisting to help with these, but, that would imply the characters are already included in the character set/wordlist, which if memory serves, many of these aren't?

Slightly related, how, exactly, do y'all deal with drop caps?

Ger Hobbelt

Jun 3, 2024, 7:06:51 PM
to tesseract-ocr
-  "These scans include characters that are not in the Latin-1 block, which I read somewhere and now can't find is the limit for the English data."

Well, to put it bluntly, diving into the rabbit hole without a helmet or a 'chute: as far as I have been able to discover, the current "official" tesseract training data "databases" (neural net matrices) that are used to recognize anything we throw at tesseract were produced ("trained") at Google by Ray Smith, using copious Google hardware, I expect -- training neural nets is no joy on the average Joe's hardware budget, after all. When you dig through the git commits, such as https://github.com/tesseract-ocr/tessdata/commits/main/ , you'll find the last training file *content* update was back in '17 by @theraysmith, and he hasn't been around much since: https://github.com/theraysmith?tab=overview&from=2017-12-01&to=2017-12-31 -- without any hard data, my initial guess is a change of corporate Google mind re tesseract.

Stefan Weil et al have done a ton of important work since, but when you ask "what can this baby recognize?" that translates 1:1 to "what has tesseract been trained to recognize?" and there... things get a little vague for me. I'd love to be corrected on this, slapped on the wrist or worse, but from what I've gleaned so far during my research:

- though there's https://github.com/tesseract-ocr/langdata , https://github.com/tesseract-ocr/tesstrain , https://github.com/tesseract-ocr/tessdata_best/commits/main/ and Ray Smith's public notes and papers about what was done for tesseract v4/v5 at https://github.com/tesseract-ocr/docs (which is separate from https://github.com/tesseract-ocr/tessdoc, which is more user oriented instead of architectural background), I am not confident that the actual list of training files used to produce those master traineddata LSTM files (= the tesseract v4/v5 OCR engine) is checked into git. I have seen a list of font names used somewhere in there (or was it the mailing list?), but for anyone who works with fonts that is already a handwavy kind of thing and, yes, copyrights, yada yada, will forever prevent anything more precise from being available, because the list almost certainly included commercial fonts. Then there are also the training input files defining the "text lines" to be rendered as training material: those actually determine which glyphs in the fonts get trained at all (and in what combinations). There I am not feeling confident either, as it looks like the files that were published are the ones from the older v3 engine -- still relevant, but *probably* not what Ray was using to produce the many traineddata files he did at the Google shop.
Having dug through the git histories and inspected the various files, scripts and notes about 2 years ago, I cannot say with complete confidence whether your (C), TM and 1/2, 3/4, etc. fraction glyphs made it into the training set for English back then. My *guess* is that they were included, if only in a few samples, so the neural net will have *some* recollection of them, but I also expect them to have featured little in the total training process, so recognition chances are reduced.

(Aside: As we focus on the English language training set here, I didn't mention the metric ton of work done by @Shreeshrii for Asian scripts, particularly Devanagari and related, a few years later. As far as I can tell, most of the `traineddata` scripts and process today are due to his work and Stefan Weil's. Stefan, if you look over there, has done a lot of work around OCR-ing (pre-war) German newspapers and similar publications, from the era when the Germans had a fondness for printing everything in (to my eyes) quite hard to read blackletter fonts. To make that feat happen, he and a university team (several German universities together, if I read correctly what was done back then) created a German-specific training set for newspaper blackletter print and published the resulting tesseract traineddata OCR databases for public use (language: "frk" = Fraktur). I don't recall seeing a publication where he lists the number of CPU hours used to produce that trained set (one (1) language, a few fonts, vs. the 400+ allegedly used in the Google production run), but you can bet your bottom dollar it wasn't cheap! Or quick!)

When we pop out of the rabbit hole of tesseract history, we might now better understand why your problem is answered... haphazardly:

- general advice number 1 out there is to 'tune' an existing language training file if you have special needs, such as your wish to recognize fractions, etc., which don't feature often in published texts and thus haven't been a real bother so far. This "tuning" advice is basically training advice: do a little extra training. That is, to me, a little hairy, as you are expected not to deteriorate the existing recognition ability while *slightly improving* the recognition confidence (and thus output quality) for a few glyphs ("characters in your fonts") that are already mostly recognized by the neural net, in the sense that it recognizes part or all of the relevant "shapes" that make up the glyphs you wish to see recognized. (This is a very rough translation of what a neural net "learns" vs. how we humans might understand pattern recognition, so tread carefully around this blather of mine if you think you're getting a look under the hood. We're rather *paraphrasing* the engine instead of pointing at its carburetor, spark plugs, etc., if you get my drift.) There's a rough sketch of what such a tuning run looks like, mechanically, after this list of options.

Logically, this approach is met with varying success (and crushed hopes) as it is VERY much dependent on the exact shapes and glyphs (characters) you add. (TM) might be helped by being quite close to a superscript T+M, while the fractions, being a combination of superscript, subscript and a / slash, might be doable or hard for the LSTM+CTC engine; I cannot tell without having tried. And training takes time, both in setting it up and in CPU cycles, so it's not a 5 minute thing to do. Which explains another type of silence around here.

- if that didn't work, you will read several folks advising to "lop off the top layer" and retrain the whole language. What this means is that, basically, you wipe just one of the many layers of the LSTM+CTC neural net -- the one where it is expected to 'conclude' things like "ah... that there and this shapy thingamajig here, all that jazz is very probably an 'a'..." -- and hope that that lopping-off-and-retraining suffices to get acceptable training results after running the training for a while (while checking that you're doing all right and not overtraining other bits and pieces of the engine's alphabet/text output!).
This takes rather more time than "tuning", as you must now retrain at least an entire layer, while tuning was only intended to have the training activity tweak a few cell connections in there a little to get what you wanted.

- general advice number 3 is to do what the Germans did and train a dedicated "language", which means you'll need to do all the work of creating font(s) and text line training files which include (hopefully) every word and symbol you may ever encounter later on, and then cook one CPU or more for a considerable time. I consider that effort approaching herculean, particularly when you're alone. Some have tried, and a few even succeeded, it seems, from the noises I recall over the last couple of years of lurking on this mailing list.
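
(For what it's worth, the mechanics of the "tuning" route are nowadays mostly wrapped up in the tesstrain repository's Makefile. A very rough sketch of kicking off such a run follows -- via Python's subprocess, but it is just a make invocation underneath. Every model name, path and iteration count here is invented by me, so check the tesstrain README before trusting any of it; it assumes you've prepared line images plus matching *.gt.txt transcriptions under data/eng_tuned-ground-truth/ inside the tesstrain checkout.)

    # Rough sketch only: fine-tune ("tune") the stock eng model with tesstrain.
    # Assumes a checkout of https://github.com/tesseract-ocr/tesstrain, a local
    # copy of tessdata_best, and ground-truth line images + *.gt.txt transcriptions
    # under data/eng_tuned-ground-truth/ in that checkout. Names/paths are made up.
    import subprocess

    subprocess.run(
        [
            "make", "training",
            "MODEL_NAME=eng_tuned",        # name of the resulting traineddata
            "START_MODEL=eng",             # continue from the stock English model
            "TESSDATA=/path/to/tessdata_best",
            "MAX_ITERATIONS=10000",        # keep this modest for a "tune"
        ],
        cwd="/path/to/tesstrain",
        check=True,
    )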

By now you'll surely have gotten the gist of it: from the distance of a mailing list POV, it's all a guess, and there are so many little details involved in arriving at success that almost nobody dares to say much, at least not all at once. Because this stuff is *hard* to get right, and the above can scare some folks off.

Me personally, I tried my hand at "tuning" a little about a year ago and it didn't fare well, because I found out I still didn't understand all the processes involved well enough to make decisions that would differ from joining a crap shoot blindfolded. But that is me and I am not into the adrenalin rush of bungee jumping either, so it probably says more about me than about the process of training/tuning tesseract.






Having mentioned the above three options, my personal favorite advice number 4 is: try to come up with a way which keeps tesseract as-is, and add a review/correction post-process that is acceptable to you. If you can find it in your heart to accept that a little copy-editing after the OCR run is A-okay, you are probably better off, both in time spent and in frustration with machines' ways. After all, the initial setup cost for this option is much lower for single-person shops, I expect. ;-)  (The break-even point would be at a fairly large number of pages to process...)







- "I've got a mostly English language set of scans (image quality is good but not great, but best I can do without a better scanner"

Personal experience to date is that image preprocessing is a "field of active research" (i.e. you need to try and test all of your own and any others' ideas that sound more or less reasonable) and has a very strong effect on the outcome of the OCR stage. For instance, you may want to rescale your scanned images and see at which text pixel height they do well/best; previous research says text at 30-33 pixels height is optimal, but yours might differ a little from that, so experiment! (I'll try to do a tesseract run tomorrow on an image you posted earlier, at various resized sizes, to see what comes out of that one.)
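
(Something like this is what I mean by "experiment": OCR the same page at a handful of scale factors and compare the mean word confidence tesseract reports. Assumes Python with OpenCV and pytesseract; the file name and scale factors are just placeholders.)

    # Resize sweep: OCR the same scan at several scales and report the mean word
    # confidence for each, to find the scale where the text height suits tesseract
    # best (the 30-33 px figure linked further down is the ballpark to aim for).
    import cv2
    import pytesseract
    from pytesseract import Output

    img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

    for scale in (0.75, 1.0, 1.5, 2.0, 3.0):
        resized = cv2.resize(img, None, fx=scale, fy=scale,
                             interpolation=cv2.INTER_CUBIC)
        data = pytesseract.image_to_data(resized, lang="eng",
                                         output_type=Output.DICT)
        confs = [float(c) for c in data["conf"] if float(c) >= 0]  # -1 = no word
        mean_conf = sum(confs) / len(confs) if confs else 0.0
        print(f"scale {scale}: {len(confs)} words, mean confidence {mean_conf:.1f}")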

Ditto for post-processing: it might be useful, if the content is important enough to you, to dump it into a word processor / text editor with a spellchecker on board for further assistance. A manual review process of some kind is called for anyway, if you want consistent (very) high quality output.

There are also processors/tools that can do "smart quotes" if you like, but I would reserve that for last; my initial approach there would be to have the OCR engine spit out plain quotes wherever they occur and then convert them to "smart" open/close quotes in post, if I wanted them. French quotes would potentially be easier to OCR that way (as they appear at different vertical offsets), but I'd be glad to have *any* kind of quote coming out of the OCR machine: the training sets have been trained on a gazillion fonts, and intricate little typography details like "smart quotes" are rather font specific, so recognizing them, from an OCR engine's perspective, screams "tuning! dedicated font training!" and a little headache starts to develop over here. ;-))
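
(A crude sketch of that convert-in-post idea, plain Python; it only looks at the neighbouring character, so it will get fooled by apostrophes at the start of words and similar edge cases.)

    # Naive straight-to-curly quote conversion for OCR output text.
    # Opening quotes are assumed after start-of-text, whitespace or an open bracket;
    # everything else is treated as a closing quote. Good enough as a starting point.
    import re

    def smarten_quotes(text: str) -> str:
        text = re.sub(r'(^|[\s(\[])"', '\\1\u201c', text)   # opening double quote
        text = text.replace('"', '\u201d')                  # remaining " are closers
        text = re.sub(r"(^|[\s(\[])'", '\\1\u2018', text)   # opening single quote
        text = text.replace("'", '\u2019')                  # apostrophes / closers
        return text

    print(smarten_quotes('He said "it\'s done" and left.'))
    # -> He said “it’s done” and left.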



- "Slightly related, how, exactly, do y'all deal with drop caps?"

Errrrm, AFAICT.... we don't. Apologies.          Seriously though, I don't recall any positive success info on that one. 

Here my initial gut response is to "recognize" the drop caps in a preprocessor, i.e. in the "image segmentation" phase, and cut them out specifically so they can be extracted, rescaled to a sensible "regular text size" and only then fed into the OCR engine. Afterwards, that output has to be recombined with the text produced from the rest of the image segments. BUT that is mere theory, as tesseract does not yet have a module/subprocess to "identify" possible drop caps and segment and process them as I just described. Which means that today, you either do that up front and do the recombining afterwards in your own custom postprocess, or you decide to accept a little extra editorial post work by either keeping them in as-is (and expecting errors, or at least uncertainties reported by the OCR engine), or maybe tipp-ex-ing ;-) them out in preprocessing and hoping the engine's built-in dictionary resolves half of them through spelling correction. Anyway, this is all currently non-existent, alas, so anything you come up with is better than what exists today.
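
(To make that theory a little more concrete, here's a sketch of the cut-out / rescale / recombine idea in Python with OpenCV and pytesseract. The drop cap coordinates and file names are invented; in practice they'd come from your own segmentation step or from eyeballing the scan.)

    # Sketch: OCR a drop cap separately from the rest of the page, then recombine.
    # Coordinates and file names are made up for illustration.
    import cv2
    import pytesseract

    page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

    x, y, w, h = 120, 340, 180, 180            # hypothetical drop cap bounding box
    dropcap = page[y:y + h, x:x + w]

    # shrink the big initial down to roughly body-text height (~32 px)
    scale = 32.0 / h
    dropcap_small = cv2.resize(dropcap, None, fx=scale, fy=scale,
                               interpolation=cv2.INTER_AREA)

    # --psm 10: treat the image as a single character
    letter = pytesseract.image_to_string(dropcap_small, config="--psm 10").strip()

    # blank the drop cap out of the page so it can't confuse the main run,
    # then OCR the remainder as usual
    page[y:y + h, x:x + w] = 255
    rest = pytesseract.image_to_string(page)

    print(letter + rest.lstrip())              # crude recombination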

(I am working on my own copy of tesseract which might improve this a little, but don't expect any miracles there this quarter. I'm /slow/.)



The 'tesseract does best with 30-33 pixel high text' stuff is at: https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ
I wrote https://groups.google.com/g/tesseract-ocr/c/B2-EVXPLovQ/m/lP0zQVApAAAJ a while ago; maybe the diagram and some of the paragraphs in there aid understanding of what's going on under the hood, which is info I think you need, like I did/do.



Take care,

Ger


P.S.: your scan has been lying around here for a gander, but my own tesseract build is buggered ATM. Anyway, I installed an "official distro" build yesterday for other purposes, and I'll see how your previously posted scans fare with that one when I test a few things on them. To be reported later this week, possibly tomorrow afternoon.

Jun Repasa

Jun 4, 2024, 7:21:15 PM
to tesseract-ocr
If tesseract can no longer recognize specific characters, then it's time to add custom OCR models. I haven't done this myself, though, as most documents we scan are pretty normal.

Misti Hamon

Jun 7, 2024, 1:37:41 PM
to tesser...@googlegroups.com
Hello Ger, and thank you for responding. 

Regarding training and/or tuning - I definitely don't have the available computing power for a full training run, and, assuming I'm understanding the requirements (specifically the minimum-of-1000-images thing), I'm not sure I have enough data for a tune either (it's approximately 230 pages that use this font, with only about 50% text coverage on the denser pages; the rest is non-OCR pictures. Even if the 1000 images can be single-line images, I'm not sure I'd get there). I also have no idea what the font is; I suspect it's one that isn't available to the public (without a hefty fee), so generating new, very clean images isn't possible either (if it's possible to tune using one font and have it apply to others that aren't visually similar, that might actually be an option).

So, we're back to manually fixing after the OCR run and/or using graphics software to further "fix" the images before processing. I could open the hOCR files in my text editor and "fix" commas that are read as periods, quotes that aren't quite correct, and even super/sub fractions; but generating the bounding boxes when whole words are simply ignored due to uneven lighting (even though they are in the input image, thanks to running a thresholding algorithm before handing it to tesseract) is something I haven't figured out how to do. (If you happen to know how to use The GIMP to selectively darken overexposed areas, that might help a lot.) Alternatively, is there a way to do a two-run recognition? Something akin to a non-persistent tune: do one run to a text file, manually correct the text file, and have the second run to hOCR use that text file as the dictionary for that run.
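
(Roughly what I imagine, written out in Python; this assumes tesseract's --user-words option is the right hook for feeding corrected words back in, and I don't know whether the LSTM engine actually honours it, so treat it as a question rather than a recipe. File names are made up.)

    # Two-pass idea: pass 1 produces plain text for manual correction, and the
    # corrected words then become a user word list for the hOCR pass.
    import subprocess

    # pass 1: plain text output (pass1.txt), to be corrected by hand
    subprocess.run(["tesseract", "page.png", "pass1", "txt"], check=True)

    # ... hand-correct pass1.txt, save as pass1_corrected.txt, then: ...
    words = set()
    with open("pass1_corrected.txt", encoding="utf-8") as f:
        for line in f:
            words.update(line.split())
    with open("userwords.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(words)))

    # pass 2: hOCR output, offering the corrected words as extra dictionary entries
    subprocess.run(["tesseract", "page.png", "pass2",
                    "--user-words", "userwords.txt", "hocr"], check=True)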

Biggest problem I am experiencing with manual correction: generating or fixing - mostly expanding, sometimes contracting - the bounding boxes after entering the correct characters, when what was recognized has the wrong metrics for what is supposed to be there.

Second biggest problem (which, if possible, should be fixed first): I need an additional preprocessing step to fix uneven lighting. I have Rawtherapee and The GIMP available (I was able to fix overexposure, but that darkened everything equally; I need a way to spot-darken the regions that received more light during scanning, since those regions are the ones that are most likely to not get recognized at all).
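
(What I think I'm after, expressed as code rather than GIMP steps: estimate the page background with a heavy blur, divide the scan by it to flatten the lighting, then threshold. This is my understanding of the usual "divide by background" trick, not something I've verified on these scans; assumes Python with OpenCV, and the file names are placeholders.)

    # Flatten uneven lighting before thresholding: divide the scan by a heavily
    # blurred copy of itself (the blur approximates the page background/lighting).
    import cv2

    img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

    # anything much larger than a text stroke survives the blur = background
    background = cv2.GaussianBlur(img, (0, 0), sigmaX=51)

    # divide original by background; scale=255 re-normalizes the result to 0..255
    flattened = cv2.divide(img, background, scale=255)

    # after flattening, a plain Otsu threshold copes much better with bright spots
    _, binary = cv2.threshold(flattened, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite("scan_flattened.png", binary)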


Misti Hamon

Jun 7, 2024, 1:45:17 PM
to tesser...@googlegroups.com
With novels and non-fiction prose (memoirs, basic history or whatever) I'm getting good runs; they also happen to use fonts that were already trained, or are close to ones that were. Manuals and textbooks - most of the ones I'm trying to work with include pictures, diagrams and other elements to further illustrate or just make things "pretty", and occasionally use non-standard fonts - are causing all sorts of problems. Tuning/retraining isn't possible: not enough data to work with, and I can't generate more because I don't know the fonts used. I also have the complicating factor of some uneven lighting that I can't figure out how to fix (an overall darken still leads to the areas that were overexposed getting skipped completely, even when running a thresholding algorithm before feeding to tesseract).


Jun Repasa

Jun 8, 2024, 7:56:50 PM
to tesseract-ocr
Guys, I am about to start a project which will mainly improve tesseract's ability to recognize text, regardless of input document/image quality.
I am securing some grants/funding.

If you're interested, let me know.

cheers 

Ger Hobbelt

Jun 9, 2024, 6:32:07 AM
to tesser...@googlegroups.com
@JunRepasa: can you share details of your project/intent? (privately + NDA is okay with me if that works better for you)

Why? Because I am working on that area of interest myself; however, it's slow going as I'm self-funded and there are multiple focus areas, some of them outside tesseract and not relevant to this "preprocessing stage" problem. (Slow going means I don't expect any usable results before end-of-year.)


--- /start tangential note 

My own (possibly relevant) goal set is this:

- integrate tesseract more fully into the mupdf tool chain (Artifex already has a basic tesseract run going, but for my purposes I need more fine-grained control per PDF and page image).
  Why mupdf based? Because my problem area is very comparable to "scanned magazine page images" packaged as every-page-is-an-image PDFs (electronic datasheets, scanned magazines, ...), which I need to have ready for FTS (Full Text Search) and text copy & paste of the decoded contents. Image/chart/graphics extraction is a bonus. Anyhow, my inputs can be gang-pressed into PDFs if they aren't already, plus a couple of loose images (posters / cheat sheets / other single-page publications) if need be.
- easier-for-humans diagnostics output, focused on the preprocessor stage: tesseract has some yet-to-be-diagnosed issues with page segmentation, i.e. discovering the little rectangular areas on the page where text words and text lines are situated; tesseract sometimes turns "deaf" to parts of the page for otherwise unremarkable page images. An experimental version of tesseract of mine outputs the debug output plus the debug intermediate-stage images in HTML format for easier perusal in the browser. This rides on the coattails of the tesseract ScrollView Java tool, but I am not looking for user-interactive; I want HTML-based, human-readable debug/diag log output for bulk processes that can be (partially) reviewed at a later date, and the review process should be less brain-load than it is right now.
- a flexible, more powerful preprocessing stage in tesseract: suppose we fix or remove the current segmentation bugs, what would work for me? Here the plan is to (minimally) modify the tesseract process so the various stages internally become addressable and steerable by a user script (I plan to use JavaScript for this, with QuickJS as the script core): that way I, and anyone else, can tweak the tesseract process stages without needing to recompile or use external means that run the executable repeatedly (pytesseract et al.). I need a hopefully faster process, as I will be processing page images in bulk on limited hardware: end-users' single machines.
- tesseract CLI / API: allow an optional "mask image" to be supplied next to the page image itself. The fundamental idea here is that the segmentation process is done elsewhere (by human or other machine) and the mask image is similar to what one would encounter in the 3D / movie entertainment industry: the "mask image" not only encodes which pixels in the image are text-to-be-OCR-ed, but also encodes the *order* in which these pixel groups form words or glyphs to OCR and output in the designated order. Think of a multi-layer mask image encoding all you need to unambiguously extract the text of a multi-column or other "shaped layout" page, plus marking any text that's part of in-page charts/graphs/images/footnotes/headers/footers, so that we can write them in order to the output hOCR, text or HTML formats. Thus the internal page segmentation logic can be overruled by external image means: all it takes is scanning the mask image to decode what is to be done, where and when, with no tesseract segmentation heuristics noise.
- extra image processing: once the scripting works, add other image grayscaling and thresholding algorithms so script writers can try a few more things and tweak the process to do what they find works for them. Basically that would mean using a PRLib- or OpenCV-like library next to leptonica.


The whole concept is based on getting to a mupdf+leptonica+tesseract-based application which takes a batch of arbitrary PDFs and processes them, rewriting each as a fully searchable PDF with the original content visible on-screen, while a text overlay makes text mark/edit/annotate/copy+paste behaviour possible, and a separate text-like output format is fed to a search engine indexer for FTS, so "you can google your own library".

Most of this exists out there already in some partial form or other (except the image mask concept); this is only a success when it can ultimately serve as a ready-to-use, out-of-the-box end-user application.
All future music right now: this is the target, but progress is slow.
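
(The crudest partial form that exists today is tesseract's own PDF renderer: for a single page image, something like the snippet below already gives you a searchable PDF with an invisible text layer, via pytesseract. File names are placeholders; the batch handling, mupdf integration and mask idea are the parts that don't exist yet.)

    # One page image in, one searchable PDF (image + invisible text layer) out,
    # using tesseract's built-in PDF renderer through pytesseract.
    import pytesseract

    pdf_bytes = pytesseract.image_to_pdf_or_hocr("page_0001.png", extension="pdf")
    with open("page_0001.pdf", "wb") as f:
        f.write(pdf_bytes)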

--- /end tangential note 


Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------


Ger Hobbelt

Jun 9, 2024, 6:41:03 AM
to tesser...@googlegroups.com
@MistiHamon:
care to share a set of those scans you have difficulty with? My use for them would be to see if I can improve the results; at the least, they would be great test material for future development, as they are already "known hard to get good results from".
At the very least I'd like to try my hand at a few of 'em. :-)   (The first one you posted earlier is waiting for that on my todo stack; I want to get my own experimental tesseract going with some new code first, so I can compare the vanilla release (UB Mannheim) with my own current state of affairs.)

It might be handy to drop a set of them in a Google Drive or Dropbox share; an alternative is dropping them into a GitHub repo and designating it a small test corpus. That way anyone who wants to try them can get them easily, and it won't load down the others on this mailing list.

Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

Jeremiah

Jun 9, 2024, 1:06:27 PM
to tesseract-ocr
@ger Not sure if we've spoken before, but what you are describing here sounds very similar in scope to Scribe OCR, an application I maintain (repo [here](https://github.com/scribeocr/scribeocr), application [here](https://scribeocr.com/)).  One of the goals of this project is to create a "ready-to-use end-user application" from Tesseract that does not require additional pre/post-processing, so let me know if that is something you would want to collaborate on.  Additionally, the following capabilities overlap with some of your bullet points above:
  1. Scribe OCR allows for inserting the OCR text into an input PDF using mupdf (rather than rendering everything to images).
  2. Scribe OCR prints the text visually in an editor, and allows for correcting the text manually.
  3. Scribe OCR includes a version of the visualizations from the ScrollView application that can be saved for later review, and viewed in a web viewer.
     - A demo can be found here: https://debug.scribeocr.com/ -- simply upload an image file supported by Tesseract, and the visualizations will be generated and displayed in the browser.
  4. Scribe OCR allows for some preprocessing steps that improve recognition.
     - Currently the only supported preprocessing steps are auto-rotate and upscaling, and the only step turned on by default is auto-rotate. Additional control of pre-processing could be added.

-Jeremiah

Misti Hamon

Jun 9, 2024, 1:47:13 PM
to tesser...@googlegroups.com
Ger,

Your problem set/end goal is similar to mine (textbooks/manuals rather than magazines and datasheets, and I only have tiff or jpg images, no partial PDFs, but full text search and copy/paste are things I want, and textbooks/manuals do have the same OCR difficulties as magazines).

Can't offer much help on the problems you currently are working on solving, except for the mask issue. If you haven't already looked into this, check out MRC - Mixed Raster Content - images. These are usually tiff format, but I'm pretty sure jpg can be made MRC as well. I currently use a (very) interactive GUI tool - github.com/ScanTailor-Advanced/scantailor-advanced (use this fork if you are going to try it; the original developer has abandoned development) - but the Internet Archive has a python script that *may* work better, especially in your workflow (there's a weird bug/problem in scantailor where it doesn't properly identify the background when text is over top of an image, like in magazines or textbook chapter/part/section openings). My tesseract runs currently happen against just the mask, but you could run them against the integrated image while tesseract only sees the mask - it even ignores text that your mask has been set to not treat as text.
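
(Roughly what I mean by running against "just the mask", sketched in Python; file names are made up, and the mask convention assumed here is the ScanTailor one, black text on a white background.)

    # Either OCR the foreground mask directly (it is already a clean binarization),
    # or use it to white out everything the mask says is not text, then OCR that.
    # Assumes the mask and the scan have the same dimensions.
    import cv2
    import pytesseract

    mask = cv2.imread("page_0001_mask.png", cv2.IMREAD_GRAYSCALE)   # text = black
    scan = cv2.imread("page_0001.png", cv2.IMREAD_GRAYSCALE)        # full scan

    # option 1: OCR the mask itself
    hocr_mask = pytesseract.image_to_pdf_or_hocr(mask, extension="hocr")

    # option 2: keep scan pixels only where the mask marks text, whiten the rest
    text_only = scan.copy()
    text_only[mask > 127] = 255          # mask is white wherever there is no text
    hocr_masked = pytesseract.image_to_pdf_or_hocr(text_only, extension="hocr")

    with open("page_0001_mask.hocr", "wb") as f:
        f.write(hocr_mask)
    with open("page_0001_masked.hocr", "wb") as f:
        f.write(hocr_masked)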


I'll get with you privately about testing images...
