- "These scans include characters that are not in the Latin-1 block, which I read somewhere and now can't find is the limit for the English data."
Well, to put it bluntly, diving into the rabbit hole without a helmet or a 'chute: as far as I have been able to discover, the current "official" tesseract training data "databases" (neural net matrices) that are used to recognize anything we throw at tesseract were produced ("trained") at Google by Ray Smith, using copious Google hardware, I expect -- training neural nets is no joy on the average Joe's hardware budget, after all. When you dig through the git commits, such as
https://github.com/tesseract-ocr/tessdata/commits/main/ , you'll find the last training file *content* update was back in '17 by @theraysmith, and he hasn't been around much since:
https://github.com/theraysmith?tab=overview&from=2017-12-01&to=2017-12-31 -- without any hard data, my guess is a change of corporate mind at Google re tesseract.
Stefan Weil et al. have done a ton of important work since, but when you ask "what can this baby recognize?" that translates 1:1 to "what has tesseract been trained to recognize?" and there... things get a little vague for me. I'd love to be corrected on this, slapped on the wrist or worse, but from what I've gleaned so far during my research:
- though there's
https://github.com/tesseract-ocr/langdata ,
https://github.com/tesseract-ocr/tesstrain ,
https://github.com/tesseract-ocr/tessdata_best/commits/main/ and Ray Smith's public notes and papers about what was done for tesseract v4/v5 at
https://github.com/tesseract-ocr/docs (which is separate from
https://github.com/tesseract-ocr/tessdoc, which is more user-oriented than architectural background), I am not confident that the actual set of training files used to produce those master traineddata LSTM files (= the tesseract v4/v5 OCR engine) is checked into git: I have seen a list of font names used somewhere in there (or was it the mailing list?), but for anyone who works with fonts that already is a handwavey kinda thing, and, yes, copyrights, yadayada, will forever prevent something more precise from being available, because the list most certainly included commercial fonts. Then there are also the training input files defining the "text lines" to be rendered as training material: those actually determine which glyphs in the fonts get trained at all (and in what combinations). And there I am not feeling confident either, as it looks like the files published are the ones from the older v3 engine: still relevant, but *probably* not what Ray was using to produce the many traineddata files he did at the Google shop.
Having dug through the git histories and inspected the various files, scripts and notes about 2 years ago, I cannot say with complete confidence whether your (C), TM and 1/2, 3/4, etc. fraction glyphs made it into the training set for English back then. My *guess* is that they were included, if only as a few samples, so the neural net will have *some* recollection of them, but I also expect them to have "featured little" in the total training process, so recognition chances are reduced.
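What you *can* check cheaply is whether those glyphs are in the model's output alphabet at all: the traineddata file carries its unicharset along, and a glyph that isn't in there cannot come out of the engine, period. A minimal sketch of that check, assuming a local eng.traineddata and the combine_tessdata tool that ships with tesseract on your PATH (the exact unpacked file names may differ a bit between versions):

```python
# Sketch: check whether glyphs like ©, ™, ½ are in eng.traineddata's output
# alphabet at all. Assumes `combine_tessdata` (ships with tesseract) is on
# the PATH and eng.traineddata sits in the current directory.
import subprocess

# Unpack the traineddata components into eng.* files (eng.lstm-unicharset etc.)
subprocess.run(["combine_tessdata", "-u", "eng.traineddata", "eng."], check=True)

wanted = ["©", "™", "½", "¼", "¾"]
with open("eng.lstm-unicharset", encoding="utf-8") as f:
    # First line is the entry count; every following line starts with the glyph.
    known = {line.split(" ", 1)[0] for line in f.readlines()[1:]}

for ch in wanted:
    print(ch, "is" if ch in known else "is NOT", "in the trained alphabet")
```

Being in the alphabet is necessary but not sufficient, of course: it tells you nothing about how *well* the shape was trained.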
(Aside: As we focus on the English language training set here, I didn't mention the metric ton of work done by @Shreeshrii for Asian scripts, particularly Devanagari and related, a few years later. As far as I can tell, most of the `traineddata` scripts and process today are due to his work and Stefan Weil's; the latter, you'll note if you look over there, has done a lot of work around OCR-ing (pre-war) German newspapers and similar publications, from the days when the Germans had a fondness for printing everything in (to my eyes) quite hard-to-read blackletter fonts. To make that feat happen, he and the university team (several German universities together, if I read what was done back then correctly) created a German-specific training set for newspaper blackletter print and published the resulting tesseract traineddata OCR databases for public use (language code "frk" = Fraktur). I don't recall seeing a publication where he lists the number of CPU hours used to produce that trained set (one (1) language, a few fonts, vs. the 400+ fonts allegedly used in the Google production run), but you can bet your bottom dollar it wasn't cheap! Or quick!)
When we pop out of the rabbit hole of tesseract history, we might now better understand why your problem is answered... haphazardly:
- general advice number 1 out there is to 'tune' a language training file if you have special needs, such as your wish to recognize fractions, etc., which don't feature often in published texts and thus haven't been a real bother thus far. This "tuning" advice is basically training advice to do a little extra training, which is, to me, a little hairy as you are expected to not deteriorate the existing recognition ability while *slightly improving* the recognition confidence (and thus output quality) for a few glyphs ("characters in your fonts") that are already mostly recognized by the neural net as it recognizes part or all of the relevant "shapes" that make up the glyphs you wish to see recognized. (This is a very rough translation of what a neural net "learns" vs. how we humans might understand pattern recognition, so tread carefully around this blather of mine if you think you're getting a look under the hood. We're rather more *paraphrasing* the engine instead of pointing at its carburetor, spark plugs, etc., if you get my drift.)
Logically, this approach is met with varying success (and crushed hopes) as it is VERY much dependent on the exact shapes and glyphs (characters) you add. (TM) might be helped by being quite close to a T+M superscript, while the fractions being a combo of superscript, subscript and a / slash might be doable or hard for the LSTM+CTC engine, I cannot tell without having tried. And training takes time, both in setting it up and in CPU cycles, so it's not a 5 minute thing to do. Which explains another type of silence around here.
- if that didn't work, you will read several folks advising to "lop off the top layer" and retrain the whole language. What this means is, basically: wipe just one of the many layers of the LSTM+CTC neural net -- the one where it is expected to 'conclude' things like "ah... that there and this shapy thingamajig here, all that jazz is very probably an 'a'..." -- and hope that that lopping-off-and-retraining suffices to get acceptable results after running the training for a while (and checking that you're doing all right and not overtraining other bits and pieces of the engine's alphabet/text output!). (The same command sketch after this list shows where the extra flags come in.)
This takes rather more time than "tuning", as you must now retrain at least an entire layer, while tuning was only intended to have the training activity tweak a few cell connections in there a little to get what you wanted.
- general advice number 3 is to do what the Germans did and train a dedicated "language", which means you'll need to do all the work of collecting font(s) and creating text-line training files which include (hopefully) every word and symbol you may ever encounter later on, and then cook one CPU or more for a considerable time. I consider that effort approaching herculean, particularly when you're alone. Some have tried, and a few even succeeded, it seems, judging from the noises I recall from the last couple of years of lurking on this mailing list.
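For what it's worth, and with the caveat that I have only partly walked this road myself: the mechanics of options 1 and 2 both go through the lstmtraining tool (the tesstrain repo wraps most of it in a Makefile). A rough sketch follows, with Python used only to keep the commands in one runnable file; every path, file name and iteration count is a placeholder you'd replace with your own, and you'd need a set of .lstmf ground-truth samples listed in train_files.txt:

```python
# Rough sketch of the fine-tune ("tuning") route and the replace-top-layer
# route, assuming the tesseract training tools (combine_tessdata,
# lstmtraining) are installed and that train_files.txt lists your .lstmf
# ground-truth samples. All file names and counts are placeholders.
import os
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

os.makedirs("out", exist_ok=True)

# 1) Pull the LSTM network out of the existing (best/float) traineddata.
run(["combine_tessdata", "-e", "eng.traineddata", "eng.lstm"])

# 2a) Option 1, "tuning": continue training from the existing net for a
#     modest number of iterations with your extra material mixed in.
run(["lstmtraining",
     "--continue_from", "eng.lstm",
     "--traineddata", "eng.traineddata",
     "--train_listfile", "train_files.txt",
     "--model_output", "out/tuned",
     "--max_iterations", "400"])

# 2b) Option 2, "lop off the top layer": same call, but with --append_index
#     and --net_spec added so the net is cut at a given layer and a fresh
#     output layer is bolted on top; see the official TrainingTesseract docs
#     for the layer index and net_spec string that match your starting model.

# 3) Freeze the resulting checkpoint back into a usable traineddata file.
run(["lstmtraining", "--stop_training",
     "--continue_from", "out/tuned_checkpoint",
     "--traineddata", "eng.traineddata",
     "--model_output", "eng_tuned.traineddata"])
```

Treat that as a map of the terrain, not a recipe; the real work is in producing good ground truth and in checking you haven't hurt the existing recognition.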
By now you'll surely have gotten the gist of it: from the distance of a mailing-list POV, it's all a guess, and there are so many little details involved in arriving at success that almost nobody dares venture to say much, at least not all at once. Because this stuff is *hard* to get right, and the above can scare some folks off.
Me personally, I tried my hand at "tuning" a little about a year ago and it didn't fare well, because I found out I still didn't understand all the processes involved well enough to make decisions any better than joining a crap shoot blindfolded. But that is me, and I am not into the adrenalin rush of bungee jumping either, so it probably says more about me than about the process of training/tuning tesseract.
Having mentioned the above three options, my personal favorite advice number 4 is: try to come up with a way which keeps tesseract as-is and adds a review/correction post-process that is acceptable to you. If you find it in your heart to accept that a little copy-editing after the OCR run is A-okay, you are probably better off, both in time spent and in frustration with machines' ways. After all, the initial setup cost for this option is much lower for single-person shops, I expect. ;-) (The break-even point would be at a fairly large number of pages to process...)
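To make option 4 a bit more concrete, here is the kind of small post-process I have in mind: run tesseract as-is, map a few known ASCII-isms to the glyphs you actually want, and flag low-confidence words for the human copy-edit pass. pytesseract is just one convenient way to drive tesseract; the confidence threshold, the replacement table and the file name are all assumptions to tune for your own material:

```python
# Sketch of option 4: keep tesseract as-is, then normalize a few known
# ASCII-isms to the glyphs you want and flag dubious words for a human pass.
# Assumes pytesseract + Pillow are installed; threshold and table are guesses.
import pytesseract
from PIL import Image

REPLACEMENTS = {"(C)": "©", "(TM)": "™", "1/2": "½", "1/4": "¼", "3/4": "¾"}

def ocr_with_review(path, min_conf=60):
    data = pytesseract.image_to_data(Image.open(path),
                                     output_type=pytesseract.Output.DICT)
    words, review = [], []
    for text, conf in zip(data["text"], data["conf"]):
        if not text.strip():
            continue
        for ascii_form, glyph in REPLACEMENTS.items():
            text = text.replace(ascii_form, glyph)
        words.append(text)
        if float(conf) < min_conf:
            review.append(text)   # queue for the manual copy-edit pass
    return " ".join(words), review

full_text, needs_review = ocr_with_review("page_001.png")  # placeholder name
print(full_text)
print("words to eyeball:", needs_review)
```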
- "I've got a mostly English language set of scans (image quality is good but not great, but best I can do without a better scanner"
Personal experience to date is that image preprocessing is a "field of active research" (i.e. you need to try and test all your own and any others' ideas that sound more or less reasonable) and has a very strong effect on the outcome of the OCR stage. For instance, you may want to rescale your scanned images and see at which text pixel height they do best; previous research says text at 30-33 pixels height is optimal, but yours might differ a little from that, so experiment! (I'll try a tesseract run tomorrow on an image you posted earlier, at various resize sizes, to see what comes out of that one.)
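As a concrete way to run that rescaling experiment, something like the sketch below would do: try a handful of scale factors and watch where the word confidences peak. The factors and the "mean word confidence" yardstick are just my assumptions; substitute whatever quality measure you trust, and your own file name:

```python
# Sketch: rescale a scan to a few different sizes and see where tesseract's
# word confidences peak. Assumes pytesseract + Pillow; the factors are
# guesses meant to bracket the "text roughly 30-33 px high" rule of thumb.
import pytesseract
from PIL import Image

img = Image.open("page_001.png")   # placeholder file name

for factor in (0.75, 1.0, 1.5, 2.0, 3.0):
    scaled = img.resize((int(img.width * factor), int(img.height * factor)),
                        Image.LANCZOS)
    data = pytesseract.image_to_data(scaled,
                                     output_type=pytesseract.Output.DICT)
    confs = [float(c) for c, t in zip(data["conf"], data["text"])
             if t.strip() and float(c) >= 0]
    mean = sum(confs) / len(confs) if confs else 0.0
    print(f"scale {factor:>4}: mean word confidence {mean:.1f} "
          f"({len(confs)} words)")
```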
Ditto for post-processing: it might be useful, if the content is important enough to you, to dump it into a word processor / text editor with spellchecker on board for further assistance. A manual review process of some kind is called for, anyway, if you want consistent (very) high quality output.
There are also processors/tools that can do "smart quotes" if you like, but I would reserve that for last; my initial approach would be to have the OCR engine spit out plain quotes wherever they occur and then convert them to "smart" open/close quotes in post, if I wanted them. French quotes (guillemets) would potentially be easier to OCR that way (as they sit at different vertical offsets), but I'd be glad to have *any* kind of quote coming out of the OCR machine: the training sets have been trained on a gazillion fonts, and intricate little typography details like "smart quotes" are rather font-specific, so recognizing them from an OCR engine's perspective screams "tuning! dedicated font training!" and a little headache starts to develop over here. ;-))
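For completeness, the "smart quotes in post" idea really is only a few lines; the regexes below are a simplistic sketch of my own and will need care around nested quotes and odd punctuation:

```python
# Sketch: convert plain OCR'd quotes to "smart" open/close quotes in post.
# Deliberately simple; nested quotes and unusual punctuation will need care.
import re

def smarten_quotes(text):
    # Opening double quote: at start of text, after whitespace or an open bracket.
    text = re.sub(r'(^|[\s(\[])"', '\\1\u201c', text)
    text = text.replace('"', '\u201d')                 # whatever is left, closes
    # Apostrophes inside words (it's, don't) become right single quotes.
    text = re.sub(r"(?<=\w)'(?=\w)", '\u2019', text)
    # Opening single quote after whitespace / start / bracket.
    text = re.sub(r"(^|[\s(\[])'", '\\1\u2018', text)
    text = text.replace("'", '\u2019')                 # remaining singles close
    return text

print(smarten_quotes('He said "it\'s a \'good\' scan", I think.'))
```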
- "Slightly related, how, exactly, do y'all deal with drop caps?"
Errrrm, AFAICT.... we don't. Apologies. Seriously though, I don't recall any positive success info on that one.
Here my initial gut response is to "recognize" the drop caps in a preprocessor, i.e. in the "image segmentation phase", and cut them out specifically so they are extracted, rescaled to a sensible "regular text size" and only then fed into the OCR engine. Afterwards, the output has to be recombined with the text produced from the rest of the image segments. BUT that is mere theory, as tesseract does not yet have a module/subprocess to "identify" possible drop caps and segment and process them as I just described. Which means that today you either do that up front and do the recombining afterwards in your own custom postprocess, or you decide to accept a little extra editorial post-work by either keeping them in as-is (and expecting errors, or at least uncertainties reported by the OCR engine) or maybe tipp-ex-ing ;-) them out in preprocessing and hoping the engine's built-in dictionary resolves half of them through spelling correction. Anyway, this is all currently non-existent, alas, so anything you come up with is better than what exists today. (A rough sketch of the cut-out-and-recombine idea follows below.)
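To illustrate that cut-out-and-recombine idea (and only to illustrate it): the sketch below crops a hand-picked drop-cap box, shrinks it to roughly body-text size, OCRs it as a single character (page segmentation mode 10), blanks it out of the page, OCRs the rest, and glues the two together. The coordinates and file name are made up; automatically *finding* the drop cap is exactly the part that doesn't exist yet:

```python
# Sketch of the manual drop-cap workaround: crop the big initial, scale it
# down, OCR it as a single character, OCR the page with the initial blanked
# out, and recombine. Assumes pytesseract + Pillow; box coords are made up.
import pytesseract
from PIL import Image

page = Image.open("page_001.png").convert("RGB")   # placeholder file name
drop_cap_box = (120, 340, 260, 500)                # (left, top, right, bottom), hand-picked

# 1) OCR the drop cap on its own, shrunk to roughly body-text size.
cap = page.crop(drop_cap_box)
cap = cap.resize((max(1, cap.width // 4), max(1, cap.height // 4)), Image.LANCZOS)
cap_text = pytesseract.image_to_string(cap, config="--psm 10").strip()  # single character

# 2) Blank out ("tipp-ex") the drop cap in the page and OCR the remaining text.
body = page.copy()
body.paste((255, 255, 255), drop_cap_box)
body_text = pytesseract.image_to_string(body)

# 3) Recombine; the first body word usually holds the rest of the drop-cap word.
print(cap_text + body_text.lstrip())
```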
(I am working on my own copy of tesseract which might improve this a little, but don't expect any miracles there this quarter. I'm /slow/.)
Take care,
Ger
P.S.: this was lying around here for a gander, but my own tesseract build is buggered ATM. Anyway, I installed an "official distro" one yesterday for other purposes, and I'll see how your previously posted scans fare with that one when I test a few things on them. To be reported later this week, possibly tomorrow afternoon.