Prescription scan recognition

Mert T

Feb 8, 2024, 11:16:16 AM
to tesseract-ocr
Hello,

I'm new to Tesseract and my problem is that the text recognition has many errors. I'm scanning a prescription in German and want to process only certain areas.
So I created those areas (marked in blue) as new Bitmaps and used them in the ProcessImage method. I preprocessed the Bitmap with AForge to get rid of the red text and make the gray text darker (see screenshot). The "X" is not recognized; if any letter is recognized there, the checkbox should be checked.
I tried to get better results with a higher scan quality (600 dpi), but I actually got the best results at 150 dpi.
Tesseract has many features; I tried some of them, but I don't know how to use them well enough to solve my problems. Could someone help me out?

Thanks.

Here is my code:

public string ProcessImage(Bitmap image)
{
    image = RemovePinkTextAndMakeGrayTextDarker(image);

    using var engine = new TesseractEngine("./tessdata", "deu", EngineMode.Default);
    // Convert the Bitmap to a Pix and hand that to Tesseract.
    using var img = PixConverter.ToPix(image);
    using var page = engine.Process(img, PageSegMode.AutoOsd);
    return page.GetText();
}

private Bitmap RemovePinkTextAndMakeGrayTextDarker(Bitmap image)
{
    // Replace the pink form print with white so it disappears.
    var pinkFilter = new EuclideanColorFiltering
    {
        CenterColor = new RGB(Color.HotPink),
        Radius = 80,
        FillColor = new RGB(Color.White),
        FillOutside = false
    };
    pinkFilter.ApplyInPlace(image);

    // Push the gray text towards black for more contrast.
    var grayFilter = new EuclideanColorFiltering
    {
        CenterColor = new RGB(Color.DarkGray),
        Radius = 80,
        FillColor = new RGB(Color.Black),
        FillOutside = false
    };
    grayFilter.ApplyInPlace(image);

    return image;
}
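
The cropping itself is just a Clone of the known rectangle (the coordinates below are only example values, not my real form layout):

// Crop a known form region into its own Bitmap before preprocessing.
private Bitmap CropRegion(Bitmap source, Rectangle area)
{
    return source.Clone(area, source.PixelFormat);
}

// e.g.: var checkboxArea = CropRegion(scan, new Rectangle(120, 340, 40, 40));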

Attachments: 150 scan.png, Screenshot marked.png, Scanarea.png


Mert T

Feb 15, 2024, 4:22:38 AM
to tesseract-ocr
Any ideas?

Ger Hobbelt

Feb 15, 2024, 11:06:38 AM
to tesseract-ocr
Re "X" checkbox:

Since this is (I assume) a standardized form, those checkboxes are at known, fixed positions.

Couple of thoughts:

1: assuming everyone "crosses" a checkbox is a faulty assumption. Some people, depending on circumstances, "blacken" the box in other ways, all legal and to be expected:
- a slash with a pen, sometimes a fat one: marked = checked.
- an arbitrary squiggle to "fill" the box more or less; when I observe people, I often see shapes like a flattened S or a Greek Xi, but expect circles (filled and unfilled) and really any other shape that isn't too much effort to put plenty of ink onto an area.
- rare, but it happens: fully blackened. Think: bored/upset/angry/tic/OCD/autism/... Talk to people who process (paper) voting forms if you want to research this; that would be my first stop anyway. (The Dutch use paper voting forms which, by law, must be inspected by humans, so you will find knowledge and observations there that a cheaper, less quality-oriented machine process won't ever give you. Find out who volunteers for voting committee duty and take it from there.)

Bottom line: observe what humans do, and consider what they might do, before taking an example given to you by your client or boss as "it" - your success here depends on the supplier: you'll have to train them too ;-)


2: given that a checkbox is not a letter/word field but an inked/not-inked field, feeding it to tesseract is both overkill and adverse to success. Better to apply those image filters and count the number of black pixels in the (known) area: ANYTHING (any ink blot) above a certain (low) threshold signifies "checked"; the threshold is low but not zero, to account for dirt, coffee stains and other real-world mishaps that can ruin a scan. To be determined in a field test during development, I would say.
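
Something like this, as an untested C# sketch in the style of your code; the rectangle and both thresholds are placeholders to be calibrated against real (and deliberately dirtied) scans:

// Untested sketch: decide "checked" by counting dark pixels inside the
// known checkbox rectangle. Thresholds are placeholders; calibrate them
// in a field test. (Use LockBits instead of GetPixel for speed.)
private static bool IsCheckboxMarked(Bitmap image, Rectangle box,
    int darknessCutoff = 128, double minInkFraction = 0.02)
{
    int darkPixels = 0;
    for (int y = box.Top; y < box.Bottom; y++)
    {
        for (int x = box.Left; x < box.Right; x++)
        {
            Color c = image.GetPixel(x, y);
            // Plain channel average as luminance; any ink counts, whatever its shape.
            if ((c.R + c.G + c.B) / 3 < darknessCutoff)
                darkPixels++;
        }
    }
    // Non-zero floor so dirt or a coffee stain alone doesn't flip the box.
    return darkPixels > box.Width * box.Height * minInkFraction;
}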

Ruin a few extra forms with partial cola or coffee baths and include them in your test sets if you care about input range / quality. For dirt, which will cause incorrect machine conclusions if you don't sensibly filter+threshold, drop a few forms on the wet earth outside and tread on them. (If I were in your spot then, yes, I would expect forms with boot tread marks across them as part of the feed once you get this "into production", aka "when we went live". Drop, abuse, dry, scan.)

People carry forms. Shit happens on the way to you. And they will gladly entertain the thought of lynching when your "software" gives them no meds or the wrong stuff.
Bottom line: your "minimum acceptable output quality" is strongly dependent on where you sit within (or outside) the handing-out-meds primary process.


Anyhow: checkboxes, from my perspective, don't need heavy CPU-loading and power-burning AI solutions. Image filtering, however, is a must: preprocessing FTW! ;-)


-----

About the text fields:

I haven't tested your images but I expect medium-grade success rates; as I wrote in another conversation on this mailing list a few weeks ago, tesseract is engineered to "read" books, academic papers, etc.
That also means the specific JARGON of medical prescription forms does not match that world view to a tee. Hence you will need further work (dedicated model training) for any OCR engine (tesseract, trocr, etc.) you apply.

The JARGON I see here has at least two categories:

1: medical brand names, chemicals, etc.: commercial recognizers for speech and writing have dedicated medical models, and that's what you pay for. Big bucks, as it's lots of specialized effort. From what I saw in speech recognition, these are offered per medical field where possible, because the vocabularies are huge: when you want to recognize psychiatry jargon, then anaesthesia jargon and all the rest is just horribly complicating NOISE that makes recognition all that much harder. So you get rid of it at the earliest opportunity -- in this case, the workflow design and system criteria phase.

Take-away/lesson: for top quality you must investigate the incoming "language" and produce and train a tailored model. A larger language means a fast-increasing work cost estimate. (Human work for creating and training the model; then the same for the machine, as the machine model will be sized accordingly.)


2: part of the JARGON, or "the language spoken here" if you will, is the set of shorthand "words", e.g. "3x", "70mg", "w.food" (to be taken together with some food), etc. I expect a large and possibly *inconsistent* set of shorthands, as those will surely differ per human author.

Those shorthands are specific to your input language (think of "language" as "anything that can be written here and is to be understood by the recipient: apothecary and/or client", not high-school language ed. -- another unrelated jargon/shorthand bit right there, btw ;-) ) and will be harder to recognize, as they have not featured, or only featured lightly/sparingly, in the tesseract training sets AFAICT. Which is the reason I expect only *medium-grade* recognition quality out of the box, for tesseract or anything else you grab off the shelf.



Last thought: if your input always comes in the same "look" (same font and same brand of dot matrix printer serving as the "physical renderer"), then you might want to consider looking into trocr or other alternatives, as those are (I suppose) possibly easier to train than a widely generic LSTM+CTC engine such as tesseract. But that's a wild guess, as I haven't done this myself for your or a similar enough scenario.


Another idea there (wild, as in: untested to date as far as I know) is to run this through TWO disparate OCR engines, say tesseract and something like trocr, have both output hOCR format or similar, i.e. the full gamut of content + scores + page pixel coordinates, and then feed those into a judge (using a NN or whatever is found to work for your scenario) as part of your post-processing phase, picking "the best / most agreeable of both worlds" per word or field processed, driven by the word/character scores from each.
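
A toy illustration of that judging step, with everything reduced to per-field text plus a normalized confidence; all names here are made up, and the real thing would work from hOCR word boxes and scores:

public record FieldReading(string Text, double Confidence);

// Toy judge: agreement between disparate engines is the strongest
// signal; on disagreement, take the more confident reading but flag
// the field for human review -- this is a meds workflow, after all.
public static (FieldReading Best, bool NeedsReview) Judge(FieldReading a, FieldReading b)
{
    var best = a.Confidence >= b.Confidence ? a : b;
    return (best, a.Text != b.Text);
}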



Ergo / mgt. summary: the application of tesseract, and any and all preprocessing and postprocessing, is highly dependent on your place in the overarching primary processes, and on how and when it impacts any humans, patients/clients in particular (the ethics board and the DoJ may want a word one day, perhaps). Hence any and all thoughts/ideas or other work, effort and musings not closely involved with your project -- and bound by mutually agreed written and signed contract -- are, at best, to be rated as conjecture and sans merit / subject to all disclaimers of liability of any form and any kind: YMMV.

My thoughts, HTH,

Ger

PS: in this story, I assume the proper placement (aka bracketing) of the form itself, i.e. the image preprocessing BEFORE segmentation and OCR, is *already solved*, so there won't be any doubt about the position and size of each form field, checkbox or otherwise, down to the pixel coordinate level.




Ger Hobbelt

Feb 15, 2024, 12:18:27 PM
to tesseract-ocr


On Thu, 15 Feb 2024, 17:06 Ger Hobbelt, <g...@hobbelt.com> wrote:
Re "X" checkbox:


More shorthand examples in your "input language":

Tabl.  = tablet (pill)
tägl   = täglich (German: daily dosage)


I mention these extra examples (visible in the scanned images) because I find that people generally have a hard time wrapping their head around the CS use of the word "language", as it is CS-specific jargon: a "language" is both the structure and all the "words" (vocabulary) you use. As such, "tägl", "tabl", etc. are just so many more plain *words* in the language used here. A machine doesn't know or care about the human cleverness of constructing shorthands or acronyms. For a recognizer, they are basically just more words, which are that much harder to recognize correctly as they have increased entropy (less internal structure) compared to the other, more usual, words in the language.

Tesseract, and any OCR engine, recognizes a trained (CS jargon!) language. If a word, which you and I may call a shorthand or otherwise, did not feature in the training set, then the "hidden Markov model"-simile in the engine will rank the raw initial pattern recognition result a bit or a lot lower, depending on circumstances, and thus you will observe lower scores for untrained jargon or regular wirds with typos in them, such as the "wirds" just now. (The engine would like to read "words" or "wards", but "wurds" and "wirds" are unlisted, hence English language errors; so while possibly correctly recognizing it as "wirds", it will surely rate that word a (slightly?) lowered score.)

Acronyms, for example "YMMV", are, from a Markov chain / machine perspective, completely nuts, as there's no other word in the English dictionary that contains the "mmv" triple-consonant combo. Hence any recognizer must be explicitly trained to recognize it, by including it in the training dictionary, and by now you'll realize it will require additional training rounds due to its weirdness of having "mmv" in there, plus the moderately rare "(SOW)Y" starter ((SOW) = start-of-word edge marker): "you", "yoghurt", "ypsilon", ... The Y section in your old printed dictionary wasn't all that large either, but it's common enough to have been picked up during training. The "mmv" will kill it, score-wise, if "YMMV" wasn't in the training set. (What OCR system designers do is pass such stuff along with severely lowered scores, marking it as doubtful/untrustworthy/WTF, which I dramatise as "killing it".)

Ditto for your (German and semi-numerical) shorthands: "3x" for "three times" was hopefully part of the trained language model. I haven't checked; I don't know.

Anyway, if that word score drops too low, tesseract decides not to list the word at all in its output. Lots of folks entering this mailing list suffer from that fundamental issue: lower scores and output silence due to feeding tesseract "wirds" that do not exist in the chosen model's training set, such as product SKUs. The issue is often compounded by other score-decreasing circumstances, for nothing is truly easy here.
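
If you want to see those per-word scores before low ones silently vanish from the plain-text output, the C# wrapper used earlier in this thread exposes them through the result iterator. Roughly (untested; the cut-off value is arbitrary):

using var page = engine.Process(img, PageSegMode.AutoOsd);
using var iter = page.GetIterator();
iter.Begin();
do
{
    // Per-word text and confidence (0-100) straight from the engine.
    string word = iter.GetText(PageIteratorLevel.Word);
    float confidence = iter.GetConfidence(PageIteratorLevel.Word);
    if (confidence < 60f) // arbitrary "doubtful" cut-off
        Console.WriteLine($"low score {confidence:F0}: {word}");
} while (iter.Next(PageIteratorLevel.Word));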

Modern high-grade recognizers all have implicitly embedded Markov models (think: trained dictionaries plus word stems and endings; this thought model is off, but close enough for initial comprehension), so you cannot "switch off / disable" the language dictionary for tesseract v4/v5 models like you could/can for the old v3 ones (which obviously do worse in general), and consequently you cannot prevent the engine from "downgrading" shorthands and other words unknown at the training phase.

The corollary of this: it is why medical and legal recognizers, for speech-to-text and print-to-text alike, are highly specialized and dedicated endeavours which come at a steep price. The consequences of an *additional* mistake are very expensive in all regards: not just in liability terms, but ethically and beyond.





Ger Hobbelt

Feb 15, 2024, 12:51:05 PM
to tesseract-ocr
Re tesseract output for "mittag" etc. in your sample: the first port of call for "cleaning up dot matrix print" for OCR, i.e. dedicated image preprocessing, would be googling

leptonica image morphology, open close expand dilate dot matrix

or some such.

While I would go with leptonica for that, as tesseract already uses the same lib and I'd rather code this in C++ or shell/Node, the opencv documentation for the same math ops is more intuitive to me: https://docs.opencv.org/4.x/d9/d61/tutorial_py_morphological_ops.html
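
To stay with C# like the rest of this thread, here's an untested OpenCvSharp sketch of the closing operation; the kernel shape and size are guesses you will have to tune per printer and DPI:

using OpenCvSharp;

// Untested sketch: fuse dot matrix printer dots into solid strokes
// via morphological closing, then hand the result to the OCR step.
using var src = Cv2.ImRead("150 scan.png", ImreadModes.Grayscale);
using var ink = new Mat();
// Otsu binarization, inverted so the ink becomes the white foreground.
Cv2.Threshold(src, ink, 0, 255, ThresholdTypes.BinaryInv | ThresholdTypes.Otsu);
using var kernel = Cv2.GetStructuringElement(MorphShapes.Ellipse, new Size(3, 3));
using var closed = new Mat();
// Closing = dilate then erode: bridges the gaps between printer dots.
Cv2.MorphologyEx(ink, closed, MorphTypes.Close, kernel);
using var result = new Mat();
Cv2.BitwiseNot(closed, result); // back to black ink on white paper for OCR
Cv2.ImWrite("150 scan closed.png", result);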

This is always finicky stuff, so getting the parameters just right is an exercise left to the reader today. ;-)

I do recall dot matrix image woes being mentioned before on this ML, but it's a long while back and a quick search didn't dig up those conversations' hrefs.


Mert T

Feb 21, 2024, 5:48:37 AM
to tesseract-ocr
Thank you for your detailed answer.