Letter and number output for a single character


Scott Curtis

Nov 30, 2023, 11:11:08 AM
to tesseract-ocr
I'm running an image through Tesseract via a PHP library (https://github.com/thiagoalessio/tesseract-ocr-for-php).

The output seems to contain two potential matches for a single character.

I'm expecting "P01.01" but I'm actually getting "PO01.01".
Note the O (letter) and 0 (number, zero).
scott-test.png

Tom Morris

Dec 1, 2023, 6:50:47 PM
to tesseract-ocr
On Thursday, November 30, 2023 at 11:11:08 AM UTC-5 scott...@gmail.com wrote:
I'm running an image through Tesseract via a PHP library (https://github.com/thiagoalessio/tesseract-ocr-for-php).

There's a bunch of really useful information missing (e.g. version of Tesseract), but fortunately this is easily reproducible with the current development version.

The output seems to contain two potential matches for a single character.

That's not what's happening. It's actually recognizing both characters separately, although I'm not sure why. The engine does consider the correct string, but the incorrect string scores higher. I'm not familiar enough with the internals to interpret it, but the debug output is below in case someone else wants to give it a go. As you can see it considers both P01.01 and P0O1.01, but picks the latter because it's got a (marginally) better score.

Tom

Processing word with lang eng at:Bounding box=(21,46)->(143,75)
Trying word using lang eng, oem 1
Created window Convolve of size 530, 1700
Created window ConvNL of size 530, 2000
Created window Lfys64 of size 530, 2000
Created window Lfx96 of size 530, 1534
Created window Lrx96 of size 530, 1534
Created window Lfx512 of size 530, 2000
Created window Output of size 530, 1761
Created window LSTMForward of size 1418, 580
<null>=110 On [0, 2), scores= 100(i=83=0.00155) 100(P=28=0.00331), Mean=99.9928, max=99.9963
P=28 On [2, 8), scores= 26.8(<null>=110=69) 97.1(p=103=2.5) 90.3(<null>=110=9.35) 3.18(<null>=110=96.8) 4.89e-05(<null>=110=100) 6.7e-05(<null>=110=92.2), Mean=36.2217, max=97.0573
0=33 On [8, 9), scores= 67(O=21=31.2), Mean=67.0461, max=67.0461
O=21 On [9, 13), scores= 56.8(0=33=42.5) 10.5(<null>=110=82) 1.01e-05(<null>=110=100) 7.72e-06(<null>=110=100), Mean=16.8307, max=56.8101
1=34 On [13, 18), scores= 64.2(<null>=110=34.6) 99.4(l=87=0.559) 98.5(<null>=110=1.46) 5.78(<null>=110=94.2) 1.12e-05(<null>=110=100), Mean=53.576, max=99.3538
.=23 On [18, 23), scores= 83.1(<null>=110=16.9) 100(,=15=0.00592) 94.8(<null>=110=5.24) 0.0434(<null>=110=100) 2.41e-07(<null>=110=100), Mean=55.5837, max=99.9923
0=33 On [23, 28), scores= 52.6(<null>=110=47.4) 99.8(O=21=0.124) 82(<null>=110=17.9) 0.000255(<null>=110=100) 3.13e-11(<null>=110=100), Mean=46.8875, max=99.8451
1=34 On [28, 33), scores= 46.1(<null>=110=53.8) 99.8(l=87=0.0986) 99.1(<null>=110=0.894) 0.586(<null>=110=99.4) 8.5e-09(<null>=110=100), Mean=49.1089, max=99.8028
0 null_char score=-0.191493, c=-0.191493, perm=2, hash=0
1 null_char score=-0.382826, c=-0.191333, perm=2, hash=0 prev:null_char score=-0.191493, c=-0.191493, perm=2, hash=0
2 label=28, uid=30=P [50 ]A score=-0.66966, c=-0.286834, perm=2, hash=1c prev:null_char score=-0.382826, c=-0.191333, perm=2, hash=0
3 label=28, uid=30=P [50 ]A score=-0.928113, c=-0.258453, perm=2, hash=1c prev:label=28, uid=30=P [50 ]A score=-0.66966, c=-0.286834, perm=2, hash=1c
4 label=28, uid=30=P [50 ]A score=-1.12845, c=-0.200338, perm=2, hash=1c prev:label=28, uid=30=P [50 ]A score=-0.928113, c=-0.258453, perm=2, hash=1c
5 null_char score=-1.39284, c=-0.264391, perm=2, hash=1c prev:label=28, uid=30=P [50 ]A score=-1.12845, c=-0.200338, perm=2, hash=1c
6 null_char score=-1.58412, c=-0.191278, perm=2, hash=1c prev:null_char score=-1.39284, c=-0.264391, perm=2, hash=1c
7 null_char score=-1.95755, c=-0.373434, perm=2, hash=1c prev:null_char score=-1.58412, c=-0.191278, perm=2, hash=1c
8 label=33, uid=35=0 [30 ]0 score=-3.04833, c=-1.09078, perm=2, hash=c45 prev:null_char score=-1.95755, c=-0.373434, perm=2, hash=1c
9 label=21, uid=23=O [4f ]A score=-4.51186, c=-1.46353, perm=2, hash=55200 prev:label=33, uid=35=0 [30 ]0 score=-3.04833, c=-1.09078, perm=2, hash=c45
10 label=21, uid=23=O [4f ]A score=-4.87753, c=-0.365671, perm=2, hash=55200 prev:label=21, uid=23=O [4f ]A score=-4.51186, c=-1.46353, perm=2, hash=55200
11 null_char score=-5.06878, c=-0.191256, perm=2, hash=55200 prev:label=21, uid=23=O [4f ]A score=-4.87753, c=-0.365671, perm=2, hash=55200
12 null_char score=-5.26093, c=-0.192142, perm=2, hash=55200 prev:null_char score=-5.06878, c=-0.191256, perm=2, hash=55200
13 label=34, uid=36=1 [31 ]0 score=-5.47957, c=-0.218643, perm=2, hash=24e8e22 prev:null_char score=-5.26093, c=-0.192142, perm=2, hash=55200
14 label=34, uid=36=1 [31 ]0 score=-5.68541, c=-0.205837, perm=2, hash=24e8e22 prev:label=34, uid=36=1 [31 ]0 score=-5.47957, c=-0.218643, perm=2, hash=24e8e22
15 label=34, uid=36=1 [31 ]0 score=-5.91023, c=-0.224826, perm=2, hash=24e8e22 prev:label=34, uid=36=1 [31 ]0 score=-5.68541, c=-0.205837, perm=2, hash=24e8e22
16 label=34, uid=36=1 [31 ]0 score=-6.10178, c=-0.191543, perm=2, hash=24e8e22 prev:label=34, uid=36=1 [31 ]0 score=-5.91023, c=-0.224826, perm=2, hash=24e8e22
17 null_char score=-6.29319, c=-0.191418, perm=2, hash=24e8e22 prev:label=34, uid=36=1 [31 ]0 score=-6.10178, c=-0.191543, perm=2, hash=24e8e22
18 label=23, uid=25=. [2e ]p score=-6.48452, c=-0.191326, perm=2, hash=1000fa0d5 prev:null_char score=-6.29319, c=-0.191418, perm=2, hash=24e8e22
19 label=23, uid=25=. [2e ]p score=-6.67594, c=-0.191424, perm=2, hash=1000fa0d5 prev:label=23, uid=25=. [2e ]p score=-6.48452, c=-0.191326, perm=2, hash=1000fa0d5
20 label=23, uid=25=. [2e ]p score=-6.8672, c=-0.191259, perm=2, hash=1000fa0d5 prev:label=23, uid=25=. [2e ]p score=-6.67594, c=-0.191424, perm=2, hash=1000fa0d5
21 null_char score=-7.05943, c=-0.192229, perm=2, hash=1000fa0d5 prev:label=23, uid=25=. [2e ]p score=-6.8672, c=-0.191259, perm=2, hash=1000fa0d5
22 null_char score=-7.2507, c=-0.191266, perm=2, hash=1000fa0d5 prev:null_char score=-7.05943, c=-0.192229, perm=2, hash=1000fa0d5
23 label=33, uid=35=0 [30 ]0 score=-7.44305, c=-0.192357, perm=2, hash=6f06c6bc7c prev:null_char score=-7.2507, c=-0.191266, perm=2, hash=1000fa0d5
24 label=33, uid=35=0 [30 ]0 score=-7.63779, c=-0.194738, perm=2, hash=6f06c6bc7c prev:label=33, uid=35=0 [30 ]0 score=-7.44305, c=-0.192357, perm=2, hash=6f06c6bc7c
25 label=33, uid=35=0 [30 ]0 score=-7.83177, c=-0.193978, perm=2, hash=6f06c6bc7c prev:label=33, uid=35=0 [30 ]0 score=-7.63779, c=-0.194738, perm=2, hash=6f06c6bc7c
26 null_char score=-8.02303, c=-0.19126, perm=2, hash=6f06c6bc7c prev:label=33, uid=35=0 [30 ]0 score=-7.83177, c=-0.193978, perm=2, hash=6f06c6bc7c
27 null_char score=-8.21431, c=-0.191279, perm=2, hash=6f06c6bc7c prev:null_char score=-8.02303, c=-0.19126, perm=2, hash=6f06c6bc7c
28 label=34, uid=36=1 [31 ]0 score=-8.40869, c=-0.194379, perm=2, hash=3023f02bb9e6 prev:null_char score=-8.21431, c=-0.191279, perm=2, hash=6f06c6bc7c
29 label=34, uid=36=1 [31 ]0 score=-8.60438, c=-0.195692, perm=2, hash=3023f02bb9e6 prev:label=34, uid=36=1 [31 ]0 score=-8.40869, c=-0.194379, perm=2, hash=3023f02bb9e6
30 label=34, uid=36=1 [31 ]0 score=-8.79638, c=-0.192, perm=2, hash=3023f02bb9e6 prev:label=34, uid=36=1 [31 ]0 score=-8.60438, c=-0.195692, perm=2, hash=3023f02bb9e6
31 null_char score=-9.00085, c=-0.204469, perm=2, hash=3023f02bb9e6 prev:label=34, uid=36=1 [31 ]0 score=-8.79638, c=-0.192, perm=2, hash=3023f02bb9e6
32 null_char score=-9.1921, c=-0.191251, perm=2, hash=3023f02bb9e6 prev:null_char score=-9.00085, c=-0.204469, perm=2, hash=3023f02bb9e6

Second choice path:
2 30=P [50 ]A r=1.12845, c=-0.286834, s=0, e=0, perm=2
8 35=0 [30 ]0 r=3.98936, c=-2.11872, s=0, e=0, perm=2
13 36=1 [31 ]0 r=1.86124, c=-0.636992, s=0, e=0, perm=2
18 25=. [2e ]p r=0.765426, c=-0.191424, s=0, e=0, perm=2
23 35=0 [30 ]0 r=0.964568, c=-0.194738, s=0, e=0, perm=2
28 36=1 [31 ]0 r=1.36033, c=-0.204469, s=0, e=0, perm=2
Path total rating = 10.0694
2 30=P [50 ]A r=1.12845, c=-0.286834, s=0, e=0, perm=2
8 35=0 [30 ]0 r=1.91988, c=-1.09078, s=0, e=0, perm=2
9 23=O [4f ]A r=1.8292, c=-1.46353, s=0, e=0, perm=2
13 36=1 [31 ]0 r=1.22425, c=-0.224826, s=0, e=0, perm=2
18 25=. [2e ]p r=0.765426, c=-0.191424, s=0, e=0, perm=2
23 35=0 [30 ]0 r=0.964568, c=-0.194738, s=0, e=0, perm=2
28 36=1 [31 ]0 r=1.36033, c=-0.204469, s=0, e=0, perm=2
Path total rating = 9.1921
Best choice: accepted=0, adaptable=0, done=1 : Lang result : P0O1.01 : R=9.1921, C=-10.2447, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
pos NORM NORM NORM NORM NORM NORM NORM
str P 0 O 1 . 0 1
state: 1 1 1 1 1 1 1
C -0.287 -1.091 -1.464 -0.225 -0.191 -0.195 -0.204
1 new words better than 0 old words: r: 9.1921 v 0 c: -10.2447 v 0 valid dict: 0 v 0



Ger Hobbelt

Dec 3, 2023, 12:57:24 AM
to tesseract-ocr
Related material: https://github.com/tesseract-ocr/tesseract/issues/1465

Searching for "diplopia tesseract" and "tesseract btw vat number" should turn up a few similar issues and mailing-list messages from people with similar unwanted effects, all due to the same fundamental cause:

The way an lstm works (IIRC tesseract mentions somewhere that it's using a BLSTM, i.e. a bidirectional lstm) goes something like this:

The text line image (cropped from your input image, where the crop box has been calculated by the tesseract segmentation code) is fed into the lstm engine as vertical scanlines of RGB pixel vectors, and the lstm engine outputs an alphabet probability vector per input pixel vector (aka "time step" in lstm theory papers). "Alphabet" here means "the total set of possible characters we've been set up to recognize", so that's many more than your regular intuitive ABC...Z.

Let's say, for this thought experiment, that every character (*glyph*) in your input image is 10 pixels wide, including the little spacing that separates characters in a printed word. Then you'll observe 10 output vectors while that character's image is scanned and fed into the lstm engine. Each output vector gives a probability for *every* character in the output alphabet, which for English is the set a..z, A..Z, 0..9, several punctuation symbols and a few others.
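A minimal sketch of the shape of that output (all numbers below are invented for illustration, and the alphabet is shrunk to a handful of entries; the real one has hundreds):

```python
# Hypothetical per-time-step output of an lstm OCR engine (numbers invented).
# For a 10-pixel-wide glyph we'd get 10 such vectors; only 5 shown here.

alphabet = ["O", "0", "o", "C", "ϵ"]  # ϵ = "no character" (CTC blank)

# One dict per pixel column (time step); each sums to 1.0.
timesteps = [
    {"O": 0.05, "0": 0.05, "o": 0.05, "C": 0.05, "ϵ": 0.80},  # gap before glyph
    {"O": 0.40, "0": 0.25, "o": 0.10, "C": 0.20, "ϵ": 0.05},  # left curve seen
    {"O": 0.45, "0": 0.35, "o": 0.10, "C": 0.05, "ϵ": 0.05},  # "O" still ahead
    {"O": 0.35, "0": 0.50, "o": 0.08, "C": 0.02, "ϵ": 0.05},  # doubt flips to "0"
    {"O": 0.05, "0": 0.05, "o": 0.05, "C": 0.05, "ϵ": 0.80},  # gap after glyph
]

for t, probs in enumerate(timesteps):
    best = max(probs, key=probs.get)
    print(f"step {t}: top choice {best!r} (p={probs[best]:.2f})")
```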

Let's say we're scanning the first zero in your "P01.01" input image. Most probabilities will then be very low and thus negligible, but the "0" looks quite like an "O" (capital oh), somewhat like an "o" (lowercase oh) and, of course, quite like a "0" (zero).
That's one part of the story/process. There's more ...


The lstm engine is usually trained with a large dictionary, and tesseract (if I read the papers by Ray Smith from Google, who set it all up, correctly) was targeted at ocr'ing papers, books and similar printed publications. So we have to consider the question: what was/is in those dictionaries, and how does that affect us? The answer: lots of words you'll find in regular printed dictionaries, and very few, if any at all, "codes", i.e. "words" which start with one or more letters (A-Z) followed by digits (0-9), as that kind of thing is not expected to show up in books or academic papers, the original target for tesseract (AFAICT).

Assuming this training assumption is correct (and current tesseract's behaviour tells me I'm not far off the mark), we can combine it with the concept of "Markov chains" (see Wikipedia and/or your formal training at uni). Yes, an lstm is far more complex and powerful than an HMM or a basic Markov chain, but Markov chains serve well enough here to understand what's going on, and they're far more intuitive for most people, including yours truly ;-) so I'll use them while trying to explain what you are experiencing. For this scenario we can say that lstm engines behave like ("smarter, more able") Markov chains, which I'll try to explain using a few examples:

Take English, and go back to our youth playing the hangman game (Dutch: galgje), where you have to guess a word with the fewest possible guesses. You quickly learn to try particular characters from the A-Z alphabet first for maximum gain: the vowels, "s" and "n", maybe "t", and only then do you start to really use your brain, if you were like me ;-)
That's you, as a kid, intuitively discovering and applying character frequency of occurrence, i.e. Shannon entropy. (Another road into this may have been encountering Huffman coding.) We'll get back to tesseract internals shortly, but for now imagine you are given the left half of a printed character, that left half looks like "🤇" (a left half-circle), and your task is to guess which character in the English alphabet this might be/become: your best bet is "O", second comes "C", and when numeric digits are allowed, "0" is another good candidate. The lstm engine is no smarter than you, so you can expect it to rank those relatively high, i.e. give these characters high (above-threshold) probabilities. And we haven't applied Markov chains yet, which help us improve our estimates!

The easiest example of Markov's usefulness is the letter "q". If we only use character frequency/probability, i.e. Shannon, we know "q" doesn't show up all that often. Another letter, "u", is a vowel, but not the most frequent one by far, so that's another relatively low probability if all we have is Shannon. Markov chains, however, look at the *preceding code element* for improved statistics, and that is a game changer: when you observe a "q" in your input stream, it's a near-100% bet that it will be followed by a "u" and nothing else! (This "qu" sure thing holds for a lot of European languages; only when you start feeding Chinese pinyin or Chinese person names to your machine do you lose that particular bet, as then you'll observe input such as "Qi Qi", which is a woman's name, like Mary and John, but I digress ...)

So, taking this additional knowledge about frequency of occurrence, hence *probability*, of *character sequences* that Markov chains give us, we realize that once we've observed a "P", the probability distribution for the expected next character changes noticeably. One very good option is the sequence "po", temporarily raising the probability of an "o" occurring next, while the probabilities of many other characters temporarily decrease: "n" may be quite frequent on its own (notice those "qu"s just now, by the way? 😉), but "pn" is highly improbable. The only word with "pn" I can think of right now, off the top of my head, is "apnea", which I haven't heard all that often in human conversation, have you? 😁
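A toy bigram ("Markov chain") character model makes this concrete. All counts below are invented for illustration, not taken from any real corpus:

```python
# Toy bigram character model: P(next char | previous char) from made-up counts.

bigram_counts = {
    "q": {"u": 990, "i": 10},                     # "qu" is a near-sure thing
    "p": {"o": 400, "r": 300, "l": 250, "n": 5},  # "pn" is highly improbable
}

def next_char_prob(prev, cand):
    """P(cand | prev) estimated from the toy counts."""
    counts = bigram_counts[prev]
    return counts.get(cand, 0) / sum(counts.values())

print(next_char_prob("q", "u"))  # 0.99
print(next_char_prob("p", "n"))  # roughly 0.005
```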


So far, so good: we now expect to rank anything that starts to look a little curvy (left half-circle! = only 5 pixel vectors done and already we are like the kid in class who'd raise their hand before the teacher has even finished asking the question) as a sure thing: "PO" output!

The lstm engine is quite like that, and it starts to yell this-is-probably-capital-Oh (plus a few more highly-probables) quite early while we scan across that initial zero. Of course the tesseract engine, having been trained on numbers alongside those human dictionary words, is a little doubtful while keeping its hand raised, so "0" is listed as the second-best probability. As we near the end of the character, that doubt may turn into a rising awareness that this bugger is looking more and more like a zero, however unexpected: it got no "codes" during training, so *any* numeric digit immediately following a letter (A-Z) is very much unexpected by the Markov chain / lstm engine. (Yes, nitpickers, an lstm engine is not an emotional human, nor an HMM / pure Markov chain, but the story holds well enough and the machine reality is very similar on this occasion.)

Lstm "memory" is more powerful than mere Markov chain technology because it can look further back than just the previous character, so a French "EA" occurrence will kick next-bet-is-"U" into the top ranks, for example: chapeau, bureau and many more have taught it to bet that way.
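That longer lookback can be caricatured as conditioning on more than one preceding character; a toy trigram table (counts invented, again) instead of a bigram one:

```python
# Toy trigram model: condition on the TWO previous characters, a crude
# stand-in for the lstm's longer memory. After "ea" (think chapeau, bureau),
# "u" dominates in this made-up table.

trigram_counts = {
    ("e", "a"): {"u": 800, "n": 120, "t": 80},
}

def p_next(prev_pair, cand):
    """P(cand | two previous chars) from the toy counts."""
    counts = trigram_counts[prev_pair]
    return counts.get(cand, 0) / sum(counts.values())

print(p_next(("e", "a"), "u"))  # 0.8
```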


Now we combine the two stories above (Markov/lstm context-driven probability distributions + vertical scanline input processing giving one prediction output vector per horizontal pixel step) and you can expect to see tesseract's lstm spit out probability sequences such as

. . O/0/C O/0/C O/0 O O . C/0/O 0/O

for the 10-pixel-wide "0" image part we assumed at the beginning (the first character in each /-separated set being ranked highest, obviously, etc.). The middle "." in this 10-vector sequence is tesseract getting confused / in doubt, and it's very common to see this kind of bumpy/hiccupping probability output coming out of lstm engines.




Which leads us to the third and last major magic part of the tesseract OCR engine process: how does tesseract cope with such noisy/burpy lstm output? 

Tesseract is not alone in this and many other lstm users / researchers faced the same problem. If you don't do anything about it, chances are that a simple "hello" image input could end up as "lhheeeellooooo" text output, which is... not so great.
Of course you can feed this to hunspell or similar swift dictionary + edit-distance based validation and correction technology, get "hello" as the most probable candidate and take it from there, and many have. But some folks came up with something that *should* be a bit more flexible than basic (fast) dictionary lookup + Levenshtein et al., and that something is built into tesseract alongside the lstm engine: CTC (Connectionist Temporal Classification). You can consider it a sort of lstm engine output post-processor: it is fed those alphabet probability vectors (one per horizontal input pixel) and outputs, let's say, a *cleaned up revision* of those, so that we can get "hello" without the obnoxious repeats.
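The collapse rule itself is simple enough to sketch. This is the textbook greedy CTC decode (merge consecutive repeats, then drop the blanks), not tesseract's actual beam-search code path:

```python
# Textbook CTC collapse: merge runs of the same symbol, then remove blanks.

BLANK = "ϵ"

def ctc_collapse(frames):
    out, prev = [], None
    for ch in frames:
        if ch != prev:          # merge consecutive repeats
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != BLANK)  # drop blanks

# Per-frame top choices for an imagined "hello" line; the blank between the
# two "l" runs is what preserves the double letter.
frames = list("hhee") + [BLANK] + list("ll") + [BLANK] + list("lloo")
print(ctc_collapse(frames))  # hello
```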

Ditto for the scenario where we were scanning through the pixel columns of that initial "0" following the "P" in "P01.01".





INTERMISSION 😰: I am still struggling to grok CTCs myself; it's all nice and dandy when they work out of the box, but I know I don't *get* them yet, as I don't know what I could do/modify when these buggers *fail*, such as in the diplopia and OP's "P01.01" scenarios. So take my words as a "possibly-maybe" guide where I talk about CTC, and do read the related papers, keeping in mind that I may have got it all *wrong*! Just a friendly warning: YMMV, caveat emptor and all that jazz. 😅


While Wikipedia is leading me astray where it concerns CTCs, https://sid2697.github.io/Blog_Sid/algorithm/2019/10/19/CTC-Loss.html is one page that proved helpful in my earlier attempts to grok CTCs; check it out and do your own research.






My current understanding goes something like this: let's take that burpy lstm output we listed above:

. . O/0/C O/0/C O/0 O O . C/0/O 0/O

which tesseract will feed into its CTC engine part, which *should ideally* clean it up and spit out this (what tesseract CTC *actually* produces we'll discuss after this):

ϵ ϵ O/0 O/0 O/0 O/0 O/0 O/0 O/0 O/0

where ϵ denotes "empty", i.e. "expect the next decoded character after this break", so that tesseract would parse the "P01.01" image as "PO1.01", as "O" is shown as the most probable in this example. And yes, that "O mistake" is intentional! We'll address it at the end, because OP is facing two practical problems at once:

- the trigger: tesseract, trained to decode books and papers, i.e. regular/sophisticated human-written language, is fed "text codes" here instead ("P01.01", or Dutch VAT (BTW) "numbers" about a month ago on this mailing list). These are outside the trained language models and thus far outside the area of high-quality, low-WER (word error rate) output one may otherwise expect from an ocr engine like this,

- root cause (AFAICT): the CTC stage, as a concept, is not engineered to cope with recognition confusion like this, where the lstm memory characteristic ("Markov"-like context-sensitive statistics, roughly speaking) drives probability towards human-language word forms, which initially "win out" (O!) and then get flipped halfway through, due to increasing shape hints pushing towards numeric digit recognition (0!), with possible ϵ outputs producing confusion along the way.

This is where things get really complicated, as you will observe the same sort of CTC "mistake" happening in the mentioned "diplopia" case: AFAICT the lstm engine may output intermittent, rare(!) ϵ vectors when input shapes are *somehow* bad enough that the lstm engine becomes "unsure" halfway through a character, and thus produces "heeelloϵooo", resulting in CTC output "helloo" instead of "hello". Handwave here, as I'm seeing a lot of noise, debugging this thing is a 🐖, and as I said: I don't grok CTCs yet, so reasonable doubt aplenty. (I still keep my options open, as I sometimes see sequences without that in-the-middle-of-it ϵ that still produce screwed-up final output, so there might be a bug lurking in there, as well as fundamentally outside-training-remit use triggering all sorts of potential weirdness. Plus my own sub-par comprehension of CTC, which I warned y'all about... 🤦)



Anyhow, while the "ideal" expected CTC output while scanning the initial "0" (zero) should be either the afore-mentioned:

ϵ ϵ O/0 O/0 O/0 O/0 O/0 O/0 O/0 O/0

or even better for OP:

ϵ ϵ 0/O 0/O 0/O 0/O 0/O 0/O 0/O 0/O

the harsh reality is the CTC stage producing something rather more like this:

ϵ ϵ O/0 O/0 O/0 O/0 O/0  ϵ  0/O 0/O

and that's where it all turns to 💩, as the final chunk of tesseract code takes the CTC output, removes the character run-lengths (as it should), picks the top choice and produces usable text plus character bounding boxes. Since lstm+CTC don't produce character pixel boundary markers by design, heuristics are used to chop the word image into character boxes within the word bounding box produced by the segmentation stage:

PO01.01

in this case the bounding boxes for the "O" and the initial "0" largely overlap.
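Running the textbook greedy decode over those frame sequences (a sketch; tesseract's real path is a beam search over the whole probability lattice) shows how the stray mid-glyph ϵ, together with the top choice flipping from "O" to "0", yields two output characters for one glyph:

```python
# Textbook CTC collapse again: merge runs of the same symbol, drop blanks.

BLANK = "ϵ"

def ctc_collapse(frames):
    out, prev = [], None
    for ch in frames:
        if ch != prev:
            out.append(ch)
        prev = ch
    return "".join(c for c in out if c != BLANK)

# Ideal frames for the single zero glyph: one consistent run, one char out.
print(ctc_collapse([BLANK, BLANK] + ["O"] * 8))                        # O
# A stray blank splits the run, and the flipped top choice after it means
# the single glyph decodes as TWO characters:
print(ctc_collapse([BLANK, BLANK] + ["O"] * 5 + [BLANK] + ["0"] * 2))  # O0
```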

All of which may be written to hocr files or csv format or other output formats, the latter carrying the same or less information.



Which brings us to the end:

Currently, you can detect such diplopia/confusion occasions by postprocessing the tesseract output in userland scripts/code, which should inspect the character-level bounding boxes to detect these issues with (hopefully!) high probability. The bits working against making this a "sure thing" are:

- regular print already has plenty of situations where characters' bounding boxes overlap a little or a lot: think
  + cursive & slanted text
  + ligatures (which may or may not be expanded)
  + negative kerning in original print ("Ta" and other likely candidates)

- the lstm (v4/v5) engine doesn't do character boundary markers, so character boxes are calculated using heuristics. I still have to dig through that chunk of code again, but the smell it gives off is one where (quite sensibly!) some basic font metrics are assumed and character shape start positions are estimated from the lstm output. I've yet to reread/check this code chunk, but my bet today is that this code can also produce overlapping bounding boxes, until I'm absolutely sure it can't...
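As a sketch of that userland postprocessing idea (the box format and threshold below are my own invention for illustration; in practice you'd pull per-symbol boxes out of tesseract's hocr/tsv output):

```python
# Flag adjacent character boxes whose horizontal overlap is large; such pairs
# may be diplopia-style duplicates like "O" + "0" read from a single glyph.
# Box format assumed here: (char, left_x, right_x). Threshold is a guess.

def overlap_ratio(a, b):
    """Fraction of the narrower box covered by the horizontal intersection."""
    _, al, ar = a
    _, bl, br = b
    inter = min(ar, br) - max(al, bl)
    narrower = min(ar - al, br - bl)
    return max(inter, 0) / narrower

def suspicious_pairs(boxes, threshold=0.6):
    return [(a[0], b[0])
            for a, b in zip(boxes, boxes[1:])
            if overlap_ratio(a, b) >= threshold]

# Invented boxes for the misread "PO01.01": "O" and the first "0" nearly coincide.
boxes = [("P", 21, 40), ("O", 42, 60), ("0", 44, 62), ("1", 64, 74),
         (".", 76, 82), ("0", 84, 100), ("1", 102, 112)]
print(suspicious_pairs(boxes))  # [('O', '0')]
```

As the bullet points above warn, cursive text, ligatures and negative kerning will still produce false positives, so the threshold needs tuning per document class.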



HTH,

Ger





