inaccuracy in plain text


Mishal Shanavas

Dec 20, 2023, 7:33:00 AM
to tesseract-ocr
I cannot extract text with reliable accuracy from a simple image.

crop.png



Check it out.

Art Rhyno

Dec 21, 2023, 9:10:17 AM
to tesser...@googlegroups.com

You could try making it smaller, something like:

 

convert -resize 50% text_l.png text_s.png
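
Then feed the smaller copy to tesseract, e.g. (text_s.png is just the placeholder name from the command above; this assumes the standard tesseract CLI is on your PATH):

tesseract text_s.png stdout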

 

Best,

 

art

 

From: tesser...@googlegroups.com <tesser...@googlegroups.com> On Behalf Of Mishal Shanavas
Sent: Wednesday, December 20, 2023 7:29 AM
To: tesseract-ocr <tesser...@googlegroups.com>
Subject: [tesseract-ocr] inaccuracy in plain text

 


I cannot extract text with reliable accuracy from a simple image.

 


Ger Hobbelt

Dec 22, 2023, 1:51:43 PM
to tesseract-ocr
Couple of things to check/test:

- tesseract expects black text (lettering) on a white background: that's what it has been trained on and that's what will work best. Hence: try to convert anything to look like that before feeding it to tesseract (see the example commands after this list).

- tesseract was trained on text that, if I recall correctly, is 11pt. That's what you'll read in several places on the internet, and it is useless info as-is because pt (points) are a printer/publisher unit of measure for *paper* print, not computer images.
 However, this translates to 30-50 px total character height, including the ascenders and descenders of glyphs such as p, q, b and d. So the rule of thumb becomes: try to make your text line fit in 30 to 50 pixels of height for possibly the best results (again, see the sketch after this list). (Someone did in-depth research about this many years ago, published on this list including charts, but I can't find the link within 60 seconds. Lazy me, sorry.)

- tesseract uses dictionary-like behaviour to help guesstimate what it is actually seeing (the LSTM can be argued to behave like a Markov chain; the old-school v3 OCR mode uses dictionaries), and that means tesseract very much likes to see human-language "words". For instance: if you just saw a 'q' and your language is any Indo-European one, you can bet your bottom dollar the next glyph will be a 'u', as in "QUestion".
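
A rough command-line sketch of the first two points above (ImageMagick's convert plus the tesseract CLI; the file names, the threshold and the resize percentage are placeholders/guesses you would tune for your own image):

# force black text on a white background: grayscale, then binarize
convert crop.png -colorspace Gray -threshold 60% bw.png
# resize up or down until a text line is roughly 30-50 px tall
convert bw.png -resize 50% bw_small.png
tesseract bw_small.png stdout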

Yours, however, is a semi-random letter matrix for a puzzle, so you may want to look into ways to circumvent this dictionary behaviour, because you are feeding tesseract stuff that's outside its original training domain (books, publications, academic papers).
One approach to try is to cut the image up into individual character images and feed each one to tesseract individually; you MAY observe better overall OCR results that way (rough sketch below).
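
A minimal sketch of that cut-it-up idea, assuming the puzzle grid uses equal cells of roughly 40x40 px (measure the real cell size from your own image):

# slice the grid into one image per cell
convert crop.png -crop 40x40 +repage cell_%03d.png
# OCR each cell as a single character (--psm 10 = treat the image as a single character)
for f in cell_*.png; do tesseract "$f" stdout --psm 10; done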

Second, since the LSTM is fundamentally like a Markov chain (rather: its core has Markov-like behavioural aspects) and is NOT engineered for single-glyph recognition, you may also want to see how the classic tesseract V3 OCR modes do with your letter matrices, as the older V3 engine is single-shape based and thus *potentially* more suitable for semi-random, independent, single-character inputs like yours.
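
If you want to try that comparison, the engine is selected with --oem (0 = legacy/V3 engine only, 1 = LSTM only); note that the legacy engine needs traineddata files that still contain the legacy model (the ones from the 'tessdata' repo, not tessdata_fast or tessdata_best). Using one of the hypothetical cell images from the sketch above:

tesseract cell_000.png stdout --oem 0 --psm 10
tesseract cell_000.png stdout --oem 1 --psm 10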

My 2 cents. HTH




Zdenko Podobny

Dec 23, 2023, 1:16:22 PM
to tesser...@googlegroups.com
tesseract expects black text (lettering) on a white background: that's what it has been trained on and that's what will work best. Hence: try to convert anything to look like that before feeding it to Tesseract.

This is not needed (in all cases ;-) ): tesseract inverts the image by itself for the LSTM and uses the OCR result with the best confidence. In practice this does not work 100% of the time. But if somebody cares about speed, the best way is to use a binarized image with a white background and black text, plus the parameter tessedit_do_invert=0 (or the new parameter invert_threshold=0.0).
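
For example, that speed-oriented setup could look like this on the command line (bw.png is just a placeholder for an already binarized, white-background image; invert_threshold requires a recent enough Tesseract build):

tesseract bw.png stdout -c tessedit_do_invert=0
# or, on newer builds:
tesseract bw.png stdout -c invert_threshold=0.0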

(Someone did in-depth research about this many years ago, published on this list including charts, but I can't find the link within 60 seconds. Lazy me, sorry.)

"Willus Dotkom" - the link is part of the most ignored part of tesseract (the documentation) - see https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling :-)


Zdenko


On Fri, 22 Dec 2023 at 19:51, Ger Hobbelt <g...@hobbelt.com> wrote:

Ger Hobbelt

Dec 24, 2023, 10:40:51 PM
to tesseract-ocr


On Sat, 23 Dec 2023, 19:16 Zdenko Podobny, <zde...@gmail.com> wrote:
tesseract expects black text (lettering) on a white background: that's what it has been trained on and that's what will work best. Hence: try to convert anything to look like that before feeding it to Tesseract.

This is not needed (in all cases ;-) ): tesseract inverts the image by itself for the LSTM and uses the OCR result with the best confidence. In practice this does not work 100% of the time. But if somebody cares about speed, the best way is to use a binarized image with a white background and black text, plus the parameter tessedit_do_invert=0 (or the new parameter invert_threshold=0.0).

Oh yes, absolutely, but I've seen images where the LSTM "recognized" gobbledygook with a reported score /above/ 0.7, thus skipping that "let's see what the inverted clip gives us" code chunk. While I'm usually fond of some extra detail like invert_threshold, there are way too many novices running into trouble who are probably better off not knowing about this option 😉 so they will put more effort into getting their images to look like white paper (background) with black print on it before they feed it to tesseract and expect any kind of possibly decent result. Or so I hope. 😅


(Someone did in-depth research about this many years ago, published on this list including charts, but I can't find the link within 60 seconds. Lazy me, sorry.)

"Willus Dotkom" - the link is part of the most ignored part of tesseract (the documentation) - see https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling :-)

Right on, bingo!

😰And I didn't check that page for it, while I did run a mailing list search. Whoops!🤦

Seriously though: thanks for mentioning that link again. That info has been very useful, many times over.

Merry Christmas,

Ger



Zdenko Podobny

Dec 25, 2023, 5:15:16 AM
to tesser...@googlegroups.com
I put it in the documentation because I had the same problem as you (finding it) :-)

Zdenko


On Mon, 25 Dec 2023 at 4:40, Ger Hobbelt <g...@hobbelt.com> wrote: