Improve ocr on screendump


player1

Nov 10, 2020, 4:40:40 AM11/10/20
to tesseract-ocr
Hi Folks

I'm new to Tesseract and need some pointers on how to improve the output from a game screen dump.

It has some game stats in different fonts at different sizes, and one font is skewed to the side.

The screen dump has background graphics, but they are toned down so as not to disturb humans reading the page.

The screen dump might come at different resolutions, but the text positions are fixed to particular regions.

So far I have tried reading the page (with Tess4J) at 120 DPI, and only the simplest text, which looks to be about 20 pt in size, is read out correctly; bigger fonts are completely lost.

What options do I have to improve the output from Tesseract?

Quan Nguyen

Dec 20, 2020, 12:26:11 PM12/20/20
to tesseract-ocr
You may need to scale the image to 300 DPI for better results. This is especially true for screenshots, where the resolution is typically at 72 or 96 DPI.
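A minimal sketch of that rescaling idea, on a grayscale image held as nested lists of pixel values (stdlib only; the function name and representation are illustrative assumptions — a real pipeline would use Tess4J/ImageIO or Pillow with a proper resampler rather than nearest-neighbour):

```python
# Upscale a screenshot from its native DPI (typically 72 or 96) towards
# ~300 DPI before handing it to the OCR engine.

def upscale_nearest(pixels, src_dpi, dst_dpi=300):
    """Nearest-neighbour upscale by the (rounded) DPI ratio.

    `pixels` is a list of rows, each row a list of grayscale values.
    """
    factor = max(1, round(dst_dpi / src_dpi))
    out = []
    for row in pixels:
        # Repeat every pixel `factor` times horizontally...
        wide = [p for p in row for _ in range(factor)]
        # ...and every row `factor` times vertically (fresh copies).
        for _ in range(factor):
            out.append(list(wide))
    return out

# A 2x2 image at 96 DPI becomes 6x6 at ~300 DPI (factor 3).
tiny = [[0, 255],
        [255, 0]]
scaled = upscale_nearest(tiny, src_dpi=96)
```

The visual effect is the same image with each source pixel blown up into a 3×3 block, which gives Tesseract more pixels per glyph to work with.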

Ger Hobbelt

Dec 25, 2020, 7:42:58 AM12/25/20
to tesser...@googlegroups.com
Also keep in mind that a lot of folks using tesseract have problems with output quality due to feeding it inverted-color images, i.e. white text on a black background (after converting their inputs to b&w images).

Since you say your inputs are game screens, chances are high you're in that same boat.

For best results, make sure your text is BLACK (darkest), your background WHITE (lightest). This may be done by inverting your image colours before thresholding (converting to pure b&w). 

The generic preprocess for tesseract would thus be:

1: analyze & invert? =>
Making sure the text(s) to ocr are the darkest pixels in your image. 
2: analyze and improve color contrast locally? =>
Locate and remove shadow, vignettes, etc. in any areas of the page (image). Goal: improve outcome of next step by feeding it input that produces the least amount of pixel noise. 
With game inputs, depending on the styling of the game, one simple filter might be to pick one of the color channels (r, g, b) or rotated color channels. In other words: does my image contrast / legibility improve when I look at it through a color filter, e.g. a purple filter or yellow or green? When the text pops and the background "disappears", you've got an easy winner. Sometimes that's all you need.
3: thresholding
Turn your image into b&w, binary color. That is: all is black or white, no more grays. 
There's plenty to find on that on the net, most of it research (and open-source code) aimed at improving scans of old books and manuscripts, but also things like license plates. Test and use what works best for you; picking an appropriate thresholding algorithm will be useful.
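The three stages above can be sketched on a toy RGB image held as nested lists (stdlib only; the function names, the mean-brightness heuristic for deciding whether to invert, and the fixed global threshold are all simplifying assumptions — real inputs would want an adaptive method such as Otsu or Sauvola):

```python
# Stage 2 idea: view the image through a single colour channel
# (0 = red, 1 = green, 2 = blue) to see if the text "pops".
def pick_channel(rgb, idx):
    return [[px[idx] for px in row] for row in rgb]

# Stage 1: make sure the text is the DARKEST thing in the image.
# Heuristic (assumption): background dominates the pixel count, so a
# dark average means light-on-dark text, and we invert.
def ensure_dark_text(gray):
    flat = [p for row in gray for p in row]
    if sum(flat) / len(flat) < 128:
        return [[255 - p for p in row] for row in gray]
    return gray

# Stage 3: binarise — everything becomes pure black (0) or white (255).
def threshold(gray, t=128):
    return [[0 if p < t else 255 for p in row] for row in gray]

# White text on a black background, as a game HUD might produce:
rgb = [[(255, 255, 255), (0, 0, 0)],
       [(0, 0, 0),       (0, 0, 0)]]
out = threshold(ensure_dark_text(pick_channel(rgb, 0)))
# The lone bright "text" pixel ends up black on a white background.
```

The stage order here matches the list above, but as noted below, mixing or re-ordering the stages is fine as long as the end result is clean black text on a white background.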


The entire preprocessing endeavor is for one reason only: feeding the ocr engine images that look closest to the training set: black text on white background. 
If you end up with white text on a black background, results will be rotten, random quality, until you manage to flip it around to black text on a white bg. Anything goes to make it so. If you come up with a preprocess that mixes or re-orders stages 1 & 2, or 1, 2 *and* 3, that's fine: those stages are only there to organize the human thought model. When you come up with a process that consistently delivers the clean(est) BLACK text on WHITE bg as its end result, you're golden.

Sorry for repeating the message, but I've found that the "feed tesseract black-on-white, not white-on-black" mantra is the most important one, particularly for "unconventional inputs". (Not providing any white margin comes second, i.e. cropping images so severely that the text touches the image edges: always leave (or crop and then *add*) a white border.)
When visually evaluating your (trial) preprocess, evaluate based on this question: could this output i got have been printed in a regular book and is it easily legible to me? 
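That white-margin point is easy to apply mechanically. A sketch, again on a nested-list grayscale image (the function name and default padding are assumptions; any padding of a handful of pixels of background colour serves the purpose):

```python
# Surround a (possibly tightly cropped) text image with a white border
# so glyphs never touch the image edges.
def add_border(gray, pad=10, white=255):
    width = len(gray[0]) + 2 * pad
    blank = [[white] * width for _ in range(pad)]       # top / bottom strips
    body = [[white] * pad + list(row) + [white] * pad   # left / right padding
            for row in gray]
    return blank + body + [[white] * width for _ in range(pad)]

# A single black pixel, padded by 2, sits centred in a 5x5 white field.
padded = add_border([[0]], pad=2)
```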

HTH

Ger


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/14e8cf91-b1bf-4301-9652-a03aa661a387n%40googlegroups.com.

Ger Hobbelt

Dec 25, 2020, 8:04:00 AM12/25/20
to tesser...@googlegroups.com
Oh, and before I forget, in your case where text is at KNOWN POSITIONS, I've seen others have very good results by cutting up the 'page' into sections, one for each 'text' on that 'page' and then feeding tesseract these sections one at a time as individual images, then recompositing the OCRed 'page' afterwards by merging the tesseract outputs.

This, in my mind, is part of preprocess step 2 ('local tweaks') but anyway.

That way, you can of course easily *scale* one or more of those sections' images as needed to make it look like all of them are '20px text lines on a page'. You get the drift. ;-)



(mind the cropping remark at the end of the previous message when you do this: always leave (or add) a white border in each extracted image, or you'll get worse results once again)


Alternatively, one could go and MASK the areas of the image where text will never appear, but that would work best if all your texts are about the same x-height. So in your case I'd go with cutting up the screen 'page' into sections and then preprocessing each as necessary.
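Since the OP's text positions are fixed, the cut-up-and-recomposite idea can be sketched with plain slicing (the region table and coordinates are hypothetical; in practice you'd measure them once per resolution, crop each region, preprocess and pad it, OCR it separately, and merge the text outputs):

```python
# Cut fixed regions out of a 'page' held as nested lists of pixels.
def crop_region(gray, x, y, w, h):
    """Return the w-by-h rectangle whose top-left corner is (x, y)."""
    return [row[x:x + w] for row in gray[y:y + h]]

# Hypothetical layout: each stat lives at a known, fixed position.
REGIONS = {
    "score":  (1, 1, 2, 2),   # (x, y, width, height) — made-up values
    "health": (0, 0, 3, 1),
}

# A synthetic 5x4 'screen' where pixel (x, y) holds the value y*10 + x,
# so we can see exactly which pixels each crop picked up.
page = [[r * 10 + c for c in range(5)] for r in range(4)]
sections = {name: crop_region(page, *box) for name, box in REGIONS.items()}
```

Each section can then be scaled and bordered independently (as in the earlier sketches) before being fed to Tesseract one at a time.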

