OCR fails on a preprocessed, visually good-looking image

Fab Pi

Sep 30, 2020, 6:50:44 AM
to tesseract-ocr
Hello,

I am currently working on OCR for detecting text in cropped regions of interest (ROIs). For most of the ROIs it works fine, but in the attached image, for example, tesseract ignores 'Test'. I have tested different --psm modes. The DPI looks fine to me as well.
  • Any suggestions for further testing or preprocessing?
  • Should I try to provide a set of ROIs for tesseract to train on?
Thanks for your help!

cropped_roi_tesseract.png

Jean-Marc Spaggiari

Oct 1, 2020, 9:46:39 AM
to tesseract-ocr
Hi Fabian,

Are you able to try removing the camera picture on the left? Or does it have to stay there? Maybe you can split your picture into smaller ones by looking for vertical delimiters?

JM

Ger Hobbelt

Oct 1, 2020, 2:57:13 PM
to tesser...@googlegroups.com
Hi,

AFAICT, tesseract's OCR quality deteriorates a lot when it is fed 'inverted colors', i.e. white text on a black background. (I can't dig up the tesseract blog post / article where I first saw this mentioned, and Google fails me in this regard right this minute, sorry.)

Second, from what I gather from all the applications/code I've investigated that feed images to tesseract, the last stage is always a [type of] 'threshold' stage where the image is converted to a simple black & white picture: tesseract expects black text on a white background.

Given your purple+yellow "image test" image, a simple threshold action would very probably render it as white text on a black background, which is the wrong way around if you want to get the best performance from tesseract.

Hence a potential solution vector would be:

- Find ways to 'preprocess' your images to ensure each one ends up as black text on a white background after a subsequent thresholding pass. (Do the thresholding yourself in your preprocessing to have maximum control over the image you feed to tesseract.)

  (Quick initial thought: it might be good enough to count the pixels of each hue, find the two major 'bulges' in the color distribution, and code a quick filter which assigns the hues in the smaller hump to black and the ones in the larger hump to white.
  Another way would be to run a threshold filter and then do this counting on the threshold /output/: the threshold action outputs a monochrome image, so pixels there can only be black or white. That makes it extremely easy to count pixels and flip the colors if the black pixel count happens to be larger than the white pixel count. Just a rough idea, this.)

- A quick google for 'tesseract white text black background' pops up this as the top entry for me: https://stackoverflow.com/questions/39002966/detect-white-characters-on-black-background-using-tesseract
  I did a quick scan of that one; it sounds like it might be worth checking out further.
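
The second idea in the first bullet (threshold first, then count and flip) could be sketched roughly like this. It is a minimal, dependency-free sketch rather than a definitive implementation: the function names `binarize` and `ensure_dark_text_on_light`, the fixed threshold of 128, and the toy image are all assumptions made up for illustration. It also assumes the background covers more pixels than the text, which may not hold for very dense or bold text.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 0 (black) / 255 (white).

    The fixed threshold is an assumption; a real pipeline would likely
    pick it adaptively.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def ensure_dark_text_on_light(binary):
    """Invert a 0/255 image if black pixels outnumber white ones.

    Heuristic: the majority color is taken to be the background, so a
    mostly black image is assumed to hold light text on a dark background.
    """
    black = sum(px == 0 for row in binary for px in row)
    total = sum(len(row) for row in binary)
    if black > total - black:
        return [[255 - px for px in row] for row in binary]
    return binary


# Toy 4x6 "image": bright strokes (200) on a dark background (30).
gray = [
    [30, 30, 30, 30, 30, 30],
    [30, 200, 200, 30, 200, 30],
    [30, 200, 30, 30, 200, 30],
    [30, 30, 30, 30, 30, 30],
]

# After thresholding, black is the majority, so the image gets flipped:
# the strokes end up black and the background white.
result = ensure_dark_text_on_light(binarize(gray))
```

In practice one would probably do the same counting on the array coming out of whatever image library performs the thresholding, and only then hand the result to tesseract.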

HTH

Met vriendelijke groeten / Best regards,

Ger Hobbelt