DO YOU HAVE ANY IDEA HOW TO IMPROVE MY OUTPUT??!!


Claudi Ruiz

May 21, 2015, 7:48:41 AM
to tesser...@googlegroups.com

Goal: Improve the Tesseract output as much as possible.
Difficulties: varying character sizes and poor image quality.
Already done: binarization, dilation and erosion.

DO YOU HAVE ANY IDEA HOW TO IMPROVE MY OUTPUT??!!
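For readers following along, the three preprocessing steps listed above (binarize, dilate, erode) can be sketched roughly as below. This is only an illustration in plain Python on a tiny hand-made grid, not the code actually used for result.jpg; the threshold of 128, the 3x3 neighbourhood and the convention that 1 means ink are all assumptions. In practice an image library would do this work.

```python
# Pure-Python sketch of binarize -> dilate -> erode on a tiny grayscale grid.
# Assumed convention: 1 = ink (dark pixel), 0 = paper (light pixel).

def binarize(gray, threshold=128):
    """Mark pixels darker than `threshold` as ink (1), everything else as paper (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def _window(img, y, x):
    """Return the values in the 3x3 window around (y, x), clipped at the image border."""
    h, w = len(img), len(img[0])
    return [img[ny][nx]
            for ny in range(max(0, y - 1), min(h, y + 2))
            for nx in range(max(0, x - 1), min(w, x + 2))]

def dilate(img):
    """Grow ink: a pixel becomes ink if any pixel in its 3x3 window is ink."""
    return [[max(_window(img, y, x)) for x in range(len(img[0]))]
            for y in range(len(img))]

def erode(img):
    """Shrink ink: a pixel stays ink only if every pixel in its 3x3 window is ink."""
    return [[min(_window(img, y, x)) for x in range(len(img[0]))]
            for y in range(len(img))]

W, K = 255, 0  # white paper, black ink
# A horizontal stroke in the middle row with a one-pixel gap in it.
gray = [
    [W] * 9,
    [W] * 9,
    [W, W, K, K, W, K, K, W, W],
    [W] * 9,
    [W] * 9,
]

binary = binarize(gray)
closed = erode(dilate(binary))  # dilate then erode bridges small gaps

for row in closed:
    print("".join("#" if v else "." for v in row))
```

With this 3x3 window the one-pixel gap in the stroke is bridged. Wider gaps, or characters that are missing altogether, are not recovered by this kind of step.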



result.jpg

Allistair

May 21, 2015, 8:57:45 AM
to tesser...@googlegroups.com
Is there any specific thing you are trying to get out of the result, or just everything? I ask because the source receipt itself is not great: many characters are invisible or rubbed out. There's only so much you can do with OCR. If your source is poor from a character point of view, rather than a lighting point of view, then you will be fighting it.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9832a971-f35f-4e36-9ea2-98db720c08dd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Claudi Ruiz

May 22, 2015, 5:05:51 AM
to tesser...@googlegroups.com
Thank you, Allistair, for your answer. It's true that the characters are not completely visible. I will post another example where the characters are visible but Tesseract still doesn't work properly, possibly because my image preprocessing is not the best.

CodeBreaker

Sep 27, 2016, 10:13:46 AM
to tesseract-ocr
For receipts, I have tried all the psm options so far. The best is psm 6, combined with resizing the image resolution. But for the rubbed-out characters, there is hardly anything Tesseract can do. Correct me if I'm wrong. :-)
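As a rough sketch of the psm 6 suggestion: page segmentation mode 6 tells Tesseract to assume a single uniform block of text. The snippet below only builds and (optionally) runs the command line. The file name `receipt.png` and both helper names are made up for illustration. The flag is spelled `--psm` in Tesseract 4 and later and `-psm` in 3.x, hence the `legacy` switch.

```python
import subprocess

def build_tesseract_cmd(image_path, psm=6, legacy=False):
    """Build a tesseract command line that prints the recognized text to stdout.

    `legacy=True` uses the single-dash `-psm` spelling from Tesseract 3.x;
    otherwise the `--psm` spelling from Tesseract 4+ is used.
    """
    flag = "-psm" if legacy else "--psm"
    return ["tesseract", image_path, "stdout", flag, str(psm)]

def run_tesseract(image_path, psm=6, legacy=False):
    """Run tesseract (must be installed and on PATH) and return the recognized text."""
    cmd = build_tesseract_cmd(image_path, psm=psm, legacy=legacy)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Hypothetical file name, shown only to illustrate the resulting command.
print(" ".join(build_tesseract_cmd("receipt.png")))
```

Trying other modes is then a matter of looping over `psm` values and comparing the returned text by eye.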

Allistair

Sep 27, 2016, 4:17:58 PM
to tesser...@googlegroups.com
You're spot on. Tesseract is not going to invent missing pixels for you, and neither is dilation/erosion preprocessing. You have to start with some pixels to stand any chance of success, and your receipt has, as you note, many areas of rubbed-out characters.

Perhaps in some future world there will be a way of using ML techniques to train a system on rubbed-out receipts so that it can posit what words with rubbed-out characters are likely to be.
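To make the idea concrete, here is a much simpler stand-in that is not ML at all: a plain vocabulary lookup. It assumes the unreadable characters have already been marked with `?` and that a list of words expected on receipts exists. Both the marker and the word list are made up for this example.

```python
import re

def candidates(pattern, vocabulary, unknown="?"):
    """Return the vocabulary words that fit `pattern`.

    Each `unknown` character in `pattern` stands for exactly one unreadable
    character; every other character must match (case-insensitively).
    """
    regex = re.compile(
        "".join("." if ch == unknown else re.escape(ch) for ch in pattern),
        re.IGNORECASE,
    )
    return [word for word in vocabulary if regex.fullmatch(word)]

# Hypothetical vocabulary of words one might expect on a receipt.
RECEIPT_WORDS = ["TOTAL", "SUBTOTAL", "CASH", "CHANGE", "TAX"]

print(candidates("T?T?L", RECEIPT_WORDS))
print(candidates("C??H", RECEIPT_WORDS))
```

A trained model would go further by ranking candidates using context, but even a lookup like this only works when enough of each word survives to narrow down the options.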
