6od instead of God

mison...@gmail.com

Nov 6, 2014, 7:48:00 PM
to tesser...@googlegroups.com
Hello,
I'm using tesseract 3.02 on Windows 7 and I started with the eng.traineddata that was distributed with 3.02.
Tesseract keeps misreading some symbols, specifically 6 instead of G, I-I instead of H and a few others, so I'm getting 6od instead of God, I-Iercules instead of Hercules and so on. I was hoping that using the dictionary would help with this so I wouldn't have to retrain, because after all it's just these few symbols, but nothing seems to help. So far I've tried:

Cranking up the language_model_penalty_non_dict_word and language_model_penalty_non_freq_dict_word values in the config file
Adding "load_system_dawg T" and "load_freq_dawg T" to the config file (even though it's supposed to do that by default)
Adding the 6->G rule to unicharambigs (as "1 6 1 G 0") and recombining. The I-I -> H rule was already there.
Adding the words God and Hercules to the frequent word list and recombining (eng.freq-dawg).
Emptying both the word list (eng.word-dawg) and frequent word list (eng.freq-dawg) and putting just these two words in and recombining, just to see if it would make a difference. It didn't.

Nothing I've done so far has helped, but it seems to me that the point of using the dictionary is to deal with exactly this type of situation, so I feel like I must be missing something. Have I missed a configuration step?

Thanks
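
For readers unfamiliar with the unicharambigs line quoted in the post above ("1 6 1 G 0"), here is a minimal Python sketch of how its fields break down. It assumes the version-1 layout described in the Tesseract 3.x training documentation: source count, source unichars, target count, target unichars, then a type flag where 1 marks a mandatory substitution and 0 an optional one. The names `AmbigRule` and `parse_ambig_line` are made up for illustration and are not part of tesseract.

```python
from typing import NamedTuple, Tuple


class AmbigRule(NamedTuple):
    source: Tuple[str, ...]  # unichars as tesseract recognized them
    target: Tuple[str, ...]  # unichars to consider instead
    mandatory: bool          # True for type 1, False for type 0


def parse_ambig_line(line: str) -> AmbigRule:
    """Split one whitespace-separated unicharambigs rule into its parts.

    Assumed layout: <n_source> <source...> <n_target> <target...> <type>
    """
    fields = line.split()
    n_src = int(fields[0])
    source = tuple(fields[1:1 + n_src])
    n_tgt = int(fields[1 + n_src])
    start = 2 + n_src
    target = tuple(fields[start:start + n_tgt])
    rule_type = int(fields[start + n_tgt])
    return AmbigRule(source, target, rule_type == 1)


# The rule from the post: one source unichar "6", one target unichar "G",
# type 0 (optional substitution).
rule = parse_ambig_line("1 6 1 G 0")
print(rule)
```

Read this way, the rule in the post is an optional one (type 0), so under this reading tesseract treats it as a hint rather than a forced replacement.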

Ryan Dev

Nov 10, 2014, 3:42:15 PM
to tesser...@googlegroups.com
Which page segmentation mode (PSM) are you using? I see the H chopped into |-| especially with PSM_SINGLE_LINE, and I don't think I have ever seen it with PSM_AUTO.

I ran into the same issue in my project, but I know my glyphs never touch or overlap, so I simply disabled chopping altogether. For full-page OCR that would be a bad idea, though.

mison...@gmail.com

Nov 11, 2014, 11:13:53 AM
to tesser...@googlegroups.com
I was indeed in PSM_SINGLE_LINE mode, because my text is already segmented into lines. Changing to PSM_AUTO does help with the I-I issue, but the overall quality is still better with PSM_SINGLE_LINE: with PSM_AUTO I start getting all kinds of punctuation and other errors. I also tried disabling chopping, which led to disastrous results, since my glyphs are not guaranteed not to touch.
I am still perplexed, though, about how tesseract ends up preferring I-Iercules over Hercules when Hercules is a dictionary word and the I-I -> H ambig rule is in place...

ShreeDevi Kumar

Nov 11, 2014, 12:27:52 PM
to tesser...@googlegroups.com
Please attach a copy of the image so that I can try.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2906d195-bc75-4b68-ad97-49f69221d106%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

mison...@gmail.com

Nov 11, 2014, 6:28:41 PM
to tesser...@googlegroups.com
OK, here is a clean example of what I'm talking about. Running vanilla tesseract 3.02.02 on this image (in eng and single line mode) yields "6od's family" instead of "God's family". Adding the 6 -> G rule to unicharambigs made no difference for me.


tmp.tiff

ShreeDevi Kumar

Nov 11, 2014, 9:43:37 PM
to tesser...@googlegroups.com
You need to pre-process the image so that the G shows up correctly. In the attached image the G looks like a 6 because it is connected.

If that is the shape of G in the font and you need to OCR it, you may need to either retrain or post-process the text.

You could also try a newer version of the program.

ShreeDevi

ShreeDevi Kumar

Nov 11, 2014, 9:55:10 PM
to tesser...@googlegroups.com
I checked with vietocr beta4, which uses a newer version of tesseract, and it recognizes your tiff correctly.

ShreeDevi

mison...@gmail.com

Nov 11, 2014, 11:49:14 PM
to tesser...@googlegroups.com
Yes, I can pre-process each individual image to make it work, but unfortunately I've been unable to come up with a consistent pre-processing method that would work in general. I've been trying for a while now.
I've known that retraining is an option from the beginning but I'm concerned that it may fix some problems and introduce others. The default eng.traineddata works pretty well except that every once in a while a character is misread.
I've just downloaded and tried vietocr 4 beta and while it does get this one right it regrettably still misses quite a few others.

What I really need is a dictionary lookup that, for every non-word or garbage word tesseract finds, returns the best dictionary match. I'm thinking about writing my own, but that would be absurd if tesseract is supposed to contain this functionality already. I understand from Ray's explanation here that the correct character choice is not ranked high enough to be considered for a dictionary match, and that would make sense if I didn't have an ambigs rule for it. But if I have an explicit unicharambigs rule that says to consider replacing this character with another when looking for a dictionary match, I don't see how tesseract still ends up preferring a non-word over a dictionary match.
I keep thinking I must be missing some obscure config setting. I've already tried tweaking a whole bunch of them from this list, but to no avail.
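
The "write my own lookup" idea from the post above could look roughly like the Python sketch below. It is a post-processing step that runs on tesseract's text output, not a description of how tesseract's own dictionary or ambigs handling works. The `CONFUSIONS` table holds only the two misreads reported in this thread, the tiny word set stands in for a real word list, and all function names are hypothetical.

```python
import re
from itertools import product

# Misreads reported in this thread: OCR output -> intended text.
CONFUSIONS = {"6": "G", "I-I": "H"}


def candidates(word, confusions=CONFUSIONS):
    """Yield every variant of `word` with any subset of the confusions undone."""
    # Try longer confusion strings first so "I-I" is matched as a unit.
    keys = sorted(confusions, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    parts, pos = [], 0
    for m in pattern.finditer(word):
        parts.append([word[pos:m.start()]])               # untouched text
        parts.append([m.group(), confusions[m.group()]])  # keep or replace
        pos = m.end()
    parts.append([word[pos:]])
    for combo in product(*parts):
        yield "".join(combo)


def correct_word(word, dictionary, confusions=CONFUSIONS):
    """Return a dictionary variant of `word` if one exists, else `word`."""
    if word.lower() in dictionary:
        return word
    for cand in candidates(word, confusions):
        if cand.lower() in dictionary:
            return cand
    return word


def correct_line(line, dictionary):
    """Apply correct_word to every run of letters, digits, and hyphens."""
    return re.sub(r"[A-Za-z0-9-]+",
                  lambda m: correct_word(m.group(), dictionary), line)


words = {"god", "family", "hercules"}  # stand-in for a real word list
print(correct_line("6od's family", words))  # -> God's family
print(correct_line("I-Iercules", words))    # -> Hercules
```

A token is only changed when a substituted variant is a dictionary word, so numbers such as 1966 pass through untouched. The number of variants doubles with each confusable substring in a token, which is fine for ordinary words but worth capping for long strings of digits.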