Yes, I can pre-process each individual image to make it work, but unfortunately I've been unable to come up with a consistent pre-processing method that would work in general. I've been trying for a while now.
I've known that retraining is an option from the beginning but I'm concerned that it may fix some problems and introduce others. The default eng.traineddata works pretty well except that every once in a while a character is misread.
I've just downloaded and tried vietocr 4 beta, and while it does get this one right, it regrettably still misses quite a few others.
What I really need is a dictionary lookup for every non-word or garbage word tesseract finds that would return the best dictionary match. I'm thinking about writing my own but that would be absurd if tesseract is supposed to already contain this functionality. I understand from Ray's explanation
here that the correct character choice is not ranked high enough to be considered for a dictionary match, and that would make sense if I didn't have an ambigs rule for it. But if I have an explicit unicharambigs rule that says to consider replacing this character with another when looking for a dictionary match, I don't understand how tesseract can still end up preferring a non-word over a dictionary match.
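In case it helps anyone thinking along the same lines, here is a rough sketch of the kind of fallback dictionary lookup I have in mind, using Python's standard difflib. The word list and cutoff are placeholders, not anything Tesseract-specific; a real pass would load a proper dictionary and probably weight substitutions by known confusions (like the ambigs pairs):

```python
import difflib

# Placeholder dictionary; a real post-processor would load a full word list.
DICTIONARY = {"quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def correct_token(token, dictionary=DICTIONARY, cutoff=0.75):
    """Return the closest dictionary match for a misread token,
    or the token unchanged if nothing is close enough."""
    word = token.lower()
    if word in dictionary:
        return token  # already a dictionary word, leave it alone
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(correct_token("brovn"))  # misread 'w' -> 'v'; prints "brown"
print(correct_token("zzz"))    # no close match; prints "zzz" unchanged
```

It is crude (no context, no confusion weighting), which is exactly why I'd rather not reinvent this if Tesseract is already supposed to do it internally.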
I keep thinking I must be missing some obscure config setting. I've already tried tweaking a whole bunch of them from
this list, but to no avail.
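For reference, this is the sort of thing I've been experimenting with. If I've understood the parameter list correctly, there are language-model penalties that are supposed to discourage non-dictionary words, so a config file along these lines (parameter names as I found them in the list; the values here are just guesses on my part, not recommended settings) should in theory push tesseract toward dictionary matches:

```
# Hypothetical config file, e.g. saved as "dict_boost" and passed as:
#   tesseract input.png output dict_boost
# Raising these penalties should make non-dictionary words less attractive
# to the language model (values below are guesses, not tuned defaults).
language_model_penalty_non_dict_word       0.5
language_model_penalty_non_freq_dict_word  0.3
```

Even with variations on these, the non-word still wins in my problem cases, which is what makes me suspect I'm misunderstanding how the penalties interact with the ambigs rules.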