Hello,

I am a beginner with Tesseract, and I am facing problems that I hope experienced Tesseract users can offer a simple or obvious solution to.

I am running Tesseract on codes I want to read. I run tesseract.exe with this command line:

    tesseract.exe in.png out configfile

Here is the content of my configfile:

    tessedit_create_boxfile 1
    tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

I run it on images that look like this one: [image]

Most of the time, the characters read and the boxes are OK. But I have identified 3 different issues that happen from time to time.

I - Wrong character read: confusion between '0', 'O' and 'D'.

For example, for this image: [image]

Tesseract gives me: "UFO05D424091"

I am aware that training would improve recognition, but for reasons I don't want to explain here I cannot do that, and I was hoping the recognition engine would work well on such a simple font. Are there any parameters I can set to improve the results? I should add that since D, 0 and O are all likely to appear in the codes, I can't exclude D and O with the whitelist.

II - Threshold artifacts disturb the recognition.

When my threshold operation leaves some black pixels, like in this picture: [image]

The resulting boxes are: [image]

The recognized code is right, but the wrong box is very problematic in my application. I know I could improve my pre-processing, with a morphological operation for example, but I want to know whether there is a setting that would make Tesseract ignore these black pixels. It is strange that Tesseract is not bothered by one character of a word being much bigger than the others.

III - Wrong character segmentation.

While the first 2 problems are understandable, I don't see how this one can happen.

Let's take the first example: [image]

It leads to these boxes: [image]

and the following recognised code: UM050409017.

Here is the second example: [image]

leading to: [image]

and the code is: UAZZO51717151.

How is this possible? The input images are perfectly clear; I don't see the problem. Again, is there a setting that would avoid this?

I hope I am missing something obvious, for at least 1 of my problems. I have to admit that the list of all the possible parameters (which I found here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version) is hard to master, and since I am a beginner I don't know what to do now.

Thanks in advance for your help. I attached an archive containing all the images.

Regards
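For issue II, the pre-processing route mentioned above could be sketched as follows. This is only a minimal illustration, not a Tesseract setting: it uses connected-component area filtering rather than a true morphological opening, it assumes the thresholded image is available as a list of rows with 1 for black and 0 for white, and the name remove_small_blobs and the min_area threshold are hypothetical. It only helps when the artifacts are detached from the characters.

```python
from collections import deque


def remove_small_blobs(grid, min_area):
    """Return a copy of a binary image with small black blobs erased.

    grid: list of rows, 1 = black (ink), 0 = white (background).
    min_area: blobs with fewer black pixels than this are cleared.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    out = [row[:] for row in grid]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one 8-connected blob.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the blob only if it is too small to be a character.
            if len(blob) < min_area:
                for by, bx in blob:
                    out[by][bx] = 0
    return out
```

The threshold would have to be tuned per image size: too low and the speckles survive, too high and thin strokes could be erased.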
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ba001838-4465-4bea-ab83-782af58c2c01%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Thanks for your reply.

I can't really use predefined patterns, since the code pattern and font can change over time.

I like the idea of segmenting the characters myself before giving them to Tesseract one by one, but it looks time-consuming (to code, I mean).

Isn't there any other suitable method? In particular for the 3rd issue, which I think should be easy to solve.
On Wednesday, May 20, 2015 at 12:29:08 PM UTC+2, Dmitri Silaev wrote:
One no-brainer method to try would be turning off all dictionaries and using your own custom "user-patterns" file. Since you mentioned "your application", I suppose you can program, so you can take a look at the comment preceding the read_pattern_list() declaration in "dict/trie.h" for more details.

It seems all your strings are of the same format:

    \A\A\d\d\d\d\d\d\d\d\d\d

(Tess understands only a very limited pattern syntax.)

But if accuracy is critical in your app, in the long run I would absolutely avoid using any part of Tesseract except the character classifier. That is, crop every single character out of your source image and run Tess in the single-character PSM. I think it should be easy as long as the location of every character is quite stable among your source images. ImageMagick/shell scripts would suffice.
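The per-character approach could be planned along these lines. This is a rough sketch under several assumptions: characters sit in equally spaced slots (the left, top, width, height and gap values are made up), the crop file names and the "letters_only" / "digits_only" config files are hypothetical (each would hold a tessedit_char_whitelist line for its class), and "-psm 10" is the single-character mode flag as spelled in Tesseract 3.x (newer releases spell it "--psm"). The code only computes crop boxes and command lines; the actual cropping could be done with ImageMagick, or with any imaging library that accepts (left, top, right, bottom) boxes.

```python
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"
# Only the two classes used in the pattern above are handled here.
CLASS_WHITELISTS = {"A": UPPER, "d": DIGITS}


def parse_pattern(pattern):
    r"""Split a pattern such as \A\A\d\d into its class letters."""
    parts = pattern.split("\\")
    if parts[0] != "" or any(p not in CLASS_WHITELISTS for p in parts[1:]):
        raise ValueError("unsupported pattern: " + pattern)
    return parts[1:]


def char_boxes(left, top, char_width, char_height, count, gap=0):
    """Return (left, top, right, bottom) crop boxes for equal-width slots."""
    boxes = []
    for i in range(count):
        x0 = left + i * (char_width + gap)
        boxes.append((x0, top, x0 + char_width, top + char_height))
    return boxes


def plan_runs(pattern, left, top, char_width, char_height,
              gap=0, psm_flag="-psm"):
    """Pair each character slot with its crop box, whitelist and command."""
    classes = parse_pattern(pattern)
    boxes = char_boxes(left, top, char_width, char_height, len(classes), gap)
    runs = []
    for i, (cls, box) in enumerate(zip(classes, boxes)):
        crop_name = "char_%02d.png" % i          # hypothetical crop file
        # Hypothetical config files, one per character class.
        config = "digits_only" if cls == "d" else "letters_only"
        cmd = ["tesseract", crop_name, "char_%02d" % i,
               psm_flag, "10", config]
        runs.append((box, CLASS_WHITELISTS[cls], cmd))
    return runs
```

Restricting each slot to letters or digits is what would remove the 0/O/D confusion in this scheme, since a digit slot can never return 'O' or 'D'. It relies on the format staying fixed, so the pattern is passed in as a parameter rather than hard-coded.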