What I trained tesseract with (that's the "V" letter) : http://i.imgur.com/NbmVqkb.png (segments are all linked)
What I feed tesseract with : http://i.imgur.com/0E4iXXk.png (some segments are linked, some aren't)
Hi,
I wonder if it has to do with the size of the characters in the image you are using for font training. I swapped out the character without the linked segments for a character in a set I am using, and it seemed to work OK. The set is too big for the list, but I have attached the image I used.
art
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
tesseract-oc...@googlegroups.com.
To post to this group, send email to
tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/451dbd65-20b7-437a-8b5b-a0a726bdad06%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hi,
I am guessing my attachment didn’t make it to the list, but the character I used is about 17x25 pixels. I resaved the sample as a PNG (instead of a TIFF) and am trying again. Remember that you can (and often have to) edit the box files for training. Tesseract may split your character into more than one blob, but you can override this. By default, “makebox” produced:
l 45 254 53 279 0
’ 55 267 62 277 0
But I modified this to be:
V 45 254 62 279 0
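If you have many characters to fix, the same edit can be scripted. Here is a minimal sketch, assuming each box line is `glyph left bottom right top page` with the origin at the bottom-left of the image; the helper names `parse_box_line` and `merge_boxes` are my own and not part of Tesseract:

```python
def parse_box_line(line):
    """Split one box-file line into (glyph, left, bottom, right, top, page)."""
    # rsplit keeps the glyph intact whatever character it is.
    glyph, left, bottom, right, top, page = line.rsplit(" ", 5)
    return glyph, int(left), int(bottom), int(right), int(top), int(page)


def merge_boxes(lines, glyph):
    """Replace several box lines with one line covering all of them."""
    boxes = [parse_box_line(line) for line in lines]
    pages = {box[5] for box in boxes}
    if len(pages) != 1:
        raise ValueError("boxes to merge must be on the same page")
    left = min(box[1] for box in boxes)
    bottom = min(box[2] for box in boxes)
    right = max(box[3] for box in boxes)
    top = max(box[4] for box in boxes)
    return f"{glyph} {left} {bottom} {right} {top} {pages.pop()}"


# The two blobs makebox produced for the segmented "V".
merged = merge_boxes(["l 45 254 53 279 0", "’ 55 267 62 277 0"], "V")
print(merged)  # V 45 254 62 279 0
```

The merged box is just the smallest rectangle that encloses both pieces, which is what I typed in by hand above.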
I found this blog post really helpful for training [1]. You can contact me off-list if you want the entire training set I used, but I only did the one character.
art
---
1. http://michaeljaylissner.com/blog/adding-new-fonts-to-tesseract-3-ocr-engine
Could you attach the “my_font_exp0.png” and “my_font_exp0.box” that are producing the “Empty page!!” message?
When tesseract can’t find a matching blob, it gets trickier, but at least it is working with something. I am guessing some of the gaps between segments are too wide for the segments to be treated as a single character. I tried a few different sizes, but I couldn’t get the “B” recognized. I wonder if OpenCV might be a better route if the source of the characters is fairly static. There’s an example here of using OpenCV with handwritten numbers [1].
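To picture what I mean by the gaps, here is a toy sketch that counts 8-connected groups of pixels in a tiny hand-drawn "V". It is only an illustration of connected components, not how Tesseract actually finds blobs, and the grids and the `count_blobs` helper are made up for this example:

```python
def count_blobs(rows):
    """Count 8-connected groups of '#' cells in a list of equal-length strings."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    seen = set()
    blobs = 0
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or (y, x) in seen:
                continue
            # New blob found: flood-fill everything connected to it.
            blobs += 1
            seen.add((y, x))
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and rows[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return blobs


# Segments all linked, like the training image.
linked = ["#...#",
          "#...#",
          ".#.#.",
          ".#.#.",
          "..#.."]

# Same shape with a one-row gap between the segments.
gapped = ["#...#",
          "#...#",
          ".....",
          ".#.#.",
          "..#.."]

print(count_blobs(linked))  # 1
print(count_blobs(gapped))  # 3
```

With the gap, the one character turns into three separate pieces, which would explain why the trained "V" no longer matches.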
art
---
Well, the good news is that tesseract tells you during training what it can and cannot work with.

I'd be tempted to use the gaps in the line segments to break apart the letters. For example, instead of training "C" as a whole, train the top part as something like "r" and the bottom part as another unique character, and then put them together in post-OCR processing. I'd separate the "X" in the same way.

The other option, and the one I would investigate where the segment gap doesn't go all the way across the letter (for example, on the "B"), is to scale the image down to the point that tesseract treats the blob as a single character. This makes for a painstaking process, to be sure, but I think it could work.

I should note that you can configure settings for more flexibility in blob detection [1], but that's beyond anything I have ever done. I have tried OpenCV for pattern detection (I wouldn’t call it OCR), and it seems very powerful, but I haven’t used it enough to say whether it is the right hammer in this case.
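For the split-and-rejoin idea, the post-OCR step could look roughly like this. It is a sketch under several assumptions: the recognized pieces have already been read into dicts with box-file coordinates (origin at the bottom-left), stacked pieces overlap horizontally, and the `("r", "L") -> "C"` mapping is a placeholder I invented, since I haven't trained anything like it:

```python
# Hypothetical table: (upper piece, lower piece) -> the real letter.
PIECES_TO_LETTER = {("r", "L"): "C"}


def horizontally_overlap(a, b):
    """True if the two boxes share any horizontal extent."""
    return a["left"] <= b["right"] and b["left"] <= a["right"]


def join_pieces(boxes):
    """Rebuild text from recognized pieces, merging known stacked pairs."""
    ordered = sorted(boxes, key=lambda box: box["left"])
    out = []
    i = 0
    while i < len(ordered):
        cur = ordered[i]
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if nxt is not None and horizontally_overlap(cur, nxt):
            # The piece with the larger "bottom" value sits higher on the page.
            upper, lower = sorted((cur, nxt), key=lambda box: box["bottom"],
                                  reverse=True)
            letter = PIECES_TO_LETTER.get((upper["glyph"], lower["glyph"]))
            if letter is not None:
                out.append(letter)
                i += 2
                continue
        out.append(cur["glyph"])
        i += 1
    return "".join(out)


pieces = [
    {"glyph": "r", "left": 10, "bottom": 30, "right": 25, "top": 45},
    {"glyph": "L", "left": 10, "bottom": 10, "right": 25, "top": 25},
    {"glyph": "V", "left": 30, "bottom": 10, "right": 47, "top": 45},
]
print(join_pieces(pieces))  # CV
```

Pairs that are not in the table pass through unchanged, so letters that tesseract already recognizes whole are left alone.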
art
---
1. https://code.google.com/p/tesseract-ocr/wiki/ControlParams