Configure for single character recognition

Simon Støvring

Nov 14, 2014, 8:42:52 AM

to tesser...@googlegroups.com

Hello,

I am trying to recognize single characters written in the Gotham Bold font. I have trained Tesseract by following Michael Jay Lissner's guide "Adding New Fonts to Tesseract 3 OCR Engine". I trained it using a newspaper article, removed all characters that I am not interested in, and made sure all characters are upper case, as I am not going to match lower case characters.

I run Tesseract with my custom language and with page segmentation mode set to 10, which treats the image as a single character.
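For reference, a single-character run like the one described above could be scripted roughly as follows. This is only a sketch: the language name `gotham` and the image file name are placeholders, the `-psm` spelling assumes a Tesseract 3.x command line (later releases spell it `--psm`), and it assumes a build that accepts `stdout` as the output base. If yours does not, pass a file name base instead and read the resulting `.txt` file.

```python
import subprocess


def build_command(image_path, lang, psm=10, output_base="stdout"):
    """Return the argument list for a Tesseract 3.x command-line call.

    psm=10 asks Tesseract to treat the image as a single character.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "-psm", str(psm)]


def recognize(image_path, lang):
    """Run Tesseract and return the text it printed (may be empty)."""
    result = subprocess.run(
        build_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8").strip()


# Hypothetical usage; "gotham" and "b.png" are placeholder names:
# print(recognize("b.png", "gotham"))
```

An empty string from `recognize` corresponds to the "not matched to anything" cases described below.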

While most of the matches are fine, I am getting a lot of incorrect matches. For example, the below image of the letter "B" is matched as an "X". I cannot figure out why this is.

The "B" below looks the same as the one above but is in fact a different image, and it is not matched to anything. Tesseract does not know what is in the image.

The below "C" is not matched to anything. Tesseract cannot figure out what is on the image.

The same goes for the "U" below.

And it thinks the "E" below is a "K".

The above errors are just examples. There are others, but I think these examples illustrate the quirks I'm currently dealing with.

I manually slice the image below into images of single characters like the ones above. Maybe a completely different approach is better?
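If the letters sit on a regular grid, the slicing step could be automated along these lines. The sketch assumes evenly sized cells and known row and column counts, neither of which is stated in this post. `grid_cells` is a hypothetical helper, and the boxes use the (left, top, right, bottom) convention that image libraries such as Pillow accept for cropping.

```python
def grid_cells(width, height, rows, cols):
    """Map (row, col) to a (left, top, right, bottom) pixel box.

    Assumes the image is an evenly spaced grid of rows x cols cells.
    Integer division keeps the boxes contiguous even when the image
    size is not an exact multiple of the cell count.
    """
    boxes = {}
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes[(r, c)] = (left, top, right, bottom)
    return boxes


# Example: a 100x50 pixel image holding 2 rows of 4 letters.
cells = grid_cells(100, 50, rows=2, cols=4)
```

Because each box is keyed by its row and column, the position of a character stays known even when recognition of that cell fails.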

Does anyone know how I can improve the recognition of single characters? I'd like the above examples to match correctly, but in general the recognition is just not good enough, and I'd like to know if there is any way to improve it. Should I train differently? Should I pass other configurations, or should I process the images before trying to recognize the characters?

Best regards,

Simon B. Støvring

Janusz S. Bien

Nov 14, 2014, 10:00:08 AM

to tesser...@googlegroups.com

What is the meaning of the baseline parameters?

In my output I have e.g.

baseline -0.013 0

baseline -0.003 -18

What does it mean?

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

ShreeDevi Kumar

Nov 14, 2014, 12:24:15 PM

to tesser...@googlegroups.com

Have you tried with the existing english traineddata?

I get good recognition with your 'prepared-image'.

If that is the kind of image you need to OCR, you could run it with psm 6 and then split out each letter separately.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Ryan Dev

Nov 14, 2014, 2:25:14 PM

to tesser...@googlegroups.com

It looks like all your characters are uppercase. If that is not always the case, my experience with per-character OCR in Tesseract is that it cannot handle capitalization properly; that is, it cannot tell whether a letter is a 'c' or a 'C'.

I lay out all my characters in a straight line and get much better results using PSM line mode.

Simon Støvring

Nov 15, 2014, 5:09:52 AM

to tesser...@googlegroups.com

I have tried with the English traineddata and got similar results. However, I had not tried recognizing the entire 'prepared-image' with psm 6 and I see that gives pretty good results.

The thing is, I need to know the location of each character, that is, which row and column it is in. If Tesseract fails to recognize a single letter when recognizing the entire image, I have no way of knowing which letter is missing, and therefore I cannot determine the location of any of the letters.

Simon Støvring

Nov 15, 2014, 5:11:31 AM

to tesser...@googlegroups.com

The letters will always be uppercase, so capitalization is not really an issue.

I can try to lay out the letters in a straight line and use the line mode. However, I need to know the location of each character, that is, which row and column it is in. If Tesseract fails to recognize a single letter when recognizing the entire straight-line image, I have no way of knowing which letter is missing, and therefore I cannot determine the location of any of the letters.

ShreeDevi Kumar

Nov 15, 2014, 5:17:07 AM

to tesser...@googlegroups.com

Take a look at the hOCR output and the TSV option from https://code.google.com/r/email-hocr-tsv/
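As a rough illustration of how hOCR output can answer the row and column question: hOCR records a pixel bounding box for each recognized word, so each result can be mapped to a grid cell by its position rather than by its order in the text. The sketch below makes several assumptions: each letter is reported as its own `ocrx_word` element, the grid cells are a known, fixed size (50 pixels here, chosen arbitrarily), and the sample markup is hand-written for the example rather than real Tesseract output. `HocrWords` and `locate` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, bbox) for every element with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._text = []
        self._depth = 0     # nesting depth inside the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # nested markup such as <strong>
            return
        if "ocrx_word" in (attrs.get("class") or "").split():
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []
                self._depth = 1

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def locate(hocr, cell_w, cell_h):
    """Map (row, col) to recognized text using each bbox centre."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    grid = {}
    for text, (x0, y0, x1, y1) in parser.words:
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
        grid[(cy // cell_h, cx // cell_w)] = text
    return grid


# Hand-written sample: the letter in row 0, column 1 was not recognized.
SAMPLE = (
    "<span class='ocrx_word' id='word_1_1' title='bbox 12 10 38 40; x_wconf 90'>B</span>"
    "<span class='ocrx_word' id='word_1_2' title='bbox 112 10 138 40; x_wconf 88'>"
    "<strong>E</strong></span>"
    "<span class='ocrx_word' id='word_1_3' title='bbox 62 60 88 90; x_wconf 91'>U</span>"
)
grid = locate(SAMPLE, cell_w=50, cell_h=50)
```

In the sample, the missing letter simply leaves cell (0, 1) absent from `grid`; the positions of the other letters are unaffected. With a Tesseract 3.x command line, hOCR output is typically requested by appending the `hocr` config name to the command.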

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Simon Støvring

Nov 15, 2014, 6:43:28 AM

to tesser...@googlegroups.com

That is exactly what I needed. Thank you.
