Single symbol recognition

Sven Teum

Jul 28, 2016, 8:57:15 AM

to tesseract-ocr

To compare several ways of improving character recognition, I split my image into individual characters and send them to Tesseract one at a time (the characters are fixed width).

I set the page segmentation mode to '10' (treat the image as a single character), run each character image through Tesseract, and join the results. This gives better accuracy than loading the entire image.
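As a rough illustration of this per-character workflow (a sketch, not the exact code I run), the Python below builds one Tesseract call per character image and joins the results. The function names, the `missing` placeholder, and the assumption that each character has already been saved as its own image file are hypothetical. It also assumes the `tesseract` binary is on the PATH and that the build accepts `--psm`; older 3.x builds spell the option `-psm`.

```python
import subprocess
from typing import Callable, Iterable, List


def tesseract_cmd(image_path: str, psm: int = 10) -> List[str]:
    """Build a command line that prints the OCR result to stdout."""
    # Assumption: "--psm" is accepted; older 3.x builds use "-psm" instead.
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def run_tesseract(image_path: str) -> str:
    """Recognize one character image; needs the tesseract binary on PATH."""
    result = subprocess.run(
        tesseract_cmd(image_path), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def recognize_cells(
    paths: Iterable[str],
    recognize: Callable[[str], str] = run_tesseract,
    missing: str = " ",
) -> str:
    """Recognize each character image separately and join the results in order."""
    # An empty result (nothing recognized) is replaced by `missing` so that
    # the output keeps one position per input cell.
    return "".join((recognize(p) or missing)[:1] for p in paths)
```

Passing the recognizer as a parameter makes it easy to try the joining logic with a stub before involving Tesseract, and setting `missing` to a visible marker such as `?` shows which cells came back empty.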

The problem is that some symbols, for example ':' and '-', are not recognized at all. This can be reproduced by loading the attached image into Tesseract.

If I instead load the full line that contains the ':' symbol, the symbol is recognized, but other accuracy problems appear.

I would like to know whether I can tweak the configuration so that these symbols are recognized as single characters.

OS: Windows 10

Output of Tesseract -v:

tesseract 3.05.00dev
leptonica-1.73
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0

(Note: I've also posted this issue on Stack Overflow, with no responses so far: http://stackoverflow.com/questions/38607576/single-symbol-recognition-in-tesseract)

Zdenko Podobný

Jul 28, 2016, 10:21:05 AM

to tesser...@googlegroups.com

IMO some characters (e.g. oOsSzZwW, and in my experience also ,.:-) can be recognized correctly only within a wider context (word, line, maybe paragraph).

Maybe you can give us a longer example of the text you are trying to OCR, so that somebody can give you an extra hint.

Zdenko

Sven Teum

Jul 29, 2016, 2:45:46 AM

to tesseract-ocr

The reason for recognizing one character at a time is that I was trying different approaches, as I mentioned above. I can divide the image this way because the characters have a fixed width and height.

Some of my results are:
* Full image: problems appear when letters and numbers are mixed together. Disabling the dictionary has no effect.
* Divide image into lines: the problems with mixed letters and numbers remain.
* Divide image into lines and add spacing between characters: mixed letters and numbers are now recognized correctly (the issue above is fixed), but the spacing is not recognized well, even though the font is fixed width. Using the variable preserve_interword_spaces does not have the desired effect, because the detected spaces are not regular. For example, even when the spacing between characters is identical, the OCR output can contain 7, 8 or 9 spaces between characters, so the spacing can't be fixed afterwards.
* Divide image into characters: characters and spacing are recognized well, but some symbols (":", "-") are not recognized, ...
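For the last approach, the division into characters can be sketched as below. This is only an illustration under the assumption that the text starts at the top-left corner of the image and that the cell size is known; the function name `cell_boxes` and its parameters are hypothetical. The boxes use the (left, top, right, bottom) order, which is also the order that Pillow's `Image.crop` expects.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def cell_boxes(width: int, height: int, cell_w: int, cell_h: int) -> List[List[Box]]:
    """Return one list of boxes per text row for a fixed-size character grid."""
    rows: List[List[Box]] = []
    # Step through the image one cell at a time; partial cells at the
    # right or bottom edge are skipped.
    for top in range(0, height - cell_h + 1, cell_h):
        row = [
            (left, top, left + cell_w, top + cell_h)
            for left in range(0, width - cell_w + 1, cell_w)
        ]
        rows.append(row)
    return rows


if __name__ == "__main__":
    # A 30x40 image with 10x20 cells gives 2 rows of 3 characters each.
    for row in cell_boxes(30, 40, 10, 20):
        print(row)
```

Because every row yields the same number of boxes, the column position of each character is known up front, which is why spacing is not a problem with this approach.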

Thanks.
