Any suggestions on pre-processing to improve accuracy?


Traun Leyden

Jun 20, 2014, 11:51:19 AM
to tesser...@googlegroups.com

I'm wondering how I can get better results with Tesseract.  

Here are a few images I've been testing with + results:


Actual OCR text: VCZZSWE
Expected OCR text: VC22500E


Image 
Actual OCR text: ViZZSWE DRIVEWAY
Expected OCR text: VC22500E DRIVEWAY


Any tips on doing pre-processing on the images to improve the recognition?

The code I'm using to call tesseract (via go-tesseract) is here: 

  https://github.com/tleyden/open-ocr/blob/master/tesseract_engine.go#L49-L53

Version: I'm using the tesseract-ocr-eng package from Debian Jessie, which appears to be version 3.02-2. (The full build script is available in this Dockerfile.)
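
One way to tell whether a pre-processing step actually helps is to score each OCR result against the expected text. The sketch below is not part of open-ocr or go-tesseract: `edit_distance` and `char_error_rate` are hypothetical helper names, and Python is used only for brevity. It computes the Levenshtein distance for the two samples above.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(actual, expected):
    """Edit distance normalised by the length of the expected text."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return edit_distance(actual, expected) / len(expected)


if __name__ == "__main__":
    samples = [
        ("VCZZSWE", "VC22500E"),
        ("ViZZSWE DRIVEWAY", "VC22500E DRIVEWAY"),
    ]
    for actual, expected in samples:
        print(actual, "->", edit_distance(actual, expected), "edits")
```

A lower score on the same image after a given pre-processing step suggests that the step is helping.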



    Nick White

    Jun 20, 2014, 3:08:50 PM
    to tesser...@googlegroups.com
    Hi Traun,

    > Any tips on doing pre-processing on the images to improve the
    > recognition?

    The place to start would be here:

    https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality

    Nick

    Traun Leyden

    Jun 20, 2014, 5:57:11 PM
    to tesser...@googlegroups.com


    Thanks, this is really useful.  (and shame on me for not RTFM'ing a bit more first)

    That document mentions to make sure the orientation/skew is straight, but does not give any hints on how to actually do this in an automated fashion.  Any tips?

    Robert Komar

    Jun 22, 2014, 2:07:33 PM
    to tesser...@googlegroups.com
    On Fri, 20 Jun 2014, Traun Leyden wrote:

    > Thanks, this is really useful. (and shame on me for not
    > RTFM'ing a bit more first)
    > That document mentions to make sure the orientation/skew
    > is straight, but does not give any hints on how to
    > actually do this in an automated fashion. Any tips?

    You can use imagemagick's "convert" utility to deskew
    images. For example:

    > convert <skewed_image> -deskew 40% <deskewed_image>

    It works pretty well for text-only images. Embedded
    images within the text tend to mess it up, though.

    Rob Komar
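
For a batch of images, the same ImageMagick call can be scripted. This is a minimal sketch, assuming ImageMagick's `convert` is on the PATH; `deskew_command` and `deskew` are hypothetical helper names, and the `40%` default follows the threshold that the ImageMagick documentation suggests for most images.

```python
import subprocess
from pathlib import Path


def deskew_command(src, dst, threshold="40%"):
    """Build the argument list for: convert SRC -deskew THRESHOLD DST."""
    return ["convert", str(src), "-deskew", threshold, str(dst)]


def deskew(src, dst, threshold="40%"):
    """Run ImageMagick on one image.

    Raises subprocess.CalledProcessError if convert exits non-zero.
    """
    subprocess.run(deskew_command(src, dst, threshold), check=True)
    return Path(dst)
```

Calling `deskew("scan.png", "scan-deskewed.png")` (file names are placeholders) writes the straightened image, which can then be passed to Tesseract in place of the original.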

    Traun Leyden

    Jun 23, 2014, 11:45:40 AM
    to tesser...@googlegroups.com
    Thanks, I will definitely use that.

    One more thing that document should have is a mention of Stroke Width Transform to improve OCR recognition on images that have a lot of non-text content.

    Here's an example of SWT in action.





    --
    You received this message because you are subscribed to a topic in the Google Groups "tesseract-ocr" group.
    To unsubscribe from this topic, visit https://groups.google.com/d/topic/tesseract-ocr/tjY0jZsopwA/unsubscribe.
    To unsubscribe from this group and all its topics, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
    To post to this group, send email to tesser...@googlegroups.com.
    Visit this group at http://groups.google.com/group/tesseract-ocr.
    To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/alpine.LNX.2.02.1406221104320.7820%40robpc4.home.org.

    For more options, visit https://groups.google.com/d/optout.

    Nick White

    Jun 26, 2014, 4:02:52 PM
    to tesser...@googlegroups.com
    On Mon, Jun 23, 2014 at 08:32:52AM -0700, Traun Leyden wrote:
    > One more thing that document should have is a mention of Stroke Width Transform
    > to improve OCR recognition on images that have a lot of non-text content.

    Oh cool, that looks great! I definitely will add that to the wiki
    page soon, thanks a lot for pointing me to it.

    Nick

    Traun Leyden

    Jun 26, 2014, 4:17:16 PM
    to tesser...@googlegroups.com

    Btw, I originally came across it via Project Naptha, which uses Tesseract for its underlying OCR.

