Comparing GetComponentImages to iterate_level

T G

Dec 30, 2016, 6:21:49 AM

to tesseract-ocr

I'm trying to learn Python in parallel with the Tesseract API. My end goal is to use the Tesseract API to read a document and do some basic error checking. I've found a few examples that seem like good places to start, but I'm having trouble understanding the difference between two pieces of code that behave differently even though it seems to me they should be equivalent. Both were modified slightly from https://pypi.python.org/pypi/tesserocr.

The first example produces this output:

$ time ./GetComponentImagesExample2.py|tail -2
symbol MISSISSIPPI,conf: 88.3686599731


real    0m14.227s
user    0m13.534s
sys     0m0.397s

This is accurate and completes in 14 seconds. Reviewing the rest of the output, it is pretty good -- I'm probably a few SetVariable commands away from 99+% accuracy.

$ ./GetComponentImagesExample2.py|wc -l
    1289

Manually reviewing the results, it appears to get all the text.

#!/usr/bin/python
from PIL import Image
Image.MAX_IMAGE_PIXELS=1000000000
from tesserocr import PyTessBaseAPI, RIL, iterate_level

image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    api.SetVariable("save_blob_choices","T")
    ri=api.GetIterator()
    level=RIL.WORD
    boxes = api.GetComponentImages(RIL.WORD, True)
    print 'Found {} word image components.'.format(len(boxes))
    for r in iterate_level(ri, level):
        symbol = r.GetUTF8Text(level)
        conf = r.Confidence(level)
        if symbol:
            print u'symbol {},conf: {}\n'.format(symbol,conf).encode('utf-8')

The second example produces this output.

$ time ./GetComponentImagesExample4.py|tail -4
symbol MISSISS IPPI
,conf: 85


real    0m17.524s
user    0m16.600s
sys     0m0.427s

This is less accurate (extra space detected in a word) and slower (takes 17.5 seconds).

$ ./GetComponentImagesExample4.py|wc -l
     223

This is missing a large amount of text (223 lines of output versus 1289 for the first example), and I don't understand why it misses so much.

#!/usr/bin/python
from PIL import Image
Image.MAX_IMAGE_PIXELS=1000000000
from tesserocr import PyTessBaseAPI, RIL

image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    api.SetVariable("save_blob_choices","T")
    boxes = api.GetComponentImages(RIL.WORD, True)
    print 'Found {} textword image components.'.format(len(boxes))
    for i, (im, box, _, _) in enumerate(boxes):
        api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
        ocrResult = api.GetUTF8Text()
        conf = api.MeanTextConf()
        if ocrResult:
            print u'symbol {},conf: {}\n'.format(ocrResult,conf).encode('utf-8')
#        print (u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
#               "confidence: {1}, text: {2}").format(i, conf, ocrResult, **box).encode('utf-8')

My end goal relies on knowing where text is found in the document, so I need bounding boxes like the ones the second example provides. As near as I can tell, iterate_level doesn't expose the coordinates of the found text, so I need GetComponentImages, but the output is not equivalent.

Why do these pieces of code behave differently in accuracy? Can I get GetComponentImages to match GetIterator?

Thanks for any help.

T G

Jan 2, 2017, 7:25:39 AM

to tesseract-ocr

I've continued to spend a little time each day working on my problem. I've found something that fuels my desire to understand what GetComponentImages does differently from iterate_level.

from PIL import Image
Image.MAX_IMAGE_PIXELS=1000000000
from tesserocr import PyTessBaseAPI, RIL

image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    api.SetVariable("save_blob_choices","T")
    boxes = api.GetComponentImages(RIL.WORD, True)
    print 'Found {} textword image components.'.format(len(boxes))
    print enumerate(boxes)
    for i, (im, box, _, _) in enumerate(boxes):
        # api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
        api.SetRectangle(int(box['x'])-8, int(box['y'])-8, int(box['w'])+16, int(box['h'])+16)
        ocrResult = api.GetUTF8Text()
        conf = api.MeanTextConf()
        thresholdedImage = api.GetThresholdedImage()
        thresholdedImage.save('/Users/chrysrobyn/tess-install/tesseract/scan_2_new_piece'+str(i)+str(ocrResult)+'.tif')
        if ocrResult:
            print u'symbol {},conf: {}\n'.format(ocrResult,conf).encode('utf-8')

The lines I added for debugging are the padded SetRectangle call and the GetThresholdedImage save. 1) I started saving the boxed images to see what they had in common. The boxes pick out words that my Example2.py script (the first example in my original post) recognizes. Unfortunately, api.GetUTF8Text() returns nothing for the vast majority of these boxes (in the second example from my original post as well as this modified one), whereas r.GetUTF8Text() in the first example picks them up with high confidence. 2) I started making the boxes bigger. This has yielded some improvement, but nothing drastic.
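On the bigger-boxes experiment: subtracting a fixed margin from x and y can produce negative coordinates for words near the page edge. Below is a minimal sketch of padding a box while clamping it to the image bounds. It assumes the box is a dict with 'x', 'y', 'w' and 'h' keys, as in the GetComponentImages loop above; pad_box and its parameters are names made up for this illustration and are not part of tesserocr.

```python
def pad_box(box, pad, image_width, image_height):
    """Grow an x/y/w/h box by `pad` pixels per side, clamped to the image.

    Hypothetical helper for illustration; not part of tesserocr.
    """
    left = max(0, box['x'] - pad)
    top = max(0, box['y'] - pad)
    right = min(image_width, box['x'] + box['w'] + pad)
    bottom = min(image_height, box['y'] + box['h'] + pad)
    return {'x': left, 'y': top, 'w': right - left, 'h': bottom - top}


# A word 3 px from the top-left corner of a 100x50 image, padded by 8 px.
padded = pad_box({'x': 3, 'y': 3, 'w': 20, 'h': 10}, 8, 100, 50)
print(padded)
```

The result could then be passed on as api.SetRectangle(padded['x'], padded['y'], padded['w'], padded['h']), with the image width and height taken from PIL's image.size. Whether the clamping changes the recognition results here is untested.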

I'm again left struggling to understand: What can I do to GetComponentImages / SetRectangle / GetUTF8Text to make it match GetIterator / iterate_level / GetUTF8Text?

T G

Jan 2, 2017, 11:15:59 AM

to tesseract-ocr

I'm still hoping to learn how to use GetComponentImages / SetRectangle better, but I found a workaround to get what I need out of GetIterator / iterate_level. I can't find documentation for BoundingBoxInternal, but I saw a reference to it and decided to see if I could get it to work. I now get bounding boxes from my faster and more accurate method.

#!/usr/bin/python
from PIL import Image
Image.MAX_IMAGE_PIXELS=1000000000
from tesserocr import PyTessBaseAPI, RIL, iterate_level

image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    api.SetVariable("save_blob_choices","T")
    ri=api.GetIterator()
    level=RIL.WORD
    for r in iterate_level(ri, level):
        print r.BoundingBoxInternal(level)
        symbol = r.GetUTF8Text(level)
        conf = r.Confidence(level)
        if symbol:
            print u'symbol {},conf: {}\n'.format(symbol,conf).encode('utf-8')
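A bounding box from the iterator can be reshaped into the x/y/w/h form that GetComponentImages uses, so downstream code only has to handle one layout. The sketch below assumes BoundingBoxInternal returns a (left, top, right, bottom) tuple, or None when no box is available, mirroring Tesseract's C++ PageIterator; that return shape is an assumption here, and to_xywh is a hypothetical helper name.

```python
def to_xywh(bbox):
    """Convert (left, top, right, bottom) to a dict with x, y, w, h keys.

    Assumes the tuple layout described above; returns None if bbox is None.
    Hypothetical helper, not part of tesserocr.
    """
    if bbox is None:
        return None
    left, top, right, bottom = bbox
    return {'x': left, 'y': top, 'w': right - left, 'h': bottom - top}


print(to_xywh((10, 20, 110, 45)))
```

Inside the loop above this would be called as to_xywh(r.BoundingBoxInternal(level)).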
