choiceIterator text is null in some cases


Theodor

Oct 22, 2015, 2:32:58 PM
to tesseract-ocr

I am reading the MRZ (machine-readable zone) of ID cards and passports. Most of the time the OCR is perfect, but sometimes I would like to iterate over the choices in order to fix errors. However, for some images choices are missing, and as far as I have seen it is always one full row. Why? Am I doing something wrong, or is it a bug?
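
For context, one way the per-character choices could be used to fix errors is to test alternative readings against the MRZ check digits. The sketch below is only an illustration under stated assumptions: it uses the standard MRZ check-digit rule (weights 7, 3, 1 repeating; digits count as themselves, A-Z as 10-35, `<` as 0; sum modulo 10). The helper names `mrz_check_digit` and `fix_field` are hypothetical and not part of Tesseract, and the candidate lists stand in for whatever the choice iterator returns.

```python
from itertools import product

DIGITS = "0123456789"


def mrz_char_value(ch):
    """Numeric value of one MRZ character for check-digit purposes."""
    if ch in DIGITS:
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"not an MRZ character: {ch!r}")


def mrz_check_digit(field):
    """Check digit of a field: weights 7, 3, 1 repeating, sum modulo 10."""
    weights = (7, 3, 1)
    return sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field)) % 10


def fix_field(choices, check_digit):
    """Hypothetical helper: `choices` holds one list of candidate characters
    per position (best candidate first). Returns the first combination, in
    itertools.product order, whose check digit matches, or None."""
    for combo in product(*choices):
        candidate = "".join(combo)
        if mrz_check_digit(candidate) == check_digit:
            return candidate
    return None


# Date-of-birth field from the second row of the sample, with a simulated
# misread ('O' ranked above '0') in the second position.
choices = [["8"], ["O", "0"], ["1"], ["0"], ["1"], ["0"]]
print(fix_field(choices, 0))
```

This brute-force search is only practical for short fields with few alternatives, and it depends on every position actually having choices, which is exactly what is missing for the affected row.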

In the example output below, the first row of the image returns no choices at all, as seen at the beginning of the choiceIterator output, even though the row itself is read correctly in the recognized text.


Output

IELVAEA99907431101080<88884<<<
8010100M1702091EST<<<<<<<<<<<2
SPECIMEN<<ANDREW<<<<<<<<<<<<<<


So far so good. The choiceIterator output:

(
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
    ),
        (
        "(81.30%) '8'"
    ),
        (
        "(82.51%) '0'",
        "(75.10%) 'B'",
        "(71.87%) 'O'",
        "(71.62%) 'Q'",
        "(71.30%) 'C'",
        "(68.84%) 'G'"
    ),
        (
        "(89.18%) '1'"
    ),
        (
        "(85.36%) '0'",
        "(77.56%) 'O'"
    ),
        (
        "(86.12%) '1'"
    ),
        (
        "(81.99%) '0'",
        "(74.86%) 'O'",
        "(70.67%) 'Q'",
        "(68.59%) 'B'",
        "(68.47%) 'C'"
    ),
        (
        "(85.11%) '0'",
        "(76.91%) 'O'",
        "(71.51%) 'Q'"
    ),
        (
        "(94.15%) 'M'"
    ),
        (
        "(88.53%) '1'"
    ),
        (
        "(85.22%) '7'"
    ),
        (
        "(80.44%) '0'",
        "(76.15%) 'O'",
        "(69.74%) 'Q'",
        "(69.29%) 'C'",
        "(67.53%) 'B'"
    ),
        (
        "(88.68%) '2'"
    ),
        (
        "(85.94%) '0'",
        "(75.14%) 'B'",
        "(71.71%) 'O'"
    ),
        (
        "(76.29%) '9'"
    ),
        (
        "(89.28%) '1'"
    ),
        (
        "(94.65%) 'E'"
    ),
        (
        "(86.10%) 'S'",
        "(77.95%) '5'"
    ),
        (
        "(92.35%) 'T'"
    ),
        (
        "(81.21%) '<'"
    ),
        (
        "(76.13%) '<'"
    ),
        (
        "(83.40%) '<'"
    ),
        (
        "(85.28%) '<'"
    ),
        (
        "(85.74%) '<'"
    ),
        (
        "(83.62%) '<'"
    ),
        (
        "(83.62%) '<'"
    ),
        (
        "(81.84%) '<'"
    ),
        (
        "(80.28%) '<'"
    ),
        (
        "(82.61%) '<'"
    ),
        (
        "(85.72%) '<'"
    ),
        (
        "(91.66%) '2'"
    ),
        (
        "(82.86%) 'S'",
        "(79.72%) '5'"
    ),
        (
        "(87.99%) 'P'"
    ),
        (
        "(90.25%) 'E'",
        "(75.38%) 'B'"
    ),
        (
        "(73.48%) 'C'",
        "(63.71%) 'E'"
    ),
        (
        "(85.36%) 'I'"
    ),
        (
        "(92.14%) 'M'"
    ),
        (
        "(92.45%) 'E'"
    ),
        (
        "(93.64%) 'N'",
        "(79.42%) 'M'"
    ),
        (
        "(73.11%) '<'"
    ),
        (
        "(72.99%) '<'"
    ),
        (
        "(90.35%) 'A'"
    ),
        (
        "(86.72%) 'N'"
    ),
        (
        "(92.94%) 'D'"
    ),
        (
        "(85.07%) 'R'"
    ),
        (
        "(94.44%) 'E'"
    ),
        (
        "(88.69%) 'W'"
    ),
        (
        "(83.70%) '<'"
    ),
        (
        "(80.63%) '<'"
    ),
        (
        "(75.83%) '<'"
    ),
        (
        "(81.21%) '<'"
    ),
        (
        "(84.20%) '<'"
    ),
        (
        "(84.55%) '<'"
    ),
        (
        "(83.27%) '<'"
    ),
        (
        "(83.06%) '<'"
    ),
        (
        "(81.36%) '<'"
    ),
        (
        "(81.34%) '<'"
    ),
        (
        "(78.78%) '<'"
    ),
        (
        "(80.69%) '<'"
    ),
        (
        "(85.49%) '<'"
    ),
        (
        "(82.61%) '<'"
    )
)


The first row is all NULL. The problem seems to be the two consecutive "1"s on the first row. Using tessedit_dump_choices I can see that two words are detected on the first row, and only one on each of the others. Since the character "1" is narrow, two in a row leave a large gap, which is quite naturally taken for a space. However, when I use two words with a proper space between them, the choiceIterator works as expected again. It seems as if the gap is too large to be ignored, but also too narrow to count as a real space. Any ideas how to solve it? Can I perhaps force Tesseract to treat each row as a single word?
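
To make the suspected mechanism concrete, here is a toy model of gap-based word splitting in a monospaced font. It is not Tesseract's actual algorithm: the cell pitch, glyph widths, and `max_gap` threshold are made-up numbers, and `layout` and `split_words` are hypothetical helpers. It only shows how two adjacent narrow glyphs can produce a gap larger than any other gap in the row.

```python
def layout(text, pitch=20, narrow="1", narrow_width=6, wide_width=16):
    """Place each glyph centered in a fixed-pitch cell.
    Returns a list of (char, left, right) ink boxes. All sizes are assumptions."""
    boxes = []
    for i, ch in enumerate(text):
        width = narrow_width if ch in narrow else wide_width
        left = i * pitch + (pitch - width) // 2
        boxes.append((ch, left, left + width))
    return boxes


def split_words(boxes, max_gap):
    """Start a new word whenever the ink gap between neighbours exceeds max_gap."""
    if not boxes:
        return []
    words = [[boxes[0][0]]]
    for (_, _, prev_right), (ch, left, _) in zip(boxes, boxes[1:]):
        if left - prev_right > max_gap:
            words.append([ch])
        else:
            words[-1].append(ch)
    return ["".join(w) for w in words]


# Part of the first MRZ row from the sample, containing the double "1".
# With these toy sizes the gaps are 4 (wide-wide), 9 (wide-narrow) and
# 14 (narrow-narrow), so only the "11" pair exceeds the threshold of 10.
print(split_words(layout("0743110108"), max_gap=10))
```

In this model the row splits between the two "1"s, which matches the two words reported by tessedit_dump_choices, although the real segmentation logic may differ.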
