[4.00] Extra symbols produced


estel...@gmail.com

Mar 1, 2019, 4:07:32 AM
to tesseract-ocr
Gday.

I'm using Tesseract 4.00, compiled from the release source on Linux, with the LSTM engine.

My pages are rendered from PDFs with Ghostscript at 300 dpi, then converted to greyscale with OpenCV.

I've found an issue where the OCR output for some specific regions contains more symbols than are actually in the image.

Example: there is a standalone "word" containing "15" (it is actually part of a date, like "15 OCT", which is identified as two words - that part is correct).
The box coordinates are correct and no other symbol fits inside the box, but the output from running tesseract .. --psm 11 --dpi 300 is "156" (instead of "15").

If I cut that part out of the image, save it as a separate file, and then OCR it with psm=6 (or 7), the output is "15" (correct).

I have seen this behavior only with a few symbol combinations, like "15"->"156" and "08"->"0O8". It looks as if, when the confidence levels of the top two candidate symbols are very close, both symbols go to the output instead of just one.

Has anyone else run into this?

Lorenzo Bolzani

Mar 1, 2019, 4:46:12 AM
to tesser...@googlegroups.com
Yes, I have the same problem: some characters are split, and sometimes a single character even comes out as three ("O0O", for example).



I wrote some fairly complex code to try to limit the problem (with psm 13). The idea is this:

Process each symbol individually with the iterator:
 - add the symbol to the current group
 - check whether the group can be closed
 - if it can, pick the best symbol or symbols from it and add them to the result, leaving the rest for the next check.

The criteria for "closing" a group are based on the distance between symbols, symbol size, and confidence. You also need to take care not to drop the spaces, as these are not handled as symbols. Quite a mess.
You need to look at the next symbol to decide what to do: a symbol can be "cancelled" by the next one or by the one after that. My code does not fix the problem completely, but the results are reasonable (some duplicates still get through, and a few correct symbols are removed by mistake).
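
A minimal sketch of the grouping idea follows. It is not Lorenzo's code: it assumes the duplicated symbols show up as horizontally overlapping boxes, uses only box overlap and confidence (not symbol size), and ignores spaces. The `Symbol` class, `overlap_ratio`, `dedupe_symbols` and the `min_overlap` threshold are hypothetical names for illustration; the input is assumed to be collected beforehand from a symbol-level iterator, in reading order.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Symbol:
    text: str
    conf: float  # recognition confidence, assumed to be on a 0..100 scale
    left: int    # left edge of the symbol box, in pixels
    right: int   # right edge of the symbol box, in pixels


def overlap_ratio(a: Symbol, b: Symbol) -> float:
    """Horizontal overlap as a fraction of the narrower box's width."""
    inter = min(a.right, b.right) - max(a.left, b.left)
    narrower = min(a.right - a.left, b.right - b.left)
    if inter <= 0 or narrower <= 0:
        return 0.0
    return inter / narrower


def dedupe_symbols(symbols: Iterable[Symbol],
                   min_overlap: float = 0.5) -> List[Symbol]:
    """Group overlapping symbols and keep the most confident one per group."""
    result: List[Symbol] = []
    group: List[Symbol] = []
    for sym in symbols:
        # Close the group once the incoming symbol overlaps none of its members.
        if group and all(overlap_ratio(sym, g) < min_overlap for g in group):
            result.append(max(group, key=lambda s: s.conf))
            group = []
        group.append(sym)
    if group:
        # Flush the last open group.
        result.append(max(group, key=lambda s: s.conf))
    return result


# The "15" -> "156" case: "6" sits almost on top of "5" with lower confidence.
symbols = [
    Symbol("1", 95.0, 0, 20),
    Symbol("5", 90.0, 22, 42),
    Symbol("6", 60.0, 24, 43),
]
cleaned = "".join(s.text for s in dedupe_symbols(symbols))
```

Keeping only the single best symbol per group is the simplest possible closing rule; a real implementation along the lines described above would also weigh symbol size and re-insert the spaces between words.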

If you want to try this, I suggest first writing some code to visualize the boxes, like this.

ocr_boxes_sanit2_11500.png
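
One way to get a similar overlay without any imaging library is to turn a box file into an SVG drawn on top of the page image. This is only a sketch: it assumes the standard box-file line layout (`symbol left bottom right top page`, with the origin at the bottom-left corner of the image), and `parse_box_line` and `boxes_to_svg` are hypothetical helper names. The image width and height in pixels have to be supplied by the caller.

```python
from xml.sax.saxutils import escape, quoteattr


def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top)."""
    # Split from the right so the five numeric fields are always found.
    sym, left, bottom, right, top, _page = line.rstrip("\n").rsplit(" ", 5)
    return sym, int(left), int(bottom), int(right), int(top)


def boxes_to_svg(box_lines, image_href, width, height):
    """Build an SVG that draws every symbol box over the page image."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">',
        f'<image href={quoteattr(image_href)} '
        f'width="{width}" height="{height}"/>',
    ]
    for line in box_lines:
        if not line.strip():
            continue
        sym, left, bottom, right, top = parse_box_line(line)
        # Box files count y from the bottom edge, SVG counts from the top.
        y = height - top
        parts.append(
            f'<rect x="{left}" y="{y}" width="{right - left}" '
            f'height="{top - bottom}" fill="none" stroke="red"/>'
        )
        # Label each box with the recognized symbol, just above the box.
        parts.append(
            f'<text x="{left}" y="{y - 2}" font-size="12" '
            f'fill="blue">{escape(sym)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Example with two made-up boxes on a 100x100 pixel page.
svg = boxes_to_svg(["1 10 20 30 60 0", "5 32 20 52 60 0"], "page.png", 100, 100)
```

Writing `svg` to a file next to the page image and opening it in a browser shows the boxes over the page, which makes overlapping or doubled symbols easy to spot.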


The very latest version of tesseract (check out and build from GitHub) handles boxes in a different (better) way, so if you want to try this you may want to use that version. I do not know whether it fixes this problem too.


Lorenzo
