Letters split in multiple parts


Lorenzo Bolzani

Jul 5, 2018, 12:59:26 PM
to tesser...@googlegroups.com

Hi,
I have a small problem: some letters are recognized as multiple letters.

Here is a sample (I can reproduce the problem with this image and the eng "_best" model):



output is: 17AE4L4

The 4 is seen as three different letters. Maybe the shape of this 4 is uncommon and that is causing the problem.

This is how tesseract sees the image (the data is taken from the bounding boxes returned by the iterator; a red dot marks the beginning of a symbol):




I'm wondering whether there is anything I can do to fix this other than training a custom model on this font (it is part of an MRZ, btw).

Even a small edit to the image, like cropping, makes the problem appear or disappear. The output for the other sample is: 17AESL

Are there any parameters, like a minimum box size or a split threshold, or something I can ask the iterator, that might help? Or is everything part of the LSTM?

I tried a quick fix based on the box sizes and confidences, but there are several variations and it is not so easy to get right.
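
For illustration, a box-size-and-confidence fix of this kind could look roughly like the sketch below. It is not the code I actually used: the `Symbol` record, the `merge_fragments` function and both thresholds are hypothetical, and nothing here is part of Tesseract. It assumes the symbol boxes and confidences have already been read from the iterator, and that the line is printed in a fixed-pitch font (as MRZ text usually is), so the median box width is a fair estimate of one character cell.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class Symbol:
    """One recognized symbol (hypothetical record, filled from the iterator)."""
    text: str
    conf: float  # confidence in the range 0-100
    left: int    # x coordinate of the left edge of the bounding box
    right: int   # x coordinate of the right edge of the bounding box

    @property
    def width(self) -> int:
        return self.right - self.left


def merge_fragments(symbols: List[Symbol],
                    min_width_ratio: float = 0.6,
                    max_conf: float = 80.0) -> List[Symbol]:
    """Collapse runs of adjacent narrow, low-confidence symbols into one "?".

    A symbol counts as a fragment when its box is narrower than
    min_width_ratio times the median width AND its confidence is below
    max_conf. Both thresholds are guesses that need tuning per data set.
    """
    if not symbols:
        return []

    typical = median(s.width for s in symbols)
    merged: List[Symbol] = []
    run: List[Symbol] = []  # fragments collected since the last good symbol

    def flush() -> None:
        if len(run) >= 2:
            # Several fragments in a row: treat them as one unreadable cell.
            merged.append(Symbol("?", min(s.conf for s in run),
                                 run[0].left, run[-1].right))
        else:
            # A single narrow symbol is kept unchanged.
            merged.extend(run)
        run.clear()

    for s in symbols:
        if s.width < min_width_ratio * typical and s.conf < max_conf:
            run.append(s)
        else:
            flush()
            merged.append(s)
    flush()
    return merged
```

The merged cell is marked "?" instead of being guessed, so it can be re-recognized from a re-cropped image or checked against the MRZ check digits. The weak point is exactly the one mentioned above: the fragments do not always come out narrow or with low confidence, so fixed thresholds miss some variations.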



I'm using:

tesseract 4.0.0-beta.3-56-g5fda
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE



Thanks, bye

Lorenzo



split1.png
split2.png

Lorenzo Bolzani

Jul 12, 2018, 6:01:11 AM
to tesser...@googlegroups.com

Any ideas about this? I'm encountering this problem quite often, even with custom training.

I tried some data augmentation during training, varying the number of pixels on the left, but it did not help.
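
To make the augmentation concrete, here is a minimal sketch of the idea, not the actual training pipeline. The image is assumed to be a plain list of grayscale rows with 255 as the white background; `pad_left` and `left_padding_variants` are hypothetical helper names, and a real setup would apply the same shift to the line images fed to training.

```python
import random
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 255 = white background


def pad_left(image: Image, pixels: int, background: int = 255) -> Image:
    """Return a copy of the image with `pixels` background columns added on the left."""
    return [[background] * pixels + list(row) for row in image]


def left_padding_variants(image: Image, max_pad: int, count: int,
                          seed: int = 0) -> List[Image]:
    """Generate `count` copies of the image, each with a random left margin.

    The margin is drawn uniformly from 0..max_pad (inclusive). The seed only
    makes the sketch reproducible.
    """
    rng = random.Random(seed)
    return [pad_left(image, rng.randint(0, max_pad)) for _ in range(count)]
```

Each variant shows the same text line at a slightly different horizontal offset, which is the variation that seemed to trigger the splitting when cropping.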

Should I report it as an issue on GitHub and discuss it there?


Thanks, bye

Lorenzo

