Recognized characters get duplicated

Abstract

Jun 24, 2019, 4:43:48 AM

to tesseract-ocr

I am trying to run recognition with a custom-trained digits.traineddata. I took the original files by Shreeshrii and modified makedata.sh to include more script fonts, to capture the specific look of the digits.

The Tesseract binaries are 4.0.0 (from UB-Mannheim, where the build is labeled 4.1.0).

The results are quite good, but I see some strange behaviour:

1. Quite often the recognized data contains more text than really exists. For example, the scanned image has 4 digits, while the output contains 5 or even 6 digits.

It looks like some characters get recognized twice.

2. Sometimes an absolutely clear image area produces junk output: 20-30 digits, with some characters repeated many times (377777775555, etc.).

Any ideas?

Abstract

Jul 4, 2019, 8:03:32 AM

to tesseract-ocr

Some more information on my trained data:

real data: 12345678903542331100244117021234567

recognized: 12345678903542331411100244117021234567

(Note that instead of "11" several characters, "14111", were reported. In this case the spurious digit is "4".)

another pair real/recognized:

234567890542334239220071212345678905

2345678905423347239220071212345628905

Here the combination "72" was reported instead of "2", and I found several more cases where the same thing happened.
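
To locate such insertions without comparing the strings by eye, the real and recognized values can be aligned automatically. Below is a minimal sketch using Python's standard difflib module; the helper name ocr_diff is made up for illustration and is not part of Tesseract. It lists every span where the recognized string differs from the real one.

```python
from difflib import SequenceMatcher


def ocr_diff(real, recognized):
    """Return (tag, position_in_real, expected, got) for each differing span."""
    # autojunk=False disables difflib's "popular element" heuristic,
    # which is not wanted for strings built from only ten digits.
    matcher = SequenceMatcher(None, real, recognized, autojunk=False)
    return [
        (tag, i1, real[i1:i2], recognized[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The two real/recognized pairs quoted in this post.
PAIRS = [
    ("12345678903542331100244117021234567",
     "12345678903542331411100244117021234567"),
    ("234567890542334239220071212345678905",
     "2345678905423347239220071212345628905"),
]

for real, recognized in PAIRS:
    for tag, pos, expected, got in ocr_diff(real, recognized):
        print(f"{tag:8} at {pos:2}: expected {expected!r}, got {got!r}")
```

The alignment difflib picks is not guaranteed to match the way a human would split the strings, so the reported positions are a hint about where extra digits appeared, not an exact diagnosis.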

Shree Devi Kumar

Jul 4, 2019, 8:09:13 AM

to tesser...@googlegroups.com

This is an open issue - see https://github.com/tesseract-ocr/tesseract/issues/1060

and other related issues



Abstract

Jul 4, 2019, 9:24:37 AM

to tesseract-ocr

Thanks, but as far as I can see the issue has been open since 2017 and there is no clear solution.

Now I tried to get the recognition result via the iterator API, and the outcome is really strange.

All the characters are listed, and those that are "duplicates" share the same coordinates as the correct ones, but have different confidence values.

My first idea was to sort them by X coordinate and keep only the best-fitting values, BUT the X coordinates returned by TessPageIteratorBoundingBox turn out to be totally invalid.

This looks like a critical bug in Tesseract!!!
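
For reference, the filtering idea itself can be sketched as follows, assuming the boxes were trustworthy. This is plain Python with no Tesseract calls; the Symbol type, the dedupe_by_box helper and the sample values are all hypothetical and only illustrate "same box, keep the highest confidence, then order by X".

```python
from typing import List, NamedTuple, Tuple


class Symbol(NamedTuple):
    text: str
    conf: float
    box: Tuple[int, int, int, int]  # left, top, right, bottom


def dedupe_by_box(symbols: List[Symbol]) -> List[Symbol]:
    """Keep the most confident symbol per identical box, ordered by left edge."""
    best = {}
    for sym in symbols:
        current = best.get(sym.box)
        if current is None or sym.conf > current.conf:
            best[sym.box] = sym
    return sorted(best.values(), key=lambda s: s.box[0])


# Hypothetical input: the "4" shares its box with the first "1".
symbols = [
    Symbol("1", 98.6, (10, 0, 30, 40)),
    Symbol("4", 61.0, (10, 0, 30, 40)),
    Symbol("1", 97.0, (32, 0, 50, 40)),
]
print("".join(s.text for s in dedupe_by_box(symbols)))
```

This only works when duplicates really report identical boxes and the left edges reflect the reading order, which is exactly what fails in the output shown next.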

Let's take a line of "1234567890". Result returned by iterator is:

>> 1
Conf: 98,65
Box: 1805, 771, 1843, 813
>> 2
Conf: 99,00
Box: 1811, 771, 1875, 813
>> 3
Conf: 99,00
Box: 1843, 771, 1927, 813
>> 4
Conf: 99,00
Box: 1890, 771, 1964, 813
>> 5 <<< Damn, what is going on here?! Why is digit "5" reported with an X coordinate right after digit "3", when it really comes after digit "4"?!
Conf: 99,00
Box: 1927, 771, 2001, 813
>> 6 << This one is even more amazing. Digit "6" is reported at the position of digit "1", and its size is 30+ mm!!!
Conf: 99,02
Box: 1805, 771, 2195, 813
>> 7
Conf: 98,99
Box: 2005, 771, 2090, 813
>> 8
Conf: 98,96
Box: 2053, 771, 2127, 813
>> 9
Conf: 99,01
Box: 2095, 771, 2158, 813
>> 0
Conf: 98,98
Box: 2126, 771, 2190, 813
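
A quick way to spot such boxes automatically is to compare each width with the median width and to check that the left edges never move backwards. The sketch below uses the box values from the output above; the helper name suspicious_boxes and the width factor of 2.0 are arbitrary choices made for illustration.

```python
from statistics import median

# (text, left, top, right, bottom), copied from the iterator output above.
BOXES = [
    ("1", 1805, 771, 1843, 813),
    ("2", 1811, 771, 1875, 813),
    ("3", 1843, 771, 1927, 813),
    ("4", 1890, 771, 1964, 813),
    ("5", 1927, 771, 2001, 813),
    ("6", 1805, 771, 2195, 813),
    ("7", 2005, 771, 2090, 813),
    ("8", 2053, 771, 2127, 813),
    ("9", 2095, 771, 2158, 813),
    ("0", 2126, 771, 2190, 813),
]


def suspicious_boxes(boxes, width_factor=2.0):
    """Flag boxes that are far wider than the median or start left of the previous box."""
    widths = [right - left for _, left, _, right, _ in boxes]
    limit = width_factor * median(widths)
    flagged = []
    prev_left = None
    for (text, left, _, _, _), width in zip(boxes, widths):
        reasons = []
        if width > limit:
            reasons.append("too wide")
        if prev_left is not None and left < prev_left:
            reasons.append("left edge goes backwards")
        if reasons:
            flagged.append((text, reasons))
        prev_left = left
    return flagged


for text, reasons in suspicious_boxes(BOXES):
    print(text, ", ".join(reasons))
```

On this data only the box for "6" is flagged (390 px wide against a median of 74 px, and starting at 1805 after a box that starts at 1927). The check does not catch the milder problem that most neighbouring boxes overlap each other.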

On Thursday, July 4, 2019 at 15:09:13 UTC+3, shree wrote:

Abstract

Jul 4, 2019, 9:35:00 AM

to tesseract-ocr

Also, the results change depending on the recognition mode. Everything said above was for PSM_SINGLE_CHAR mode. libtesseract-4.dll seems to have a bug in this mode; at the very least it produces some debug output that should not appear.

After I changed to PSM_SINGLE_LINE, the returned coordinates are much better.
