Thanks, but as far as I can see the problem has been open since 2017, and there is no clear solution.
Now I tried to get the recognition result via the iterator API, and the output is really strange.
All the characters are listed, and those that are "duplicates" share the same coordinates as the correct ones, but have different confidence values.
My first idea was to sort them by X coordinate and just keep the best-fit values, BUT the X coordinates returned by TessPageIteratorBoundingBox turn out to be totally invalid.
It seems to be a critical bug in Tesseract!!!
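For reference, the first idea would look roughly like this in Python. This is only a sketch: `Symbol` and `best_per_box` are made-up names, the sample values are invented (not real Tesseract output), and it assumes each symbol has already been read from the iterator as text, confidence and box. It only works if the boxes can be trusted, which is exactly what fails here.

```python
from typing import List, NamedTuple, Tuple


class Symbol(NamedTuple):
    text: str
    conf: float
    box: Tuple[int, int, int, int]  # left, top, right, bottom


def best_per_box(symbols: List[Symbol]) -> List[Symbol]:
    """Keep the highest-confidence symbol for each distinct box,
    then order the survivors by their left X coordinate."""
    best = {}
    for s in symbols:
        kept = best.get(s.box)
        if kept is None or s.conf > kept.conf:
            best[s.box] = s
    return sorted(best.values(), key=lambda s: s.box[0])


# Invented sample: "8" and "B" are "duplicates" sharing one box.
sample = [
    Symbol("8", 99.0, (10, 0, 20, 10)),
    Symbol("B", 61.5, (10, 0, 20, 10)),
    Symbol("1", 98.0, (0, 0, 8, 10)),
]
line = "".join(s.text for s in best_per_box(sample))
print(line)
```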
Let's take a line containing "1234567890". The result returned by the iterator is:
>> 1
Conf: 98,65
Box: 1805, 771, 1843, 813
>> 2
Conf: 99,00
Box: 1811, 771, 1875, 813
>> 3
Conf: 99,00
Box: 1843, 771, 1927, 813
>> 4
Conf: 99,00
Box: 1890, 771, 1964, 813
>> 5 <<< Damn, what is going on here?! Why is "5" reported with an X coordinate right after "3", when it really comes after "4"?!
Conf: 99,00
Box: 1927, 771, 2001, 813
>> 6 << This one is even more amazing. "6" is reported at the position of "1", and its width is 30+ mm!!!
Conf: 99,02
Box: 1805, 771, 2195, 813
>> 7
Conf: 98,99
Box: 2005, 771, 2090, 813
>> 8
Conf: 98,96
Box: 2053, 771, 2127, 813
>> 9
Conf: 99,01
Box: 2095, 771, 2158, 813
>> 0
Conf: 98,98
Box: 2126, 771, 2190, 813
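To make the problem easier to spot, here is a small Python check over the boxes listed above. It is a rough sketch only: `suspicious` is a made-up helper, and the "more than twice the median width" threshold is an arbitrary assumption. It flags any box that is far wider than the others or whose left edge jumps back to the left of the previous one.

```python
from statistics import median

# (text, (left, top, right, bottom)) exactly as printed above
boxes = [
    ("1", (1805, 771, 1843, 813)),
    ("2", (1811, 771, 1875, 813)),
    ("3", (1843, 771, 1927, 813)),
    ("4", (1890, 771, 1964, 813)),
    ("5", (1927, 771, 2001, 813)),
    ("6", (1805, 771, 2195, 813)),
    ("7", (2005, 771, 2090, 813)),
    ("8", (2053, 771, 2127, 813)),
    ("9", (2095, 771, 2158, 813)),
    ("0", (2126, 771, 2190, 813)),
]


def suspicious(items, factor=2.0):
    """Return the texts whose box is wider than factor * median width
    or whose left edge lies to the left of the previous box's left edge."""
    widths = [right - left for _, (left, _t, right, _b) in items]
    typical = median(widths)
    flagged = []
    prev_left = None
    for (text, (left, _t, _r, _b)), width in zip(items, widths):
        too_wide = width > factor * typical
        goes_back = prev_left is not None and left < prev_left
        if too_wide or goes_back:
            flagged.append(text)
        prev_left = left
    return flagged


print(suspicious(boxes))  # → ['6']
```

The median width here is 74 px, so only the 390 px box of "6" is flagged. The check says nothing about the heavy overlap between neighbouring boxes, which would need a separate test.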
On Thursday, 4 July 2019 at 15:09:13 UTC+3, user shree wrote: