Hello all:
I understand that Tesseract is open source and that no one is obligated to respond to issues, but I sincerely hope that Tesseract is still under active development and that whoever works on the LSTM engine code can take a look at the issue I am reporting.
When the LSTM engine finds more than one relatively high-confidence candidate character for a given area of the input image, it includes all of these alternative characters in the output stream. This happens in both the text output and the HOCR output.
Take a look at the following small snippet of character-level HOCR output from one of my test images:
<span class='ocrx_word' id='word_1_45' title='bbox 1717 271 1744 317; x_wconf 65'>
<span class='ocrx_cinfo' title='x_bboxes 1717 271 1733 317; x_conf 98.250854'>C</span>
<span class='ocrx_cinfo' title='x_bboxes 1721 275 1744 304; x_conf 95.007477'>c</span>
</span>
In this case, Tesseract is having a hard time deciding whether that part of the image is an uppercase or a lowercase letter 'c'. Consequently, the output in this little example contains two characters where there should be only one. As the box coordinates show, the two characters come from regions of the image that overlap heavily: the second box lies entirely within the vertical range of the first and covers three quarters of its width.
Perhaps this behavior is desirable in some cases, but there should be an option that tells the LSTM engine to output only the single highest-confidence character when candidate characters have significantly overlapping boxes, with the threshold for 'significantly' probably being a configurable percentage.
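To make the request concrete, here is a minimal Python sketch of the filtering I have in mind, written as a post-processing step over the HOCR output. It is not Tesseract code. The names parse_chars, overlap_ratio, drop_overlapping and the min_overlap parameter are hypothetical, the regular expression assumes spans formatted exactly like the snippet above, and measuring overlap against the smaller of the two boxes is just one possible definition.

```python
import html
import re
from typing import List, NamedTuple, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in image pixels


class Char(NamedTuple):
    text: str
    box: Box
    conf: float


# Assumption: character spans look exactly like the snippet in this post.
# Real HOCR files would be better handled with a proper HTML parser.
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>"
)

SAMPLE = """
<span class='ocrx_cinfo' title='x_bboxes 1717 271 1733 317; x_conf 98.250854'>C</span>
<span class='ocrx_cinfo' title='x_bboxes 1721 275 1744 304; x_conf 95.007477'>c</span>
"""


def parse_chars(hocr: str) -> List[Char]:
    """Extract (text, box, confidence) for every ocrx_cinfo span."""
    chars = []
    for m in CINFO_RE.finditer(hocr):
        x0, y0, x1, y1 = (int(m.group(i)) for i in range(1, 5))
        chars.append(Char(html.unescape(m.group(6)), (x0, y0, x1, y1), float(m.group(5))))
    return chars


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (w * h) / smaller if smaller > 0 else 0.0


def drop_overlapping(chars: List[Char], min_overlap: float = 0.5) -> List[Char]:
    """Keep only the highest-confidence character among overlapping boxes.

    Characters are visited from most to least confident; one is kept only
    if it overlaps no already-kept character by min_overlap or more.
    The original reading order is restored at the end.
    """
    kept: List[int] = []
    by_conf = sorted(range(len(chars)), key=lambda i: chars[i].conf, reverse=True)
    for i in by_conf:
        if all(overlap_ratio(chars[i].box, chars[j].box) < min_overlap for j in kept):
            kept.append(i)
    return [chars[i] for i in sorted(kept)]
```

With the two boxes from my snippet, the intersection covers roughly half of the smaller box, so drop_overlapping(parse_chars(SAMPLE)) with the default threshold keeps only the 'C', while a threshold of 0.9 would keep both characters. That sensitivity is why I think the percentage needs to be configurable.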
I would appreciate any feedback that anyone can provide on this subject.
Thanks,
Dave