Issue with Multiple Character Choices in Output Stream

Dave Wood

Oct 31, 2019, 1:56:53 AM
to tesseract-dev

Please refer to the following pull request, which makes changes to lstm_choice_mode:


https://github.com/tesseract-ocr/tesseract/pull/2635


Unless I misunderstand what these options are supposed to do, there appears to be a bug or oversight. Please refer to this thread in the user group:


https://groups.google.com/forum/#!topic/tesseract-ocr/5tC6appoUgE


There seems to be no way to prevent the LSTM engine from including duplicate characters in the generated text and/or hOCR output. The thread above gives a clear example of this.


Surely there must be an option to force Tesseract to choose the single character option with the highest confidence level.


Thanks

Dave Wood

Nov 28, 2019, 1:51:45 AM
to tesseract-dev
Hello all:

I understand that Tesseract is open source and that no one is obliged to respond to issues that arise. Still, I sincerely hope that development on Tesseract is ongoing, and that whoever is working on the LSTM engine code can take a look at the issue I am reporting.

When the LSTM engine finds more than one relatively high-confidence character for a given area of the input image, it includes all of these alternative characters in the output stream. It does this for both the text output and the hOCR output.

Take a look at the following small snippet of character-level hOCR output from one of my test images:

      <span class='ocrx_word' id='word_1_45' title='bbox 1717 271 1744 317; x_wconf 65'>
       <span class='ocrx_cinfo' title='x_bboxes 1717 271 1733 317; x_conf 98.250854'>C</span>
       <span class='ocrx_cinfo' title='x_bboxes 1721 275 1744 304; x_conf 95.007477'>c</span>
      </span>

In this case, Tesseract is having a hard time deciding whether that part of the image is an uppercase or a lowercase letter 'c'. Consequently, the output in this small example contains two characters where there should be only one. As the box coordinates show, the two characters come from areas of the image that overlap substantially: the intersection covers roughly half of the area of the smaller box.
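
For anyone who wants to check the overlap on their own output, here is a minimal Python sketch. It is not part of Tesseract; the helper names (parse_bbox, overlap_fraction) are made up for illustration, and measuring overlap as intersection area divided by the area of the smaller box is just one assumed definition among several possible ones.

```python
import re

def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title such as
    'x_bboxes 1717 271 1733 317; x_conf 98.250854'."""
    m = re.search(r"x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
    if m is None:
        raise ValueError("no x_bboxes property in title: %r" % title)
    return tuple(int(v) for v in m.groups())

def area(box):
    """Area of a box; empty or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def overlap_fraction(a, b):
    """Intersection area divided by the area of the smaller box
    (an assumed measure, chosen for this sketch)."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    smaller = min(area(a), area(b))
    return inter / smaller if smaller else 0.0

# The two character boxes from the snippet above.
upper_c = parse_bbox("x_bboxes 1717 271 1733 317; x_conf 98.250854")
lower_c = parse_bbox("x_bboxes 1721 275 1744 304; x_conf 95.007477")

print("overlap relative to smaller box: %.2f" % overlap_fraction(upper_c, lower_c))
```

Other measures, such as intersection over union, would give a lower number for the same pair, so any threshold has to be chosen together with the measure.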

Perhaps this behavior is desirable in some cases, but there should be an option that tells the LSTM engine to output only the highest-confidence character when the boxes of several output characters overlap significantly, with 'significantly' probably needing to be a configurable percentage.
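
To make the request concrete, the behavior I am asking for can be sketched as a post-processing step in Python. This is not how Tesseract works internally and it is not an existing option. It is a greedy non-maximum suppression over the character boxes, and the names Glyph, drop_overlapping and min_overlap, as well as the 0.5 default, are assumptions made for this sketch.

```python
from typing import List, NamedTuple, Tuple

class Glyph(NamedTuple):
    text: str
    box: Tuple[int, int, int, int]  # x0, y0, x1, y1 as in x_bboxes
    conf: float                     # x_conf value

def _overlap(a, b):
    """Intersection area divided by the area of the smaller box."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]),
                  (b[2] - b[0]) * (b[3] - b[1]))
    return (w * h) / smaller if smaller > 0 else 0.0

def drop_overlapping(glyphs: List[Glyph], min_overlap: float = 0.5) -> List[Glyph]:
    """Keep only the highest-confidence glyph among boxes that overlap
    by at least min_overlap; the original reading order is preserved."""
    kept = []  # indices of glyphs that survive
    by_conf = sorted(range(len(glyphs)),
                     key=lambda i: glyphs[i].conf, reverse=True)
    for i in by_conf:
        # Accept a glyph only if it does not collide with a better one.
        if all(_overlap(glyphs[i].box, glyphs[j].box) < min_overlap
               for j in kept):
            kept.append(i)
    return [glyphs[i] for i in sorted(kept)]

# The word from the hOCR snippet above.
word = [
    Glyph("C", (1717, 271, 1733, 317), 98.250854),
    Glyph("c", (1721, 275, 1744, 304), 95.007477),
]
print("".join(g.text for g in drop_overlapping(word)))  # prints "C"
```

With the assumed default of 0.5, the word from my snippet is reduced to the single character 'C'. Exposing min_overlap as a parameter corresponds to the configurable percentage mentioned above.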

I would appreciate any feedback that anyone can provide on this subject.

Thanks,

Dave