Issue with Multiple Character Choices in Output Stream

Dave Wood

Oct 31, 2019, 1:56:53 AM

to tesseract-dev

Please refer to the following link, which describes changes made to lstm_choices_mode:

https://github.com/tesseract-ocr/tesseract/pull/2635

Unless I misunderstand what these options are supposed to do, it appears that there is a bug or oversight. Please refer to this thread in the user group:

https://groups.google.com/forum/#!topic/tesseract-ocr/5tC6appoUgE

There seems to be no way to prevent the LSTM engine from including duplicate characters in the generated text and/or HOCR output. The thread above shows a clear example of this.

Surely there must be an option to force Tesseract to choose the single character option with the highest confidence level.

Thanks

Dave Wood

Nov 28, 2019, 1:51:45 AM

to tesseract-dev

Hello all:

I understand that Tesseract is open source and that there is therefore no obligation for anyone to respond to issues that arise, but I sincerely hope that development is still active and that whoever is working on the LSTM engine code can take a look at the issue I am reporting.

When the LSTM engine finds more than one relatively high-confidence character for a given area of the incoming image, it includes all of these alternative characters in the output stream. It behaves this way for both the text output and the HOCR output.

Take a look at the following small snippet of character level HOCR output from one of my test images:

C

c

So in this case, Tesseract is having a hard time deciding whether that part of the incoming image is an uppercase or a lowercase letter 'c'. Consequently, the output in this little example contains two characters where there should be only one. As is obvious from the box dimensions, these output characters are being generated from box areas on the image that overlap by 90% or more.

Perhaps this behavior is desirable in some cases, but there should be an option to tell the LSTM engine to just include the single highest confidence level character when there are output characters whose boxes overlap significantly, with 'significantly' probably needing to be a configurable percentage.
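Until such an option exists, the requested behavior could be approximated as a post-processing step on the character-level output. The following is only a rough sketch, not Tesseract code: the `CharBox` type, the `dedupe_overlapping` function, and the choice to measure overlap relative to the smaller box are all assumptions made for illustration. It expects the character boxes and confidences to have already been parsed out of the HOCR output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharBox:
    """One recognized character (hypothetical type, not part of Tesseract)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int
    conf: float  # confidence; higher means more certain


def overlap_ratio(a: CharBox, b: CharBox) -> float:
    """Intersection area divided by the area of the smaller of the two boxes."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if w <= 0 or h <= 0:
        return 0.0  # boxes do not intersect
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    smaller = min(area_a, area_b)
    return (w * h) / smaller if smaller > 0 else 0.0


def dedupe_overlapping(chars: List[CharBox], threshold: float = 0.9) -> List[CharBox]:
    """Keep only the highest-confidence character among boxes that overlap
    by at least `threshold`; the original output order is preserved."""
    # Visit candidates from most to least confident.
    by_conf = sorted(range(len(chars)), key=lambda i: chars[i].conf, reverse=True)
    kept: List[int] = []
    for i in by_conf:
        # Accept a character only if it does not significantly overlap
        # any character that has already been accepted.
        if all(overlap_ratio(chars[i], chars[j]) < threshold for j in kept):
            kept.append(i)
    return [chars[i] for i in sorted(kept)]
```

Called on a list containing a 'C' and a 'c' whose boxes nearly coincide, this would return only whichever of the two has the higher confidence. The `threshold` parameter plays the role of the configurable percentage suggested above.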

I would appreciate any feedback that anyone can provide on this subject.

Thanks,

Dave
