I am writing an application where I can select different pieces of text in the preview. For performance reasons, I keep the selected image region in memory and then OCR it from there. The problem is that in some cases I get extra characters, such as underscores and semicolons, that do not appear in the image (see the example below for file.png). I am using the German trained model with PSM 1 (automatic page segmentation with OSD).
See file: https://i.postimg.cc/kgxM4N5P/file.png
1. OCR result from the image loaded from disk:
2. OCR result from the image in memory:
There are more differences, of course; I ran a few tests, and the results depend on how I select the text (see below).
I am using the C API and writing the code in C++, passing the in-memory image with the TessBaseAPISetImage function. I don't know where the difference comes from. Do you have any ideas what I could do to get the same OCR results from memory as when reading from a file? A simplified sketch of my two code paths is shown below.
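To show what I mean, here is a self-contained sketch of the two paths (not my exact code: the filename file.png, the grayscale conversion, and the buffer parameters are just illustrative assumptions; in the real application the buffer comes from the preview's selection capture):

```cpp
#include <tesseract/capi.h>
#include <leptonica/allheaders.h>
#include <cstdio>
#include <vector>

int main() {
    TessBaseAPI* api = TessBaseAPICreate();
    if (TessBaseAPIInit3(api, nullptr, "deu") != 0) {   // German traineddata
        fprintf(stderr, "could not init tesseract\n");
        return 1;
    }
    TessBaseAPISetPageSegMode(api, PSM_AUTO_OSD);        // PSM 1

    // 1. OCR from disk: Leptonica decodes the PNG and hands Tesseract a Pix.
    PIX* pix = pixRead("file.png");
    if (!pix) return 1;
    TessBaseAPISetImage2(api, pix);
    char* fromFile = TessBaseAPIGetUTF8Text(api);
    printf("from file:\n%s\n", fromFile);

    // 2. OCR from memory: a raw 8-bit grayscale buffer passed with
    //    TessBaseAPISetImage. It is built here from the same PNG only so the
    //    example is self-contained.
    PIX* gray = pixConvertTo8(pix, 0);
    int w = pixGetWidth(gray), h = pixGetHeight(gray);
    std::vector<unsigned char> buf(static_cast<size_t>(w) * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            l_uint32 v = 0;
            pixGetPixel(gray, x, y, &v);
            buf[static_cast<size_t>(y) * w + x] = static_cast<unsigned char>(v);
        }
    // bytes_per_pixel = 1 and bytes_per_line = w must match the buffer
    // layout exactly.
    TessBaseAPISetImage(api, buf.data(), w, h, 1, w);
    char* fromMemory = TessBaseAPIGetUTF8Text(api);
    printf("from memory:\n%s\n", fromMemory);

    TessDeleteText(fromFile);
    TessDeleteText(fromMemory);
    pixDestroy(&gray);
    pixDestroy(&pix);
    TessBaseAPIDelete(api);
    return 0;
}
```

My understanding is that the bytes_per_pixel and bytes_per_line arguments have to describe the buffer layout exactly, so if the differences come from that, I would like to know how to pass the selection buffer correctly.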