Different OCR results for the same image - file from disk vs from memory


krzysiekj94

May 19, 2022, 10:35:35 AM

to tesseract-dev

Environment

Tesseract Version: 5.1.0
Platform: Windows 32-bit, compiled under MSVC 2017

Current Behavior:

I am writing an application in which the user can select different pieces of text in a preview. For performance reasons, I keep the selected region in memory and run OCR on it there instead of writing it to disk first. The problem is that in some cases the result contains spurious characters, such as underscores and semicolons, that do not appear in the image (see the example below for file.png). I am using the German trained model with PSM 1 (automatic page segmentation with OSD).

See file: https://i.postimg.cc/kgxM4N5P/file.png

1. OCR of the image loaded from disk:

2. OCR from memory:

Of course, there are more differences. I ran a few tests, and the results depend on how I select the text, see below:

I am using the C API from C++ code, and I pass the image with TessBaseAPISetImage. I don't know where the difference comes from. Do you have any ideas what I could do to get the same OCR results from memory as when the image is read from a file?
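For anyone comparing the two paths, here is a minimal sketch of one thing worth ruling out. It is not a confirmed diagnosis. It assumes the in-memory preview is a top-down 32-bit BGRA buffer with row padding, as is common for Windows bitmaps, while TessBaseAPISetImage reads the channels of a 24-bit buffer in R, G, B order and uses bytes_per_line as the row stride. The names Rect and cropBgraToRgb are hypothetical helpers, not part of Tesseract. A file loaded from disk may also carry resolution metadata that a raw buffer lacks, so the sketch shows where TessBaseAPISetSourceResolution would go; the 300 DPI value is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical description of the selected region, in pixels.
struct Rect {
    int x;
    int y;
    int width;
    int height;
};

// Copies a rectangle out of a top-down BGRA buffer (srcStride bytes per row,
// possibly padded) into a tightly packed RGB buffer with no row padding.
std::vector<std::uint8_t> cropBgraToRgb(const std::uint8_t* src,
                                        int srcWidth, int srcHeight,
                                        std::size_t srcStride,
                                        const Rect& r)
{
    if (src == nullptr || r.x < 0 || r.y < 0 || r.width <= 0 || r.height <= 0 ||
        r.x + r.width > srcWidth || r.y + r.height > srcHeight ||
        srcStride < static_cast<std::size_t>(srcWidth) * 4) {
        throw std::invalid_argument("selection does not fit the source image");
    }

    std::vector<std::uint8_t> rgb;
    rgb.reserve(static_cast<std::size_t>(r.width) * r.height * 3);

    for (int row = 0; row < r.height; ++row) {
        // Use the real stride of the source, not width * 4, to find each row.
        const std::uint8_t* p =
            src + static_cast<std::size_t>(r.y + row) * srcStride +
            static_cast<std::size_t>(r.x) * 4;
        for (int col = 0; col < r.width; ++col) {
            rgb.push_back(p[2]);  // R
            rgb.push_back(p[1]);  // G
            rgb.push_back(p[0]);  // B (alpha in p[3] is dropped)
            p += 4;
        }
    }
    return rgb;
}

// Intended use with the Tesseract C API (requires <tesseract/capi.h>):
//
//   std::vector<std::uint8_t> rgb = cropBgraToRgb(pixels, w, h, stride, sel);
//   TessBaseAPISetImage(handle, rgb.data(), sel.width, sel.height,
//                       3,               // bytes_per_pixel
//                       sel.width * 3);  // bytes_per_line of the packed copy
//   TessBaseAPISetSourceResolution(handle, 300);  // assumed DPI, after SetImage
```

If the buffer already is packed RGB, the conversion is unnecessary, but it is still worth checking that the bytes_per_line value passed to TessBaseAPISetImage matches the buffer's actual row stride. Saving the in-memory buffer to a PNG and running OCR on that file is a quick way to see whether the pixel data or the loading path is responsible for the difference.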
