How to make Tesseract 3.3.0 read text from an image more accurately


Justin Yeh

Feb 20, 2020, 8:31:55 AM
to tesseract-ocr
Take the attached image as an example. Tesseract 3.3.0 (downloaded from NuGet) fails to recognize a few characters correctly: it confuses 8 with B, 5 with Z, 0 with O, and so on.

Is there any way to extract the string from the image more accurately? How can I avoid this kind of misrecognition? I tried making the characters bold, but the result is the same.

Any advice is welcome. Thanks in advance!
test2.PNG

Lakshay Saini

Feb 20, 2020, 8:36:36 AM
to tesseract-ocr
Hi there,

Image size and quality greatly affect OCR accuracy, so that may be the cause.

Also, you are using an old version of Tesseract. Try upgrading to 4.1.1 and then test the image again.
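To make the point about image size concrete, here is a minimal Python sketch of the resize arithmetic. It rests on two assumptions that are not from this thread: that roughly 300 DPI is a reasonable target for Tesseract input (a commonly cited guideline), and that the source is a 96 DPI screenshot of 400 x 120 pixels. The function names are hypothetical, and the same arithmetic carries over to C# or any other language.

```python
import math

# Assumption: about 300 DPI is a commonly cited minimum for Tesseract input.
TARGET_DPI = 300


def upscale_factor(current_dpi, target_dpi=TARGET_DPI):
    """Return the factor needed to reach target_dpi (never below 1.0)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    # Images already at or above the target are left at their original size.
    return max(1.0, target_dpi / current_dpi)


def upscaled_size(width, height, current_dpi, target_dpi=TARGET_DPI):
    """Return the (width, height) in pixels after upscaling to target_dpi."""
    factor = upscale_factor(current_dpi, target_dpi)
    return math.ceil(width * factor), math.ceil(height * factor)


if __name__ == "__main__":
    # Hypothetical 96 DPI screenshot of 400 x 120 pixels.
    print(upscaled_size(400, 120, 96))  # prints (1250, 375)
```

The resulting size would then be passed to whatever image library does the actual resampling before the image goes to Tesseract. Whether this helps with a particular image still has to be checked by testing.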

Regards,
Lakshay

Lakshay Saini

Feb 20, 2020, 8:46:08 AM
to tesseract-ocr
Hi again,

Here are the results from running Tesseract v4.0.0.20181030 on Windows. Please find the files attached.

Regards
Lakshay
test.pdf
test.txt

Justin Yeh

Feb 20, 2020, 9:30:13 AM
to tesseract-ocr
Sorry, I forgot to mention that I am using C# with Visual Studio 2015 to build an OCR application.

Right now, the only version I can get from NuGet is Tesseract 3.3.0.

I found this online, but I am not sure how to reference it in my current C# project.

On Thursday, February 20, 2020 at 9:31:55 PM UTC+8, Justin Yeh wrote:

Lakshay Saini

Feb 20, 2020, 9:57:33 AM
to tesseract-ocr
Hello,

You can go to my GitHub repo. A few months back I uploaded an executable file for Tesseract 4.0. If you use Windows, you can use it to install the newer version.

GitHub:
https://github.com/lakshay1296/ocrmyPDF_Windows?files=1

Regards
Lakshay

Jonathan Dahan

Feb 23, 2020, 12:08:24 PM
to tesseract-ocr
There's a branch based on 4.0. Check it out here:
