Using tesseract on browser page insufficient


Alexander Dietz

Feb 19, 2020, 11:25:31 AM
to tesseract-ocr
Hello,

I am trying to use Tesseract (Tesseract Open Source OCR Engine v3.04.01 with Leptonica) on a screenshot of a browser window, so the text in the image is entirely computer-generated. However, the results are poor. The image is attached below, and the text is clearly legible:

Selection_991.png

but tesseract only finds "Ever felthmg cement from Guhub".

How can I improve the accuracy?


Thanks
Alex

Lakshay Saini

Feb 19, 2020, 11:45:09 AM
to tesseract-ocr
Hello,

It mostly depends on the image quality. You can try a newer version of Tesseract, and you can also try different page segmentation (psm) and OCR engine (oem) modes.

For more information on the modes use:
tesseract --help-extra
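
As a rough illustration of trying several modes, here is a minimal Python sketch that only builds one command line per page segmentation mode, so the outputs can be compared side by side. It assumes the Tesseract 4.x option spelling (`--psm`, `--oem`); the helper name `build_psm_sweep`, the candidate modes, and the output naming are hypothetical choices, not something taken from the Tesseract documentation.

```python
from pathlib import Path


def build_psm_sweep(image, psm_modes=(3, 6, 11), oem=None):
    """Return one tesseract command (as an argument list) per psm mode.

    Assumes the Tesseract 4.x option spelling (--psm / --oem).
    Output base names such as "Selection_991_psm6" are a made-up convention.
    """
    stem = Path(image).stem
    commands = []
    for psm in psm_modes:
        cmd = ["tesseract", str(image), f"{stem}_psm{psm}", "--psm", str(psm)]
        if oem is not None:
            cmd += ["--oem", str(oem)]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Only prints the commands; pass each list to subprocess.run() to run OCR.
    for cmd in build_psm_sweep("Selection_991.png"):
        print(" ".join(cmd))
```

Each command writes to its own output base, so the resulting text files can be compared to see which mode suits the screenshot best.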

Regards
Lakshay

Alexander Dietz

Feb 19, 2020, 12:17:07 PM
to tesseract-ocr
I seem to have problems with that. I tried:

tesseract  Selection_991.png output -oem 11
tesseract  Selection_991.png output -psm 11

But I get an error

read_params_file: Can't open 11


Could you provide an explicit example of how to use these options? The help gives no example, and the help text itself is not clear!


Shree Devi Kumar

Feb 19, 2020, 12:33:27 PM
to tesseract-ocr
You are using an old version of the software.


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/32c208ab-589b-497f-9f0c-8db002684df4%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Lakshay Saini

Feb 19, 2020, 12:34:44 PM
to tesseract-ocr
Hi,

Yes, you get that error because the engine and page segmentation modes must be given with two dashes: "--psm 11" instead of "-psm 11". The same goes for oem. Also, oem 11 does not exist.

For example:

tesseract "Selection_991.png" "Selection_991" --psm 11 --oem 1 pdf

Or

tesseract "Selection_991.png" "Selection_991" --psm 11 pdf
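
The mode values can also be checked before calling the program. The sketch below is a hypothetical Python helper (the name `tesseract_args` is made up) that assumes Tesseract 4.x, where `--oem` accepts 0 to 3 and `--psm` accepts 0 to 13. Other versions may differ, so check `tesseract --help-extra` on your installation.

```python
VALID_OEM = range(0, 4)   # assumption: Tesseract 4.x engine modes 0..3
VALID_PSM = range(0, 14)  # assumption: Tesseract 4.x page segmentation modes 0..13


def tesseract_args(image, output_base, psm=None, oem=None, config=None):
    """Build an argument list: image, output base, options, then config file."""
    args = ["tesseract", image, output_base]
    if psm is not None:
        if psm not in VALID_PSM:
            raise ValueError(f"psm {psm} is out of range 0..13")
        args += ["--psm", str(psm)]
    if oem is not None:
        if oem not in VALID_OEM:
            raise ValueError(f"oem {oem} is out of range 0..3")
        args += ["--oem", str(oem)]
    if config is not None:
        args.append(config)  # e.g. "pdf" to request PDF output
    return args


if __name__ == "__main__":
    print(tesseract_args("Selection_991.png", "Selection_991", psm=11, oem=1, config="pdf"))
    try:
        tesseract_args("Selection_991.png", "output", oem=11)
    except ValueError as err:
        print("rejected:", err)
```

The returned list can be handed to `subprocess.run()`; an invalid value such as oem 11 is rejected before Tesseract is ever started.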

For more details, you can refer to my code on GitHub (it's a PDF-to-PDF converter written in Python):

https://github.com/lakshay1296/OCR_Conversion_JPEG2PDF

It's still in development, and the name is misleading. To try it on your image, first convert the image to a PDF and then feed that to the code.

Make sure you have all the dependencies installed. Add poppler's path to your environment variables and update it in the code as well, or remove the poppler_path variable from the code. Then run GUI_Class.py.

Alexander Dietz

Feb 19, 2020, 2:54:16 PM
to tesseract-ocr
How do I upgrade to the latest version of the software?

I tried

sudo apt-get update && sudo apt-get upgrade tesseract-ocr

but got an error in that process, related to some other package.

Any idea how to install the up-to-date version of tesseract instead?




Alexander Dietz

Feb 20, 2020, 3:20:01 AM
to tesseract-ocr
After updating to version 4 (an undocumented procedure!!), the application works much better.
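
Since the installed version turned out to matter here, it can be worth checking it up front. A minimal Python sketch, assuming the first line of `tesseract --version` looks like `tesseract 4.1.1`; the helper names are hypothetical, and stdout and stderr are merged as a precaution because older builds may print the version to stderr.

```python
import re
import subprocess


def parse_tesseract_version(text):
    """Extract (major, minor, patch) from `tesseract --version` output.

    Assumes a line such as "tesseract 4.1.1" or "tesseract 3.04.01".
    Returns None if no version number is found.
    """
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def installed_tesseract_version():
    """Run `tesseract --version` and parse whatever it prints."""
    result = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams; older builds may use stderr
        text=True,
        check=False,
    )
    return parse_tesseract_version(result.stdout)


if __name__ == "__main__":
    version = installed_tesseract_version()
    if version is None:
        print("could not determine the Tesseract version")
    elif version[0] < 4:
        print("Tesseract", version, "is older than 4.x; consider upgrading")
    else:
        print("Tesseract", version)
```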

Thanks


Zdenko Podobny

Feb 20, 2020, 4:00:01 AM
to tesser...@googlegroups.com
Why should we document how to use Ubuntu? You should be familiar with your OS.
PPA repositories for each Tesseract version are listed at https://tesseract-ocr.github.io/tessdoc/Home.html

Zdenko

