Tesseract cannot recognize clean webpage screenshot

JF

Nov 10, 2016, 1:14:26 PM

to tesseract-ocr

I'm using Tesseract (3.04.01 with leptonica-1.73) on Mac OS 10.12 to segment a clean screenshot of a web page.

Here is the command:

    tesseract screen.png output.txt

screen.png:

output.txt:

a CSS Regwstratmnﬁ x

C (D localnostr

Accoum Dexans

Eu a Pine: 5" a

Fiﬁ/(‘3’ 22pm; J. , km?“ ”9

Persuna‘ Dexaus Funhev \muvmanun

«m s , (35‘ m Was :6 ms

FMS, Emms' (u v Jaruawy

*1: \(uax y ,

Chum

Terms and Mamng
m any ‘ ‘ Regwsley»

w lc‘asehe :avicxﬂza \zh»,:\':\e

Mm , (ism-ye I/Exzavheilédgémzéi

The output is complete garbage except for a few words like "Terms and".

I've read the "ImproveQuality" wiki, but I don't think any case applies to this image.

Could anyone please tell me which command line options I should set to make it work?

Thanks in advance!

Allistair

Nov 10, 2016, 4:03:43 PM

to tesser...@googlegroups.com

What exactly are you trying to achieve?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0aa8871c-393d-4bdf-bd73-673cfa10494d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

JF

Nov 10, 2016, 7:20:23 PM

to tesseract-ocr

I have an app that needs to recognize text in screenshots.

Does that matter? I would think this image is clean enough for Tesseract to recognize.

rkvsraman

Nov 11, 2016, 1:18:07 AM

to tesseract-ocr

Check whether the DPI is about 300. Screenshots generally have a lower DPI.
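
A minimal sketch of that idea in Python, assuming the Pillow library is installed and that the screenshot was captured at roughly 72 DPI (an assumption; many screens are 96 or 144). The file names and the helper names `upscale_factor` and `upscale_for_ocr` are made up for illustration. It enlarges the image so its effective resolution is near 300 DPI before it is passed to Tesseract.

```python
TARGET_DPI = 300          # rule of thumb for Tesseract input
ASSUMED_SCREEN_DPI = 72   # assumption: adjust to the real capture resolution


def upscale_factor(source_dpi, target_dpi=TARGET_DPI):
    """Return the enlargement needed to reach target_dpi (never shrinks)."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return max(1.0, target_dpi / source_dpi)


def upscale_for_ocr(src_path, dst_path, source_dpi=ASSUMED_SCREEN_DPI):
    """Enlarge a screenshot, convert it to grayscale, and save it with a DPI tag."""
    from PIL import Image  # Pillow, imported lazily so the helper above works without it

    factor = upscale_factor(source_dpi)
    with Image.open(src_path) as img:
        new_size = (round(img.width * factor), round(img.height * factor))
        resized = img.convert("L").resize(new_size, Image.LANCZOS)
        resized.save(dst_path, dpi=(TARGET_DPI, TARGET_DPI))
    return new_size
```

For example, `upscale_for_ocr("screen.png", "screen_300dpi.png")` followed by `tesseract screen_300dpi.png output` would run the recognition on the enlarged copy. Whether this is enough for a given screenshot is something to test rather than assume.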

ShreeDevi Kumar

Nov 11, 2016, 1:24:06 AM

to tesser...@googlegroups.com

There are more tips at https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

ShreeDevi

Allistair C

Nov 11, 2016, 3:08:29 AM

to tesser...@googlegroups.com

So if Tesseract were able to detect every piece of text perfectly, what would you use it for? It matters because you might not be framing the problem properly. For instance, people sometimes ask how to OCR a screen when what they really want is a portion of the screen, so there is usually a step before Tesseract to isolate rectangles of input.
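
A rough sketch of such an isolation step in Python, assuming Pillow is installed and that you already know the pixel rectangle you care about. The helper names `clamp_box` and `crop_region` and the coordinates are hypothetical. The box is `(left, top, right, bottom)`, as Pillow's `Image.crop` expects.

```python
def clamp_box(box, image_size):
    """Clip a (left, top, right, bottom) box to the image bounds."""
    left, top, right, bottom = box
    width, height = image_size
    left = max(0, min(left, width))
    right = max(0, min(right, width))
    top = max(0, min(top, height))
    bottom = max(0, min(bottom, height))
    if right <= left or bottom <= top:
        raise ValueError("box does not overlap the image")
    return (left, top, right, bottom)


def crop_region(src_path, dst_path, box):
    """Save only the given rectangle of the screenshot for OCR."""
    from PIL import Image  # Pillow, imported lazily so clamp_box works without it

    with Image.open(src_path) as img:
        safe_box = clamp_box(box, img.size)
        img.crop(safe_box).save(dst_path)
    return safe_box
```

Something like `crop_region("screen.png", "form.png", (0, 120, 900, 700))` would then give Tesseract only the form area instead of the browser chrome, tabs, and address bar.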

Soumya Ghosh

Nov 11, 2016, 3:36:56 PM

to tesseract-ocr

When you can scrape a web page's details with a scripting language like Python or PHP, why do you need to get the text from a screenshot?
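
As a rough illustration of the scraping route, here is a small sketch that uses only Python's standard `html.parser` module. The class name `VisibleTextExtractor`, the function `extract_text`, and the sample markup are made up for the example, and a real page may need more handling (hidden elements, text rendered by JavaScript, and so on).

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text fragments of an HTML string, in document order."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample markup, not the page from the screenshot.
sample = "<html><body><h1>Sign up</h1><script>var x = 1;</script><p>First name</p></body></html>"
print(extract_text(sample))
```

If the HTML is available to your app, this gives exact text with no recognition errors; OCR is only needed when the screenshot is all you have.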
