Tesseract cannot recognize clean webpage screenshot


JF

Nov 10, 2016, 1:14:26 PM
to tesseract-ocr

I'm using Tesseract (3.04.01 with leptonica-1.73) on Mac OS 10.12 to segment a clean screenshot of a web page. 

Here is the command:


    tesseract screen.png output.txt


screen.png:


screen.png


output.txt:


a CSS Regwstratmnfi x

C (D localnostr

Accoum Dexans

Eu a Pine: 5" a

Fifi/(‘3’ 22pm; J. , km?“ ”9

Persuna‘ Dexaus Funhev \muvmanun

«m s , (35‘ m Was :6 ms

FMS, Emms' (u v Jaruawy

*1: \(uax y ,

Chum

Terms and Mamng
m any ‘ ‘ Regwsley»

w lc‘asehe :avicxflza \zh»,:\':\e

Mm , (ism-ye I/Exzavheilédgémzéi


The output is complete garbage except for a few words like "Terms and". 

I've read the "ImproveQuality" wiki, but I don't think any case applies to this image. 

Could anyone please tell me which command line options I should set to make it work? 


Thanks in advance!

Allistair

Nov 10, 2016, 4:03:43 PM
to tesser...@googlegroups.com
What is it you are trying to achieve exactly?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0aa8871c-393d-4bdf-bd73-673cfa10494d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

JF

Nov 10, 2016, 7:20:23 PM
to tesseract-ocr
I have an app that needs to recognize text in screenshots. 

Does that matter? I would think this image is clean enough for Tesseract to recognize.

rkvsraman

Nov 11, 2016, 1:18:07 AM
to tesseract-ocr
Check whether the DPI is about 300. Screenshots generally have a lower DPI.
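
A rough sketch of that check and fix, assuming the Pillow library is installed and that the screenshot was captured at 72 DPI when the file carries no DPI metadata (a common default, but an assumption here). The function names and the 72 DPI fallback are illustrative and not part of Tesseract; whether upscaling is enough for this particular image is untested.

```python
def scale_for_dpi(source_dpi, target_dpi=300):
    """Factor by which to enlarge an image so its effective DPI reaches target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    # Never shrink: an image already at or above the target is left at 1.0.
    return max(1.0, target_dpi / source_dpi)


def scaled_size(width, height, source_dpi, target_dpi=300):
    """Pixel dimensions after enlarging to the target DPI."""
    factor = scale_for_dpi(source_dpi, target_dpi)
    return round(width * factor), round(height * factor)


def upscale_screenshot(src_path, dst_path, assumed_dpi=72, target_dpi=300):
    """Resize a screenshot with Pillow and save it tagged with the target DPI."""
    from PIL import Image  # third-party: pip install Pillow

    with Image.open(src_path) as img:
        # Use DPI metadata when present; otherwise fall back to the assumed value.
        dpi = img.info.get("dpi", (assumed_dpi, assumed_dpi))[0] or assumed_dpi
        new_size = scaled_size(img.width, img.height, float(dpi), target_dpi)
        resized = img.resize(new_size, Image.LANCZOS)
        resized.save(dst_path, dpi=(target_dpi, target_dpi))
    return new_size
```

Something like `upscale_screenshot("screen.png", "screen_300dpi.png")` followed by `tesseract screen_300dpi.png output` would then run OCR on the enlarged copy and write `output.txt`.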

ShreeDevi Kumar

Nov 11, 2016, 1:24:06 AM
to tesser...@googlegroups.com

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Allistair C

Nov 11, 2016, 3:08:29 AM
to tesser...@googlegroups.com
So if Tesseract were able to detect every piece of text perfectly, what would you use it for? It matters because you might not be approaching the problem the right way. For instance, people sometimes ask how to OCR a screen when what they really want is a portion of the screen, so there's usually a step before Tesseract to isolate rectangles of input.
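
As a hedged illustration of that pre-step, the sketch below crops rectangles out of a screenshot with Pillow so each can be passed to Tesseract on its own. The box coordinates, the `region` file prefix, and both function names are hypothetical; how the rectangles are found in the first place is left open.

```python
def clamp_box(box, width, height):
    """Clip a (left, top, right, bottom) box to the image bounds."""
    left, top, right, bottom = box
    left = max(0, min(left, width))
    right = max(0, min(right, width))
    top = max(0, min(top, height))
    bottom = max(0, min(bottom, height))
    if right <= left or bottom <= top:
        raise ValueError("box does not overlap the image")
    return left, top, right, bottom


def crop_regions(src_path, boxes, prefix="region"):
    """Save each box as its own PNG and return the list of file paths."""
    from PIL import Image  # third-party: pip install Pillow

    paths = []
    with Image.open(src_path) as img:
        for i, box in enumerate(boxes):
            region = img.crop(clamp_box(box, img.width, img.height))
            path = f"{prefix}_{i}.png"
            region.save(path)
            paths.append(path)
    return paths
```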

Soumya Ghosh

Nov 11, 2016, 3:36:56 PM
to tesseract-ocr
When you can scrape a web page's details with a scripting language like Python or PHP, why do you need to get text from a screenshot?
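
For example, a minimal sketch using only Python's standard library, assuming the page's HTML is already available as a string. The class and function names are illustrative, and pages that build their text with JavaScript would need a different approach.

```python
from html.parser import HTMLParser


class PageTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside skipped elements.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def page_text(html):
    """Return the text fragments of an HTML document in document order."""
    parser = PageTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling `page_text(html)` returns a list of strings, one per text node, which avoids OCR entirely when the HTML source is reachable.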
