Poor results with tesseract OCR'ing .tif (as compared to an on-line OCR)


BDristan

Oct 24, 2014, 2:25:55 PM
to tesser...@googlegroups.com
I'm quite new to tesseract.  I just tried to OCR an image as follows:
 
tesseract LockBits.tif LockBits -l eng
 
The output text was pretty messed up.  I ran tesseract 3.02 on Win7.
 
I then ran an online OCR service on the same image and got a perfect result.
 
Could someone please give me some hints on how to improve OCR results with tesseract?
 
Attached is an image file that I used.
 
Thanks.
LockBits_0_0.tif

Simon Eigeldinger

Oct 24, 2014, 4:53:46 PM
to tesser...@googlegroups.com
hi,

I tested this with tesseract 3.04 on Windows.
I also recently tried a printed page, which gave better results with nearly no
errors.


here's the command and the output of tesseract:


$ tesseract c:\LockBits_0_0.tif c:\LockBits_0_0 -l eng

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
cygwin warning:
MS-DOS style path detected: c:\LockBits_0_0.txt
Preferred POSIX equivalent is: /LockBits_0_0.txt
CYGWIN environment variable option "nodosfilewarning" turns off this
warning.
Consult the user's guide for more details about POSIX paths:
http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

It seems tesseract is just a little unhappy about some TIFF issues.
Cygwin also complained about the way I typed the paths, but that doesn't
affect the result, so we can ignore the cygwin warning.

This is the result I got:

Paramems

rear [In]
Type mnsl Red‘

Pointer to a rectangle that speotfies the pomon of the hnmap to he looked.

flags finl
Type um

Set offlzgs that specify whether the |od<2d pomon of the hnmap rs
avarlahle for reading or lor wrmllg
and whether the caller has already allouted a butter. lndlvlduzl llags
are defined in the lmagelucande
enumeration.

flmntn [ln]
Type Pixeanlmat

integer that specrfies the fumial of the pixel data In the lempurary
bulfer. The plxel tonnat ulthe
temporary hurler does not have in he the same as the plxel tonnat ulthrs
nitmap ubjen. The
pixelronnat data type and constants that represent various pixel lormats
are defined in
adiplusprxellormatsh. For more mtonnahon about pixel furmal cnns'ams see
Image Pixel Forrnat
Constants CD“ version 1.0 does run support processing of
lésbityperrchannel images so yuu should
run set this parameter equal to PixelFormaMBbppRGB, PlxelFormachppARGfi, ur
PlxeanmlathppPARGB.

lackedElfmapDaf-I fin. out]
Type BixmapData‘

Pointer to a BilmapData object If the lrnageLockModeUserlnputam flag
ulthe flagx parameter is cleared
then IodredBltmupDam serves only as an output parameter. in that use.
the 5am data rnemherotthe
aiunapneta otnect reoerves a porrner to a temporary putter, which is
filled with the values otthe
requested plxels. The other data members at the aiunapueta otnect
rederve attrihutes (wldlh. height
lormat and stride) of the plxel data m the temporary tamer. If the plxel
data is stored hunamup. the
snide data memtrer is negative. it the pixel data is stored mpsdown, the
snide data memtrer is positive.
If the lmageLodtModeUserlnputaul flag at the flags parameter is set then
lodredliixmapbam serves as an
--
Simon Eigeldinger
Follow me on Twitter: http://www.twitter.com/domasofan/
E-Mail: simon.ei...@vol.at
MSN: simon_ei...@hotmail.com
ICQ: 121823966
Jabber: doma...@andrelouis.com


Robert Melton

Oct 24, 2014, 5:20:27 PM
to tesser...@googlegroups.com
Is that tiny file the actual image you are running OCR on? If so,
scale up the image; my guess is that the results will improve greatly.
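
For readers who want to try the scaling suggestion from a script, here is a minimal sketch in Python. It assumes the third-party Pillow library is installed; the function names `scaled_size` and `upscale_image` and the default factor of 2 are illustrative choices, not anything tesseract itself provides.

```python
def scaled_size(width, height, factor=2.0):
    """Return the (width, height) after scaling by `factor`, in whole pixels."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Never let a dimension collapse to zero pixels.
    return (max(1, round(width * factor)), max(1, round(height * factor)))


def upscale_image(src_path, dst_path, factor=2.0):
    """Write an enlarged copy of src_path to dst_path using Pillow."""
    # Imported here so scaled_size() can be used without Pillow installed.
    from PIL import Image

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, factor)
        # Pillow falls back to nearest-neighbour resampling for bilevel ("1")
        # and palette ("P") images, so convert those to grayscale first.
        if img.mode in ("1", "P"):
            img = img.convert("L")
        img.resize(new_size, Image.LANCZOS).save(dst_path)
```

Calling `upscale_image("LockBits.tif", "LockBits_2x.tif")` and then running `tesseract LockBits_2x.tif LockBits -l eng` would repeat the original command on the enlarged copy. The output file name here is only an example.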



--
Robert Melton | http://robertmelton.com

Simon Eigeldinger

Oct 24, 2014, 5:27:59 PM
to tesser...@googlegroups.com
hi,

Is there a guideline for what to do with poor-quality pictures?
I am blind, so I have no clue what sighted people do with those. *smile*
It seems tesseract can't do much about picture quality on its own.
Maybe ImageMagick would be a good choice for fixing things?

greetings,
simon

Robert Melton

Oct 24, 2014, 5:56:36 PM
to tesser...@googlegroups.com
There are actually imagemagick scripts pre-baked for doing text
clean-up. Google for imagemagick and textcleaner.



--
Robert Melton | http://robertmelton.com

BDristan

Oct 25, 2014, 10:28:57 AM
to tesser...@googlegroups.com, rob...@robertmelton.com
I resized the image to double its original size and got results that were 100% correct.  Thanks for the tip.

However, I'm wondering how I could automate the process.  That is, without manually viewing a given image, how can I pre-process it (including resizing) so that it is suitable for OCR?  I don't think I can blindly enlarge every image, because some of them may already be large enough.

I've checked some on-line OCR services (including the ones that use tesseract) and they seem to be doing an excellent job.  So, somehow they are 'smart' enough to know what to do with input images.

I'd appreciate any pointers.
Thanks.
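
One possible way to approach the automation question is to use the resolution stored in the image file and enlarge only when it falls below a target. The sketch below is a heuristic, not what any online service is known to do. It assumes Pillow is installed, uses the commonly cited rule of thumb that tesseract works best at roughly 300 DPI, and guesses 96 DPI (a typical screen resolution) when the file carries no usable metadata. The names `choose_scale` and `read_dpi` and the cap of 4x are hypothetical.

```python
def choose_scale(dpi, target_dpi=300.0, assumed_dpi=96.0, max_scale=4.0):
    """Pick a resize factor that brings an image up to roughly target_dpi.

    dpi is the resolution from the file's metadata, or None if it is missing.
    Returns 1.0 when the image is already at or above target_dpi.
    """
    # Missing or nonsensical metadata: fall back to a screen-like guess.
    effective = dpi if dpi and dpi > 0 else assumed_dpi
    if effective >= target_dpi:
        return 1.0
    # Cap the factor so very low DPI values do not produce huge images.
    return min(target_dpi / effective, max_scale)


def read_dpi(path):
    """Return the horizontal DPI stored in the image file, or None."""
    # Imported here so choose_scale() needs no third-party code.
    from PIL import Image

    with Image.open(path) as img:
        dpi = img.info.get("dpi")
    return float(dpi[0]) if dpi else None
```

A batch script could call `choose_scale(read_dpi(path))` for each file and resize only when the result is greater than 1.0. DPI metadata is often absent or wrong, especially for screenshots, so a check based on the actual height of the text in pixels would likely be more reliable; that is not shown here.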