Tiff support for tesseract 3.02 on Ubuntu 12.04


Michael Lissner

Feb 3, 2013, 4:08:11 PM
to tesser...@googlegroups.com
I have Ubuntu 12.04, which has tesseract 3.02 and leptonica version 1.69.

I've installed these, and also installed libtiff4 using apt-get.

When I try to process a document, I get:

↪ sudo tesseract united_states_v._ups_customhouse_brokerage_inc.tif united_states_v._ups_customhouse_brokerage_inc -l eng
Tesseract Open Source OCR Engine v3.02 with Leptonica
Error in pixReadFromTiffStream: spp not in set {1,3,4}
Error in pixReadStreamTiff: pix not read
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Unsupported image type.


This is baffling to me. I've tried reinstalling leptonica, reinstalling the tiff libraries, and reinstalling tesseract in the hope that they'd support tiffs once reinstalled. So far, nothing is helping.

I was hoping that Ubuntu 12.04 would support everything I needed without having to compile from source, but so far I've had bad luck. Is there a way to make this work?

Thanks,

Mike

zdenko podobny

Feb 3, 2013, 4:16:52 PM
to tesser...@googlegroups.com
Can you send an example of your tif file?

Zdenko


--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
 
---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Mike Lissner

Feb 3, 2013, 4:29:05 PM
to tesser...@googlegroups.com
It's about 300MB, unfortunately, but I generate it programmatically using imagemagick in a way that's worked in the past, so I don't think the tiff file itself is the issue.

If you're willing to download this monster, I'll post it to dropbox. I'd love the help, but I don't think it's the right problem.

zdenko podobny

Feb 3, 2013, 5:00:35 PM
to tesser...@googlegroups.com
Are you able to generate just one page or a small example? Or can you provide the steps you use to create it (so I can create it myself)?
Tiff can be tricky. E.g. libtiff-4 does not work for me...

Zdenko

Mike Lissner

Feb 3, 2013, 5:06:55 PM
to tesser...@googlegroups.com, zdenko podobny
Sure, that's a good idea.

Here's the original PDF: http://courtlistener.com/pdf/2008/05/28/united_states_v._ups_customhouse_brokerage_inc..pdf

If you download that, then run:

convert -depth 4 -density 300 united_states_v._ups_customhouse_brokerage_inc..pdf united_states_v._ups_customhouse_brokerage_inc..pd.tiff

You'll have the same tiff as me, I think. Curious to see what your results are. Thanks for the help.

Mike

zdenko podobny

Feb 3, 2013, 5:08:51 PM
to tesser...@googlegroups.com
BTW: spp means samples per pixel [1]. Are you able to instruct ImageMagick to use 1, 3, or 4?
I also found a report on Stack Overflow [2] which mentions that ImageMagick sets spp to 2, which should be invalid for tiff...
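
One way to confirm the samples-per-pixel value before handing a file to tesseract is to read it from the TIFF directory. This is a minimal sketch, assuming the `tiffinfo` utility from libtiff's command-line tools (the libtiff-tools package on Ubuntu) is installed; `tiff_spp` is a hypothetical helper name, not something from the thread.

```shell
# Hypothetical helper: print the Samples/Pixel line for each page of a TIFF.
# Assumes tiffinfo (shipped with libtiff's command-line tools) is on the PATH.
tiff_spp() {
    tiffinfo "$1" | grep 'Samples/Pixel'
}

# Usage: tiff_spp scan.tif
```

Going by the error message above, Leptonica accepts only 1, 3, or 4 here, so any page reporting 2 would be expected to fail.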


Zdenko

Mike Lissner

Feb 3, 2013, 5:18:32 PM
to tesseract-ocr
OK, we're getting somewhere!

I figured out that the packages in the Ubuntu repo just don't work properly with tiffs, so I recompiled and installed tesseract and leptonica.

So now when I run tesseract -v, I get:

↪ tesseract -v
tesseract 3.02.02
 leptonica-1.69
  libjpeg 8b : libpng 1.2.46 : libtiff 3.9.5 : zlib 1.2.3.4

Whereas previously, I didn't get anything mentioning libtiff.
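
The same check can be scripted. A small sketch, assuming only that a build with TIFF support lists `libtiff` in its version banner as in the output above; `has_libtiff` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if this tesseract build reports libtiff.
# Output is merged with 2>&1 because the version banner may go to stderr.
has_libtiff() {
    tesseract -v 2>&1 | grep -q 'libtiff'
}

# Usage: has_libtiff && echo "libtiff support found"
```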

From there, I ran the convert command from the Stack Overflow post:

convert -depth 4 -density 300 -background white -flatten +matte united_states_v._ups_customhouse_brokerage_inc..pdf united_states_v._ups_customhouse_brokerage_inc2.tiff

The resulting file worked well with tesseract, but it only had the last page of the PDF...so it's close -- very close -- but not quite there yet.

Mike Lissner

Feb 3, 2013, 5:30:19 PM
to tesseract-ocr
Looks like I'm all set.

I had to remove -flatten from the command above, and all is working now.
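
For reference, that leaves the command above minus `-flatten`. Here it is as a sketch wrapped in a hypothetical `pdf_to_tiff` function, assuming ImageMagick's `convert` is installed. `-flatten` composites all pages into a single image, which is presumably why only one page came out, while `+matte` drops the alpha channel.

```shell
# Hypothetical wrapper around the command that worked in this thread:
# the earlier convert call with -flatten removed, so every PDF page is kept.
pdf_to_tiff() {
    convert -depth 4 -density 300 -background white +matte "$1" "$2"
}

# Usage: pdf_to_tiff input.pdf output.tiff
```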

Thanks so much for the help.

TP

Feb 4, 2013, 2:33:50 AM
to tesser...@googlegroups.com
On Sun, Feb 3, 2013 at 1:08 PM, Michael Lissner
<mlis...@michaeljaylissner.com> wrote:
> I have Ubuntu 12.04, which has tesseract 3.02 and leptonica version 1.69.
>
> I've installed these, and also installed libtiff4 using apt-get.

libtiff4 is also known as "bigtiff". [1] lists "important backward
incompatible changes in the public API" so I'm not convinced leptonica
can be built with libtiff4 (though I admit I've never tried to do so)
since leptonica is based on the libtiff 3.9.x series.

[1] http://www.remotesensing.org/libtiff/v4.0.0.html

Greg Dunkel

Feb 4, 2013, 10:07:04 AM
to tesser...@googlegroups.com
I just scanned approximately 200 pages in Ubuntu 12.10 with no problems, using the 3.02 package from the repository. I had to use convert to improve the tiffs from my scanner, but I got very good results, with a very low error rate. Did nothing special.

/greg

