Hi,
I need to build tesseract-ocr (https://code.google.com/p/tesseract-ocr/) from source in order to OCR some PDF files. Many people use ImageMagick's "convert" to first convert a PDF to a TIFF, then use Tesseract to OCR the TIFF into a text file.
Since Tesseract depends on the Leptonica Image Processing Library (http://leptonica.com/), I had to build that from source as well. Our OS distro is the old RHEL 6.2. In our computing environment, most utilities/tools are not installed at the typical locations (/usr/bin, /usr/local, etc.). According to one of the README files, I don't need the JPEG & PNG headers/libs unless I need to write to a PDF, so I did not pull them in (from our non-standard locations) while building Leptonica.
When I fired off Tesseract as in
/path/to/somewhere/install/tesseract-ocr_3.02.02/bin/tesseract t.tiff output
I got the following error messages:
Tesseract Open Source OCR Engine v3.02.03 With Leptonica
Error in findTiffCompression: function not present
Error in pixReadStreamTiff: function not present
Error in pixReadStream: tiff: no pix returned
Error in pixRead: pix not read
Unsupported image type
I am puzzled, since the two supposedly missing functions are present in the shared library according to my investigation below.
From ldd of the Tesseract ELF binary:
% ldd /path/to/somewhere/install/tesseract-ocr_3.02.02/bin/tesseract
...
liblept.so.4 => /path/to/somewhere/install/Leptonica_1.71/lib/liblept.so.4 (0x00007fb10b4e6000)
And also the LD_LIBRARY_PATH setting (I know LD_LIBRARY_PATH is frowned upon, but I only used it here as a temporary hack):
% echo $LD_LIBRARY_PATH
/path/to/somewhere/install/tesseract-ocr_3.02.02/lib:/path/to/somewhere/install/Leptonica_1.71/lib
The two functions that appeared in the error output above, namely findTiffCompression & pixReadStreamTiff, DO EXIST in the shared lib:
% nm -D /path/to/somewhere/install/Leptonica_1.71/lib/liblept.so.4 | grep findTiffCompression
00000000001a0140 T findTiffCompression
% nm -D /path/to/somewhere/install/Leptonica_1.71/lib/liblept.so.4 | grep pixReadStreamTiff
00000000001a03e0 T pixReadStreamTiff
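One further check that might be relevant is sketched below. As far as I understand, when Leptonica is configured without libtiff it compiles stub versions of its TIFF functions that keep the same names and only report "function not present", so nm listing the symbols would not prove that TIFF support was actually built in. The sketch looks for libtiff in the ldd output of liblept. The helper name has_libtiff and the LEPT_LIB variable are my own placeholders, and a statically linked libtiff would not show up in ldd, so a negative result is only a hint.

```shell
#!/bin/sh
# Hypothetical helper (not part of Leptonica or Tesseract):
# succeeds if the ldd output read from stdin mentions libtiff.
has_libtiff() {
    grep -q 'libtiff'
}

# Placeholder path, same as in the ldd output above; override via LEPT_LIB.
LEPT_LIB="${LEPT_LIB:-/path/to/somewhere/install/Leptonica_1.71/lib/liblept.so.4}"

if [ -f "$LEPT_LIB" ]; then
    if ldd "$LEPT_LIB" | has_libtiff; then
        echo "liblept appears to be linked against libtiff"
    else
        echo "liblept does not appear to be linked against libtiff"
    fi
else
    echo "library not found: $LEPT_LIB"
fi
```

If libtiff is not listed, the next thing to look at would presumably be the Leptonica configure output, to see whether it found the TIFF headers and libs in our non-standard locations.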
What am I missing here?
Thanks for reading.