Wrapper script for PDF files

Christian Mahnke

Jun 3, 2008, 8:05:50 AM

to ocropus

Hello,
I've attached (and uploaded) a small wrapper script for 'ocroscript' that
can extract images from PDF files.
It uses xpdf [1] and imagemagick [2] to extract and convert the pages
from the PDF file to JPEG images.
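
The three steps described above (extract, convert, recognise) can be
sketched as follows. This is not the attached script, only an
illustration in Python of the commands such a wrapper would run. The
function names are made up, and the 'recognize' argument to
'ocroscript' is an assumption about how it is invoked.

```python
import os


def extraction_command(pdf_path, image_root, pdfimages="pdfimages"):
    """argv for xpdf's pdfimages, which writes files named
    <image_root>-NNN.ppm / .pbm."""
    return [pdfimages, pdf_path, image_root]


def conversion_command(image_path, convert="convert"):
    """argv for ImageMagick's convert, plus the JPEG path it will write."""
    jpeg_path = os.path.splitext(image_path)[0] + ".jpg"
    return [convert, image_path, jpeg_path], jpeg_path


def recognition_command(jpeg_path, ocroscript="ocroscript"):
    """argv for the OCR step; hOCR output is expected on stdout."""
    # Assumption: 'recognize' is the ocroscript command used here.
    return [ocroscript, "recognize", jpeg_path]


if __name__ == "__main__":
    # Print the commands for one hypothetical page instead of running them.
    print(extraction_command("input.pdf", "/tmp/ocropdf/page"))
    cmd, jpeg = conversion_command("/tmp/ocropdf/page-000.ppm")
    print(cmd)
    print(recognition_command(jpeg))
```

Each command is an argument list rather than a single string, so it
could be passed directly to subprocess.run() without going through a
shell.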

Usage:
ocropdf input.pdf > hocr-output.html

The following environment variables are recognised:
- PDFIMAGES: Path to 'pdfimages' if it's not in your path
- CONVERT: Path to 'convert' if it's not in your path
- OCROSCRIPT: Path to 'ocroscript' if it's not in your path or this
script is not placed in the ocropus source tree (in the 'ocrocmd'
directory)
- tesslanguage: The language tesseract should use.
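
The override pattern behind these variables can be sketched like this:
use the value from the environment if it is set, otherwise fall back to
the plain command name and let the path lookup find it. This is an
illustration in Python, not the attached script. The helper names are
hypothetical, and it assumes CONVERT refers to ImageMagick's 'convert'.

```python
import os


def resolve_tool(env_var, default_name, env=None):
    """Return the override stored in env_var, or default_name if unset/empty."""
    env = os.environ if env is None else env
    override = env.get(env_var, "").strip()
    return override or default_name


def resolve_all(env=None):
    """Resolve the three external tools the wrapper depends on."""
    return {
        "pdfimages": resolve_tool("PDFIMAGES", "pdfimages", env),
        "convert": resolve_tool("CONVERT", "convert", env),
        "ocroscript": resolve_tool("OCROSCRIPT", "ocroscript", env),
    }


if __name__ == "__main__":
    for name, command in resolve_all().items():
        print(name, "->", command)
```

'tesslanguage' is left out on purpose: it is not a path, and it would
simply be inherited from the environment by the child process.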

Known problems:
- Doesn't work with file names containing spaces.
- Only works with a single PDF file.
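
A common cause of the first problem in wrapper scripts is that a file
name is pasted into a command line without quoting, so the shell splits
it at the space. Whether that is the cause in this script is an
assumption. The short Python sketch below shows the effect, using
shlex to mimic shell word splitting (the function names are
hypothetical).

```python
import shlex


def naive_command_line(pdf_path):
    """Builds the command line by plain concatenation (breaks on spaces)."""
    return "pdfimages " + pdf_path + " page"


def quoted_command_line(pdf_path):
    """Same command line, with the file name quoted for the shell."""
    return "pdfimages " + shlex.quote(pdf_path) + " page"


if __name__ == "__main__":
    name = "my scan.pdf"
    # The unquoted form splits the file name into two separate arguments.
    print(shlex.split(naive_command_line(name)))
    # The quoted form keeps it as a single argument.
    print(shlex.split(quoted_command_line(name)))
```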

Possible improvements:
- Reimplement it as a Lua script.
- Use this approach (imagemagick) to be able to recognise TIFF and other
file formats.

If you have installed ocropus (with 'jam install'), it should work from
any location. Otherwise, place it in the 'ocropus/ocrocmd' directory.

Cheers,
Christian

[1] http://www.foolabs.com/xpdf/
[2] http://www.imagemagick.org/script/index.php