Usage:
ocropdf input.pdf > hocr-output.html
The following environment variables are recognised:
- PDFIMAGES: Path to 'pdfimages' if it's not in your path
- CONVERT: Path to 'convert' if it's not in your path
- OCROSCRIPT: Path to 'ocroscript' if it's not in your path and this
script is not placed in the ocropus source tree (in the 'ocrocmd'
directory)
- tesslanguage: The language tesseract should use.
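As a rough illustration of how these variables can be set for a single run, here is a small wrapper function. The function name 'ocropdf_custom', the two paths and the language value 'deu' are placeholders made up for this example; substitute whatever matches your installation.

```shell
# Hypothetical wrapper: the paths and the language code below are example
# values only, not defaults of ocropdf.
ocropdf_custom() {
    # Variables written in front of a command apply to that invocation.
    PDFIMAGES=/opt/xpdf/bin/pdfimages \
    CONVERT=/opt/imagemagick/bin/convert \
    tesslanguage=deu \
    ocropdf "$1"
}
```

It would be called the same way as the script itself, for example: ocropdf_custom input.pdf > hocr-output.html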
Known problems:
- Doesn't work with file names containing spaces.
- Only works with a single PDF file.
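Until these are fixed, both problems can probably be sidestepped from the outside. The sketch below is not part of ocropdf; the function name 'ocr_each_pdf' is invented for this example. It assumes 'ocropdf' is in your path and that your temporary directory itself has no spaces in its name. Each PDF is copied to a space-free temporary name, processed on its own, and the hOCR output is written next to the original file.

```shell
# Hypothetical workaround, not part of ocropdf: process several PDFs one at
# a time, each through a temporary copy whose name contains no spaces.
ocr_each_pdf() {
    for pdf in "$@"; do
        tmpdir=$(mktemp -d) || return 1
        cp -- "$pdf" "$tmpdir/input.pdf" || { rm -rf "$tmpdir"; return 1; }
        # Write the hOCR output next to the original file (foo.pdf -> foo.html).
        ocropdf "$tmpdir/input.pdf" > "${pdf%.pdf}.html"
        status=$?
        rm -rf "$tmpdir"
        # Stop at the first file that fails.
        [ "$status" -eq 0 ] || return "$status"
    done
}
```

Example call: ocr_each_pdf "my scan.pdf" other.pdf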
Possible improvements:
- Reimplement it as a Lua script.
- Use the same approach (ImageMagick) to recognise TIFF and other
file formats.
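The second idea could start from something like the helper below. This is only a sketch of the conversion step and is not taken from ocropdf; the function name 'to_png' is invented here. It honours the CONVERT variable described above and falls back to plain 'convert'.

```shell
# Hypothetical helper: turn any image ImageMagick can read into a PNG with
# the same base name (scan.tiff -> scan.png).
to_png() {
    "${CONVERT:-convert}" "$1" "${1%.*}.png"
}
```

Note that for a multi-page TIFF, ImageMagick writes one numbered PNG per page (scan-0.png, scan-1.png, ...), so a real implementation would have to loop over those files.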
If you have installed ocropus (with 'jam install'), the script should work
from any location; otherwise, place it in the 'ocropus/ocrocmd' directory.
Cheers,
Christian
[1] http://www.foolabs.com/xpdf/
[2] http://www.imagemagick.org/script/index.php