hOCR to PDF with Python

Jonathan Brinley

Apr 6, 2009, 9:34:11 AM
to ocropus
Building off of Florian Hackenberger's Java-based converter
(http://groups.google.com/group/ocropus/browse_thread/thread/3cf464bda5807952),
I've built a small Python script to convert hOCR documents to PDF. See
http://xplus3.net/2009/04/02/convert-hocr-to-pdf/#more-207
for info and to download.

It can either be called from the command line:

$ python HocrConverter.py myHocrFile.html myImageFile.png output.pdf

or imported into a Python script:

from HocrConverter import HocrConverter
hocr = HocrConverter("myHocrFile.html")
hocr.to_text("output.txt")
hocr.to_pdf("myImageFile.png", "output.pdf")

The main differences between this script and Mr. Hackenberger's
script:
1. This stretches lines of text horizontally to fill the bounding box
2. This requires you to specify an image to use, rather than using the
image indicated in the hOCR file (in case you want to use a different
resolution image for the PDF)
3. This can output either PDF or plain text
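
To illustrate points 1 and 2, here is a rough sketch of the arithmetic involved. It is not taken from HocrConverter.py; the function names (parse_bbox, px_to_pt, horizontal_scale) are hypothetical and the sample numbers are made up. It assumes the usual hOCR convention of a "bbox x0 y0 x1 y1" entry in an element's title attribute, with coordinates in image pixels, and that the rendered width of the text at its normal scale is already known in the same units as the box.

```python
import re

# hOCR stores geometry in the title attribute, e.g. 'bbox 100 200 500 230'.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) in pixels from an hOCR title, or None."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def px_to_pt(pixels, dpi):
    """Convert image pixels to PDF points (72 points per inch)."""
    return pixels * 72.0 / dpi


def horizontal_scale(natural_width, box_width):
    """Percentage to stretch text so it spans box_width.

    Both widths must be in the same units. Falls back to 100
    (no stretching) when either width is not positive.
    """
    if natural_width <= 0 or box_width <= 0:
        return 100.0
    return 100.0 * box_width / natural_width


# Made-up example: a line whose box is 400 px wide in a 300 dpi image,
# and whose text would be 80 pt wide at normal scale.
x0, y0, x1, y1 = parse_bbox("bbox 100 200 500 230")
box_width_pt = px_to_pt(x1 - x0, dpi=300)
scale = horizontal_scale(natural_width=80.0, box_width=box_width_pt)
```

Because the bounding boxes are in pixels of the image that was OCRed, substituting a different-resolution image for the PDF means the dpi used in the conversion has to match the image the coordinates came from, not the one being embedded.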

Please let me know how it works for you. I'd welcome any suggestions
or contributions.

Have a nice day,
Jonathan



--
Jonathan M. Brinley

jonatha...@gmail.com
http://xplus3.net/
