Introducing hOCR-Workflow Tools

George Chriss

Oct 26, 2013, 11:44:01 AM
to hOCR Google Group
Hi all,

I'm happy to be able to report on a new pair of Inkscape extensions related to hOCR, both of which are documented and available on Gitorious:
 https://gitorious.org/hocr-workflow

The first extension, inkscape-hocr / 'Export Image Overlay Text as hOCR', makes it possible to mark up images with hOCR in Inkscape. This matters most for documents where accuracy and linguistic precision are more important than fast interpretation (i.e., machine recognition).

The second extension, inkscape-hocrPDF / 'Create Multi-Page PDF from hOCR HTML Directory', uses ReportLab to generate multipage, Unicode-friendly PDFs with a hidden, machine-searchable text layer from source JPEG and hOCR HTML files. This extension is still in draft form but works well once a few hardcoded values have been edited (see documentation).

Testing is being done on a GNU/Linux system; Mac OS X 'should' work[1], and Windows support is largely untested. inkscape-hocrPDF can handle ~120 8.5x11" pages at 300 DPI and a JPEG quality level of 85% on a system with 2 GB of memory, and more with additional memory. Note also that inkscape-hocrPDF is not limited to hOCR files produced by inkscape-hocr.
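
For anyone wondering what placing a hidden text layer involves, here is a rough sketch of the coordinate handling. It is not the extension's actual code. It assumes word-level hOCR markup (`ocrx_word` spans carrying a `bbox x0 y0 x1 y1` entry in their `title` attribute), a 300 DPI page image, and a PDF canvas whose origin is at the bottom-left, as ReportLab's is by default. The names `HocrWordParser` and `bbox_to_pdf_points` are made up for illustration, and only the Python standard library is used.

```python
import re
from html.parser import HTMLParser

# hOCR stores bounding boxes in the title attribute, e.g. "bbox 300 600 900 750; x_wconf 95"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) pairs from <span class="ocrx_word"> elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes no <span> is nested inside a word span.
        if self._bbox is not None and tag == "span":
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def bbox_to_pdf_points(bbox, page_height_px, dpi=300):
    """Convert a pixel bbox (top-left origin) to (x, y, width, height) in
    PDF points (1/72 inch) with a bottom-left origin."""
    x0, y0, x1, y1 = bbox
    x = x0 * 72.0 / dpi
    y = (page_height_px - y1) * 72.0 / dpi   # flip the vertical axis
    width = (x1 - x0) * 72.0 / dpi
    height = (y1 - y0) * 72.0 / dpi
    return (x, y, width, height)


if __name__ == "__main__":
    sample = "<span class='ocrx_word' title='bbox 300 600 900 750'>Hello</span>"
    parser = HocrWordParser()
    parser.feed(sample)
    # 11 inches at 300 DPI gives a page height of 3300 pixels.
    for text, bbox in parser.words:
        print(text, bbox_to_pdf_points(bbox, page_height_px=3300))
```

Each resulting rectangle is where a PDF library would draw the word as invisible text over the page image, which is what makes the scanned page searchable.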

Comments, questions, and patches are welcome.

Sincerely,
George


[1] The following patch fixes the "The fantastic lxml wrapper for libxml2 is required" error message on OS X Lion:
 https://launchpadlibrarian.net/88914942/819209-python-extensions-lion.diff