I've hacked up a VERY basic hOCR-to-PDF converter in Java using iText
and Jericho, in case anyone is interested. It reads all tags with bbox
properties and places the contained text into a box on a layer. The
original image is read from the ocr_page tag and added on a layer above
the text.
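
For anyone unfamiliar with hOCR: the bbox property sits in an element's
title attribute as "bbox x0 y0 x1 y1", in image pixels with the origin at
the top left, while PDF coordinates are in points (1/72 inch) with the
origin at the bottom left. The following is only a minimal sketch of that
step, not the code from the converter itself. The class name HocrBBox,
both method names, and the idea of passing the scan resolution in as a
dpi parameter are assumptions made for illustration; it uses plain
java.util.regex rather than Jericho so that it stands alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the posted converter.
final class HocrBBox {
    // Matches "bbox x0 y0 x1 y1" anywhere inside an hOCR title attribute.
    private static final Pattern BBOX =
            Pattern.compile("\\bbbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    /** Returns {x0, y0, x1, y1} in image pixels, or null if there is no bbox. */
    static int[] parse(String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) {
            return null;
        }
        int[] box = new int[4];
        for (int i = 0; i < 4; i++) {
            box[i] = Integer.parseInt(m.group(i + 1));
        }
        return box;
    }

    /**
     * Converts a pixel bbox to PDF points {llx, lly, urx, ury}.
     * Assumption: the caller knows the scan resolution (dpi).
     * The y axis is flipped because PDF measures from the bottom edge.
     */
    static float[] toPdfRect(int[] box, int imageHeightPx, float dpi) {
        float scale = 72f / dpi; // pixels -> points
        return new float[] {
            box[0] * scale,
            (imageHeightPx - box[3]) * scale,
            box[2] * scale,
            (imageHeightPx - box[1]) * scale
        };
    }

    public static void main(String[] args) {
        int[] box = parse("bbox 100 200 500 260; baseline 0 -5");
        float[] rect = toPdfRect(box, 1000, 300f);
        System.out.println(java.util.Arrays.toString(rect));
    }
}
```

The resulting rectangle is what the text would be placed into on the PDF
page; other properties in the same title attribute (such as baseline) are
simply ignored by this sketch.
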
The current shortcomings (to be solved within the next few weeks) are:
* Does not handle multiple pages
* Scaling the fonts to match the bounding boxes is not implemented
* Only uses tags of the ocr_line class that have the bbox property
(to be solved later)
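
On the font scaling item, one possible approach (a sketch only, nothing
of this is implemented yet) relies on the fact that rendered text width
grows linearly with font size: measure the line once at a reference
size, then scale so it fills the box width, capped by the box height.
The class FontFit and its fit method are hypothetical names, and taking
the line height to be roughly equal to the font size is a simplifying
assumption. The measured width would have to come from the PDF
library's font metrics.

```java
// Hypothetical helper illustrating one way to fit text into a bbox.
final class FontFit {
    /**
     * Returns a font size at which text fits a box of the given size.
     *
     * @param widthAtRefSize text width measured at refSize (same unit as boxWidth)
     * @param refSize        font size used for that measurement
     * @param boxWidth       width of the target bounding box
     * @param boxHeight      height of the target bounding box
     */
    static float fit(float widthAtRefSize, float refSize,
                     float boxWidth, float boxHeight) {
        if (widthAtRefSize <= 0f) {
            return boxHeight; // empty text: only the height constrains the size
        }
        // Width scales linearly with font size.
        float byWidth = refSize * boxWidth / widthAtRefSize;
        // Assumption: a line of text is about as tall as its font size.
        float byHeight = boxHeight;
        return Math.min(byWidth, byHeight);
    }

    public static void main(String[] args) {
        // A line measuring 200pt wide at 10pt, to be fitted into a 96pt x 14.4pt box.
        System.out.println(fit(200f, 10f, 96f, 14.4f));
    }
}
```
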
Please tell me if you like it and whether I should package it
properly. Please bear in mind that the file was hacked up in about 5
hours, so don't expect well-structured code. The result is sort of a
proof of concept. Patches are welcome.
The Java file can be found here (it needs the Jericho and iText2
libraries in order to compile):
DI Florian Hackenberger