hOCR to PDF converter

Florian Hackenberger

Dec 16, 2007, 5:58:32 PM

to ocropus

Hi!

I've hacked up a VERY basic hOCR to PDF converter in Java using iText
and jericho if anyone is interested. It reads all tags with bbox
properties and places the contained text into a box on a layer. The
original image is read from the ocr_page tag and added above the text.
The current shortcomings (to be solved within the next few weeks) are:
* Does not handle multiple pages
* Scaling the fonts to match the bounding boxes is not implemented
* Only uses tags of the ocr_line class that have the bbox property
(to be solved later)
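
For readers who want to see what "reading the bbox property and placing text in a box" involves, here is a minimal sketch. It is not taken from HocrToPdf.java: the class name HocrBbox, its methods, and the dpi parameter are all hypothetical. It assumes the usual hOCR convention that the title attribute holds "bbox x0 y0 x1 y1" in pixels with the origin at the top-left, and that the scan resolution is known so pixels can be converted to PDF points (72 per inch).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HocrToPdf.java.
final class HocrBbox {
    // Matches "bbox x0 y0 x1 y1" anywhere inside an hOCR title attribute.
    private static final Pattern BBOX =
        Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    /** Returns {x0, y0, x1, y1} in pixels, or null if the title has no bbox. */
    static int[] parse(String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) return null;
        int[] box = new int[4];
        for (int i = 0; i < 4; i++) box[i] = Integer.parseInt(m.group(i + 1));
        return box;
    }

    /** Converts a pixel length to PDF points for an assumed scan resolution. */
    static float pxToPt(int px, float dpi) {
        return px * 72f / dpi;
    }

    /**
     * Maps a pixel bbox to {left, bottom, width, height} in points.
     * PDF user space has its origin at the bottom-left, so y is flipped.
     */
    static float[] toPdfRect(int[] box, int pageHeightPx, float dpi) {
        float left = pxToPt(box[0], dpi);
        float bottom = pxToPt(pageHeightPx - box[3], dpi);
        float width = pxToPt(box[2] - box[0], dpi);
        float height = pxToPt(box[3] - box[1], dpi);
        return new float[] {left, bottom, width, height};
    }

    public static void main(String[] args) {
        // Example values only: a 3300 px tall page scanned at 300 dpi.
        int[] box = parse("bbox 300 600 1500 660");
        float[] r = toPdfRect(box, 3300, 300f);
        System.out.printf("left=%.1f bottom=%.1f width=%.1f height=%.1f%n",
            r[0], r[1], r[2], r[3]);
    }
}
```

The resulting rectangle is what a PDF library would need in order to position the invisible text for one ocr_line element.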

Please tell me if you like it and whether I should package it
properly. Please bear in mind that the file was hacked up in about 5
hours, so don't expect well structured code. The result is sort of a
proof of concept. Patches are welcome.

The java file can be found here (it needs the jericho and iText2
libraries in order to compile):
http://www.acoveo.com/acoveo/files/HocrToPdf.java

Cheers,
Florian Hackenberger

--
DI Florian Hackenberger
flo...@hackenberger.at

Christian Kofler

Dec 17, 2007, 8:53:14 AM

to ocropus

Hi Florian,

thanks for your contribution!

It's a nice idea and can be really useful.
We would be happy to see future versions with
the features you just mentioned!

Cheers,

Christian Kofler

On 16 Dec., 23:58, Florian Hackenberger

> flor...@hackenberger.at

Federico Tarantino

Nov 28, 2012, 5:14:23 AM

to ocr...@googlegroups.com, florian.ha...@gmail.com

Hi,

I think this class is awesome.

At the end of the file there is a TODO: "Scale the text width to fit the OCR bbox".

I implemented this TODO. Replace the TODO line with this:

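// Shrink the font size in 0.1 pt steps until the line fits the bbox width.
// The 1 pt floor is an arbitrary safety bound so the size never reaches
// zero or goes negative when the bbox is extremely narrow.
boolean textScaled = false;
do {
  float lineWidth = defaultFont.getBaseFont().getWidthPoint(line, bboxHeightPt);
  if (lineWidth < bboxWidthPt || bboxHeightPt <= 1f) {
    textScaled = true;
  } else {
    bboxHeightPt -= 0.1f;
  }
} while (!textScaled);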

After that, I suggest replacing this line:

cb.setFontAndSize(defaultFont.getBaseFont(), Math.round(bboxHeightPt));

with this:

cb.setFontAndSize(defaultFont.getBaseFont(), bboxHeightPt);
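
As a side note, the same fit can probably be computed in one step instead of a loop, assuming the text width grows linearly with the font size (which, as far as I know, is how getWidthPoint behaves in iText). The sketch below is only an illustration: FontFit and WidthFn are hypothetical names, and WidthFn stands in for a call like defaultFont.getBaseFont().getWidthPoint(line, size). Unlike the loop above, it also accepts a width exactly equal to the bbox width.

```java
// Hypothetical helper for illustration; not part of the posted class.
final class FontFit {
    // Stand-in for a font-metrics call such as BaseFont.getWidthPoint(text, size).
    interface WidthFn {
        float width(String text, float size);
    }

    /**
     * Returns the largest size not above maxSize at which the text is no
     * wider than maxWidth, assuming width is proportional to font size.
     * Returns 0 for a degenerate box.
     */
    static float fit(WidthFn fn, String text, float maxSize, float maxWidth) {
        if (maxSize <= 0f || maxWidth <= 0f) return 0f;
        float widthAtMax = fn.width(text, maxSize);
        if (widthAtMax <= maxWidth) return maxSize; // already fits
        return maxSize * (maxWidth / widthAtMax);   // scale down proportionally
    }

    public static void main(String[] args) {
        // Toy metric (an assumption for the demo): each character is half an em wide.
        WidthFn toy = (text, size) -> text.length() * 0.5f * size;
        System.out.println(fit(toy, "hello world", 14.4f, 60f));
    }
}
```

With a real font, the lambda would simply delegate to the metrics call, and the result would be passed to cb.setFontAndSize without rounding, as suggested above.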

Ciao! (I'm Italian! :-p)

Federico Tarantino

Jun 5, 2014, 1:29:27 PM

to ocr...@googlegroups.com, florian.ha...@gmail.com

Hi,

I'm attaching the final Java class for hocr2pdf, using jericho and iText.

Ciao!

OcrService.java
