hOCR to PDF converter


Florian Hackenberger

Dec 16, 2007, 5:58:32 PM
to ocropus
Hi!

I've hacked up a VERY basic hOCR to PDF converter in Java using iText
and jericho if anyone is interested. It reads all tags with bbox
properties and places the contained text into a box on a layer. The
original image is read from the ocr_page tag and added above the text.
The current shortcomings (to be solved within the next few weeks) are:
* Does not handle multiple pages
* Scaling the fonts to match the bounding boxes is not implemented
* Only uses tags of the ocr_line class that have the bbox property
(to be solved later)
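
For readers who have not seen hOCR before, here is a minimal sketch of the bbox step described above. It is not taken from HocrToPdf.java: the class name HocrBbox, both method names, and the 300 dpi figure are assumptions made for illustration. hOCR keeps the box in the title attribute as "bbox x0 y0 x1 y1" in pixels, so the values have to be converted to PDF points before text can be placed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HocrBbox {
    // hOCR stores the box in the title attribute as "bbox x0 y0 x1 y1" (pixels).
    private static final Pattern BBOX =
        Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    /** Returns {x0, y0, x1, y1} in pixels, or null if the title has no bbox. */
    public static int[] parseBbox(String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) {
            return null;
        }
        int[] box = new int[4];
        for (int i = 0; i < 4; i++) {
            box[i] = Integer.parseInt(m.group(i + 1));
        }
        return box;
    }

    /** Converts pixels to PDF points (1/72 inch) for an assumed scan resolution. */
    public static float pxToPt(int px, float dpi) {
        return px * 72f / dpi;
    }

    public static void main(String[] args) {
        // Hypothetical ocr_line title; 300 dpi is an assumed scan resolution.
        int[] box = parseBbox("bbox 100 200 700 250");
        float widthPt = pxToPt(box[2] - box[0], 300f);
        float heightPt = pxToPt(box[3] - box[1], 300f);
        System.out.println(widthPt + " x " + heightPt + " pt");
    }
}
```

Note that hOCR coordinates start at the top-left corner of the page image while PDF coordinates start at the bottom-left, so the y values also need to be flipped against the page height when placing text.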

Please tell me if you like it and whether I should package it
properly. Please bear in mind that the file was hacked up in about 5
hours, so don't expect well structured code. The result is sort of a
proof of concept. Patches are welcome.

The java file can be found here (it needs the jericho and iText2
libraries in order to compile):
http://www.acoveo.com/acoveo/files/HocrToPdf.java

Cheers,
Florian Hackenberger

--
DI Florian Hackenberger
flo...@hackenberger.at

Christian Kofler

Dec 17, 2007, 8:53:14 AM
to ocropus
Hi Florian,

thanks for your contribution!

It's a nice idea and can be really useful.
We would be happy to see future versions with
the features you just mentioned!

Cheers,

Christian Kofler

On 16 Dec., 23:58, Florian Hackenberger
> flor...@hackenberger.at

Federico Tarantino

Nov 28, 2012, 5:14:23 AM
to ocr...@googlegroups.com, florian.ha...@gmail.com
Hi,
I found this class awesome.
At the end of the file there is a TODO: "Scale the text width to fit the OCR bbox".
I have implemented it.

Replace the TODO line with this:
boolean textScaled = false;
do {
    float lineWidth = defaultFont.getBaseFont().getWidthPoint(line, bboxHeightPt);
    if (lineWidth < bboxWidthPt) {
        textScaled = true;
    } else {
        bboxHeightPt -= 0.1f;
    }
} while (!textScaled);


After that, I suggest replacing this line:
cb.setFontAndSize(defaultFont.getBaseFont(), Math.round(bboxHeightPt));
with this one, so that the fractional font size is not rounded away:
cb.setFontAndSize(defaultFont.getBaseFont(), bboxHeightPt);
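
A possible alternative to stepping the size down by 0.1 pt, sketched here under an assumption: as far as I know, the width returned by getWidthPoint grows linearly with the font size, so the largest size that fits can be computed in one step. The class FitFontSize and the method fit are hypothetical names, not part of iText or of the original file; widthAtOnePt stands for the value of getWidthPoint(line, 1f).

```java
public class FitFontSize {
    /**
     * Returns the largest font size (in points) that is no taller than the
     * bbox and whose line width does not exceed the bbox width.
     *
     * widthAtOnePt is the width of the line at font size 1pt, for example
     * the result of defaultFont.getBaseFont().getWidthPoint(line, 1f).
     */
    public static float fit(float widthAtOnePt, float bboxWidthPt, float bboxHeightPt) {
        // Guard against empty lines and degenerate boxes: nothing to scale.
        if (widthAtOnePt <= 0f || bboxWidthPt <= 0f) {
            return bboxHeightPt;
        }
        // Size at which the line exactly fills the box width.
        float widestSize = bboxWidthPt / widthAtOnePt;
        return Math.min(bboxHeightPt, widestSize);
    }
}
```

Unlike the loop, this lets the line fill the box width exactly instead of stopping just below it, and the guard avoids looping forever when a bbox has zero width.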

Ciao! (I'm Italian! :-p)

       Federico Tarantino

Federico Tarantino

Jun 5, 2014, 1:29:27 PM
to ocr...@googlegroups.com, florian.ha...@gmail.com
Hi,
I'm attaching the final Java class for hocr2pdf, which uses jericho and iText.

Ciao!
OcrService.java