Any C/C++ hOCR programs?


LM

Aug 30, 2011, 1:05:52 PM
to hOCR
I am just finding out about hOCR. Are there any cross-platform C or C++
tools that will convert hOCR to PDF format? Is there any way to use
the hOCR format in conjunction with a tool like wkhtmltopdf? Thanks.

Bill Janssen

Aug 30, 2011, 11:04:17 PM
to ho...@googlegroups.com
Properly formatted hOCR should "just work" with wkhtmltopdf.

Bill

LM

Aug 31, 2011, 8:11:42 AM
to hOCR
On Aug 30, 11:04 pm, Bill Janssen <bill.jans...@gmail.com> wrote:
> Properly formatted hOCR should "just work" with wkhtmltopdf.

I gave it a try. It didn't come out as I expected. I was attempting to
get PDFs of the scanned documents that keep the original look of the
scanned images but are searchable. I sent a TIFF file through
Tesseract OCR, which supports hOCR output, then took the HTML output
(hOCR) from Tesseract and put it through wkhtmltopdf. I ended up with
a mess of jumbled and garbled words. The words are searchable, but I
lost the original graphics that I scanned in. Am I missing something?
I wanted to do something similar to some of the Google sites that
offer search of their PDFs. The pages in those PDFs look just like the
images in the TIFF files, but they're still searchable. Any
suggestions on better ways to accomplish this? Thanks.

Janusz S. Bień

Aug 31, 2011, 8:28:37 AM
to ho...@googlegroups.com
On Wed, 31 Aug 2011 LM <lme...@gmail.com> wrote:

> On Aug 30, 11:04 pm, Bill Janssen <bill.jans...@gmail.com> wrote:
>> Properly formatted hOCR should "just work" with wkhtmltopdf.
>
> I gave it a try. Didn't come out as I expected. Was attempting to
> get PDFs of the scanned documents, keep the original look of the
> scanned images, but make them searchable.

If you just want to make the documents searchable, try

http://jwilk.net/software/ocrodjvu

In my opinion, for some applications DjVu is much better than PDF. At
least you will keep the original look.

Regards

JSB

--
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Bill Janssen

Aug 31, 2011, 9:40:23 AM
to ho...@googlegroups.com
That's not the way to do it, then :-) wkhtmltopdf just renders the
hOCR as ordinary HTML; it doesn't understand the hOCR layout
annotations, so the text is all re-flowed and uses different fonts.

You'll need to write a tool that puts the page image into the PDF,
then draws "invisible" text on top of that. UpLib, for instance, does
this, if you put a document into an UpLib repository and then pull it
out in PDF format.
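
A minimal C++ sketch of the "invisible text" step, to make the idea concrete. It is not UpLib's code and not a full PDF writer. It assumes each hOCR word carries a title attribute of the form "bbox x0 y0 x1 y1" in image pixels with the origin at the top-left, that the page image resolution (dpi) is known, and that the page defines a font resource named /F1. The names parse_bbox, pdf_escape and invisible_text_ops are hypothetical helpers made up for this illustration. The output is only the content-stream fragment for one word; text rendering mode 3 ("3 Tr") is what makes the text invisible but still searchable.

```cpp
#include <cassert>
#include <cstdio>
#include <sstream>
#include <string>

// Pixel bounding box from an hOCR title attribute, e.g.
// title="bbox 36 92 580 122"; origin is the top-left of the page image.
struct BBox {
    int x0, y0, x1, y1;
};

// Extract "bbox x0 y0 x1 y1" from a title attribute value.
// Returns false if no complete bbox is found.
bool parse_bbox(const std::string& title, BBox& out) {
    const std::string key = "bbox ";
    const std::string::size_type pos = title.find(key);
    if (pos == std::string::npos) return false;
    std::istringstream in(title.substr(pos + key.size()));
    BBox b = {0, 0, 0, 0};
    if (!(in >> b.x0 >> b.y0 >> b.x1 >> b.y1)) return false;
    out = b;
    return true;
}

// Escape the characters that are special inside a PDF literal string.
std::string pdf_escape(const std::string& s) {
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const char c = s[i];
        if (c == '(' || c == ')' || c == '\\') out += '\\';
        out += c;
    }
    return out;
}

// Build PDF content-stream operators that place one word as invisible
// text over the page image. PDF units are points (1/72 inch) with the
// origin at the bottom-left, so the y axis is flipped. The bottom of
// the box is used as an approximate baseline and the box height as an
// approximate font size (both are simplifying assumptions).
std::string invisible_text_ops(const std::string& word, const BBox& b,
                               int page_height_px, double dpi) {
    assert(dpi > 0);
    const double scale = 72.0 / dpi;
    const double x = b.x0 * scale;
    const double y = (page_height_px - b.y1) * scale;
    const double size = (b.y1 - b.y0) * scale;
    char buf[256];
    std::snprintf(buf, sizeof buf,
                  "BT /F1 %.2f Tf 3 Tr 1 0 0 1 %.2f %.2f Tm (",
                  size, x, y);
    return std::string(buf) + pdf_escape(word) + ") Tj ET\n";
}
```

For a 72 dpi page image 792 pixels tall, the word "Hello" with "bbox 36 92 580 122" gives `BT /F1 30.00 Tf 3 Tr 1 0 0 1 36.00 670.00 Tm (Hello) Tj ET`. A real tool would still have to draw the page image first, write the surrounding PDF objects, and handle non-ASCII text, none of which is covered here.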

Bill

LM

Aug 31, 2011, 11:59:39 AM
to hOCR
Will take a look at the UpLib code then. If anyone runs across any
other libraries or applications in C or C++, please let me know.

Thanks.