using tesseract hocr output to create a searchable PDF

Carlos

Nov 29, 2011, 4:42:00 PM
to tesseract-ocr
Tesseract 3.01
hocr2pdf 0.8.5

My project has been using Tesseract to OCR documents for some time and
we are really happy with the results.

We have recently been asked to offer the documents in our system as
searchable PDFs.

My initial attempt was to create a searchable PDF from the hOCR
output generated by Tesseract, using hocr2pdf
(http://www.exactcode.de/site/open_source/exactimage/hocr2pdf/).

The placement of the text in the resulting PDF has some strange
quirks: words overlaying one another, words with oversized fonts,
strange line breaks, etc. The problems are severe enough that our
current results are not a viable solution.

I don't know very much about the hOCR format; however, the overlaid
words don't seem to be caused by Tesseract's hOCR output. I have
verified a number of times that words overlaid in the searchable PDF
have bbox coordinates in the hOCR file that do not overlap at all.
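
The manual check described above could be automated along these lines.
This is only a minimal sketch: it assumes word elements carry the class
ocr_word or ocrx_word and a title attribute containing
"bbox x0 y0 x1 y1", and the sample markup and the names word_boxes and
overlapping_pairs are made up for illustration.

```python
import re
from html.parser import HTMLParser
from itertools import combinations

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (id, (x0, y0, x1, y1)) for every hOCR word element."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        # Only word-level elements that actually declare a bbox are kept.
        if match and ("ocr_word" in classes or "ocrx_word" in classes):
            box = tuple(int(v) for v in match.groups())
            self.words.append((attrs.get("id"), box))


def word_boxes(hocr_text):
    """Return the list of (id, bbox) pairs found in an hOCR string."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    return parser.words


def overlapping_pairs(words):
    """Return id pairs whose rectangles intersect with non-zero area."""
    pairs = []
    for (id_a, a), (id_b, b) in combinations(words, 2):
        if a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]:
            pairs.append((id_a, id_b))
    return pairs


# Hypothetical sample; real hOCR files are full HTML documents.
SAMPLE = """
<span class='ocr_word' id='word_1' title="bbox 10 10 60 30">Hello</span>
<span class='ocr_word' id='word_2' title="bbox 70 10 130 30">world</span>
<span class='ocr_word' id='word_3' title="bbox 120 12 180 32">again</span>
"""

print(overlapping_pairs(word_boxes(SAMPLE)))  # [('word_2', 'word_3')]
```

An empty result for a real file would support the suspicion that the
overlap is introduced when the PDF is generated rather than by the OCR
step.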

- Does anyone have experience generating searchable PDFs using
Tesseract output?
- Does anyone know of a simple way to visually inspect the placement
of the words specified by the hOCR output, for instance by creating a
TIFF from the hOCR output? I would like to confirm that the Tesseract
hOCR output is positioning the words correctly.
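
One possible way to do the visual inspection asked about above, sketched
with the Python standard library only. Note that it writes an SVG rather
than a TIFF, so that no imaging library is needed; the SVG can be opened
in a browser next to the scanned image. The class HocrWords, the function
hocr_to_svg and the sample markup are hypothetical, and the sample only
approximates what Tesseract emits.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect the page bbox and a [bbox, text] entry for each word."""

    def __init__(self):
        super().__init__()
        self.page = None   # bbox of the ocr_page element, if present
        self.words = []    # [bbox, text] entries
        self._depth = 0    # greater than 0 while inside a word element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        box = tuple(map(int, match.groups())) if match else None
        if self._depth:
            self._depth += 1          # nested tag inside a word
        elif box and ("ocr_word" in classes or "ocrx_word" in classes):
            self.words.append([box, ""])
            self._depth = 1
        elif box and "ocr_page" in classes:
            self.page = box

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.words[-1][1] += data


def hocr_to_svg(hocr_text):
    """Draw each word's bbox as a red outline with its text inside."""
    parser = HocrWords()
    parser.feed(hocr_text)
    if parser.page:
        width, height = parser.page[2], parser.page[3]
    else:  # fall back to the extent of the words themselves
        width = max((b[2] for b, _ in parser.words), default=0)
        height = max((b[3] for b, _ in parser.words), default=0)
    out = ['<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d">'
           % (width, height)]
    for (x0, y0, x1, y1), text in parser.words:
        # hOCR and SVG both put the origin at the top-left corner.
        out.append('<rect x="%d" y="%d" width="%d" height="%d" '
                   'fill="none" stroke="red"/>' % (x0, y0, x1 - x0, y1 - y0))
        out.append('<text x="%d" y="%d" font-size="%d">%s</text>'
                   % (x0, y1, y1 - y0, escape(text.strip())))
    out.append("</svg>")
    return "\n".join(out)


# Hypothetical sample page with two words.
SAMPLE = """
<div class='ocr_page' id='page_1' title='image "sample.tif"; bbox 0 0 200 100'>
<span class='ocr_word' id='word_1' title="bbox 10 10 60 30"><span class='ocrx_word' id='xword_1' title="x_wconf -2">Hello</span></span>
<span class='ocr_word' id='word_2' title="bbox 70 10 130 30"><span class='ocrx_word' id='xword_2' title="x_wconf -2">world</span></span>
</div>
"""

svg = hocr_to_svg(SAMPLE)
```

Writing the returned string to a .svg file is left to the caller. The
text is placed with its baseline on the bottom edge of the box and a
font size equal to the box height, which is only a rough approximation
of how the word sat on the page.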

Sorry if this issue doesn't relate exclusively to Tesseract; at
this point I am not certain what the cause of the problem is.

Carlos

zdenko podobny

Nov 30, 2011, 4:15:45 AM
to tesser...@googlegroups.com
Just a remark: in 2008 Mihail Radu Solcan posted two articles [1], [2] about adding text to DjVu files. I am not sure whether such possibilities/tools exist for PDF. In any case, he used the box file for this task (hOCR was not available).

You did not specify a language, but if Python is an option, have a look at OCRFeeder: it should be able to produce [3], with reportlab...



Carlos

Dec 2, 2011, 2:52:11 PM
to tesseract-ocr
zdenko,

Thanks for the reply.

> You did not specify a language, but in case of Python

I am pretty agnostic about language as long as it can run via the CLI
on Linux; the OCR process runs on the backend.

In case anyone else runs across this:

I am an OCR noob, so the past few days have been pretty enlightening.
I have run across a number of other options to marry hOCR with an image
to generate searchable PDFs. Unfortunately, hocr2pdf is one of the
most prominent ones. It shows up pretty high in a lot of searches and
is mentioned in various forums, blogs, etc. I have found that hocr2pdf
generates fairly unusable searchable PDFs: the searchable text is
interleaved and really out of position.

Luckily, there are a number of other options in various languages.
The first OSS tool I found that generates very usable searchable
PDFs from Tesseract hOCR files is pdfbeads, a Ruby gem. It has
worked well with a diverse sample of documents.

At this time my primary concern with pdfbeads is that it is a pretty
niche library and it encapsulates all of the logic to generate the PDF
file. pdfbeads doesn't rely on other, more heavily used/vetted/current
PDF generation libs. It would have been a little more comforting if
pdfbeads concentrated on parsing the hOCR files and added the text
layer via another lib ... assuming that is possible.

If this holds up I suspect that we are going to slot this into our OCR
process.

Carlos

Lahiru Himash Madusanka

Sep 28, 2012, 12:10:13 AM
to tesser...@googlegroups.com

I'm using the Quick PDF Library in my app to create PDFs.

Jeffrey Ratcliffe

Sep 28, 2012, 3:01:31 PM
to tesser...@googlegroups.com
On 27 September 2012 22:17, Guido <guidos...@googlemail.com> wrote:
> Did you find any other solution than using tesseract and pdfbeads? What are
> your experiences so far?

If you are using Linux, try gscan2pdf.

Regards

Jeff