convert hocr to pdf

Cédric

Oct 6, 2014, 5:42:01 PM

to tesser...@googlegroups.com

Hello, I want to OCR an image with a colored background. I have seen that tesseract produces bad results in that case.
As a workaround, I want to convert my image to black and white and run the OCR on that image to produce an hOCR file. After
that, I want to combine the hOCR file with the original image (with the colored background) to get a searchable PDF. To convert the hOCR file
I use hocr2pdf, but I get bad results. The black and white image is "incas1_modif.tif". The resulting hOCR file is incas.hocr. I wanted to merge the hOCR file with
another image, not the "incas2_modif.tif". The result of the merge was poor, so I tried to create a PDF from the hOCR file that contains only the text and no image.
I generated it with:

hocr2pdf -i incas1_modif.tif -s -o incas_test.pdf < incas.hocr

The resulting PDF "incas_test.pdf" is very strange: some text overlaps, in places the font is much bigger than the font in the original picture, and
some text has disappeared. I have found this thread and I assume the output of hocr2pdf is sometimes bad.
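
One way to check what a converter has to work with is to read the word boxes from the hOCR file directly. The sketch below is only an illustration under stated assumptions: the names `WordBoxParser` and `to_pdf_points`, the sample markup, the 300 dpi resolution and the 1000 px page height are all made up for the example and are not taken from incas.hocr. It assumes the usual hOCR layout in which each `ocrx_word` element carries a `bbox x0 y0 x1 y1` entry in its `title` attribute. Whether a resolution mismatch explains the oversized text seen here is not confirmed in this thread; the conversion merely shows how the assumed dpi scales every box.

```python
import re
from html.parser import HTMLParser

# hOCR keeps each word's pixel box in the title attribute, for example
# title="bbox 36 92 96 116; x_wconf 90"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Hypothetical helper: collect (text, (x0, y0, x1, y1)) per ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._box = None    # box of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._box = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Inline tags such as <strong> inside a word are ignored;
        # only the closing </span> ends the word.
        if tag == "span" and self._box is not None:
            self.words.append(("".join(self._text).strip(), self._box))
            self._box = None


def to_pdf_points(box, dpi, page_height_px):
    """Map a pixel box (origin top-left) to PDF points (origin bottom-left)."""
    x0, y0, x1, y1 = box
    scale = 72.0 / dpi  # 72 PDF points per inch
    return (x0 * scale, (page_height_px - y1) * scale,
            x1 * scale, (page_height_px - y0) * scale)


# Made-up sample markup, not taken from incas.hocr.
sample = """<span class='ocr_line' title='bbox 36 92 580 116'>
<span class='ocrx_word' id='word_1' title='bbox 36 92 96 116; x_wconf 90'>Hello</span>
<span class='ocrx_word' id='word_2' title='bbox 109 92 200 116; x_wconf 88'><strong>world</strong></span>
</span>"""

parser = WordBoxParser()
parser.feed(sample)
for text, box in parser.words:
    # dpi and page height are assumptions; use the values of the real scan.
    print(text, box, to_pdf_points(box, dpi=300, page_height_px=1000))
```

Running the same conversion with a different `dpi` value changes the size of every box in the PDF by the same factor, so comparing these numbers with the page size of the image being merged is a cheap sanity check before blaming the converter.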

So my question is: how can I produce a PDF from an hOCR file? So far I have had no success. Alternatively, do you have another idea for getting good results
from a colored image with tesseract?

I'm using tesseract 3.03. I didn't attach the file "incas2_modif.tif" because it was too big.

Thank you!

incas.hocr

incas_test.pdf

Cédric

Oct 14, 2014, 8:41:54 AM

to tesser...@googlegroups.com

I have found two solutions. The first is pdfbeads. At first it didn't work on Arch Linux because there was a bug in the package.
Now the bug is fixed and I can merge an hOCR file with an image. In my opinion, though, the quality of the PDF could be better.

Another option is HocrConverter. There are several versions; I took this one. The original thread is here. It was necessary
to update the script for Python 3. In the end, the quality of the PDF seems better. When you search for a word in the PDF, the word is highlighted with a
box. The position of that box seems slightly less accurate than with pdfbeads, but for me it is good enough.

Perhaps these posts can help somebody.

Regards,

Cédric
