Hello, I want to OCR an image with a colored background. I have seen that tesseract produces bad results in that case.
As a workaround, I want to convert my image to black and white and run the OCR on that image to produce an hOCR file. After
that I want to combine the hOCR file and the original image (with the colored background) to get a searchable PDF. To convert the hOCR file
I use hocr2pdf, but I get bad results. The black and white image is "incas1_modif.tif". The resulting hOCR file is "incas.hocr". I wanted to merge the hOCR file with
another image, not the "incas2_modif.tif". The result of the merge was poor, so I tried to create a PDF from the hOCR file alone, containing only the text and no image.
I did that with
hocr2pdf -i incas1_modif.tif -s -o incas_test.pdf < incas.hocr
The resulting PDF "incas_test.pdf" is very strange: some text is overlaid, some text is set in a much bigger font than in the original picture, and
some text has disappeared. I have found this thread, and I assume that the output of hocr2pdf is sometimes bad.
So my question is: how can I produce a PDF from an hOCR file? So far I have had no success. Or: do you have another idea for getting good results
from a colored image with tesseract?
I'm using tesseract 3.03. I didn't attach the file "incas2_modif.tif" because it was too big.
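To make the intended workflow concrete, here is a minimal Python sketch of the two steps (OCR on the black and white image, then hocr2pdf with the colored image). The function names build_commands and run_pipeline are made up for illustration. I am assuming that tesseract writes <base>.hocr when given the "hocr" config (as my 3.03 does), that hocr2pdf reads the hOCR data on stdin, and that the colored image has the same pixel dimensions and resolution as the black and white one, since the hOCR boxes are pixel coordinates.

```python
import subprocess


def build_commands(bw_image, color_image, hocr_base, out_pdf):
    """Return the two command lines of the workflow as argument lists."""
    # Step 1: OCR the black-and-white image. With the "hocr" config,
    # tesseract 3.03 is assumed to write <hocr_base>.hocr.
    ocr_cmd = ["tesseract", bw_image, hocr_base, "hocr"]
    # Step 2: hocr2pdf takes the image via -i and the PDF name via -o;
    # the hOCR data itself is fed on stdin (see run_pipeline).
    pdf_cmd = ["hocr2pdf", "-i", color_image, "-o", out_pdf]
    return ocr_cmd, pdf_cmd


def run_pipeline(bw_image, color_image, hocr_base, out_pdf):
    """Run both steps; needs tesseract and hocr2pdf on the PATH."""
    ocr_cmd, pdf_cmd = build_commands(bw_image, color_image, hocr_base, out_pdf)
    subprocess.run(ocr_cmd, check=True)
    # Equivalent of the shell redirection "< <hocr_base>.hocr".
    with open(hocr_base + ".hocr", "rb") as hocr_file:
        subprocess.run(pdf_cmd, stdin=hocr_file, check=True)
```

For my files the call would look like run_pipeline("incas1_modif.tif", "colored.tif", "incas", "out.pdf"), where "colored.tif" and "out.pdf" are placeholder names.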
Thank you!