Hello,
I have been trying to make PDFs searchable using OCRmyPDF and Tesseract, but despite following the recommended steps, I have not been able to get a searchable output.
Here is a summary of the issues I have faced:
1. Initially, I tried running OCRmyPDF on a PDF document (created by exporting a PNG image to PDF via GIMP) using the command `ocrmypdf -l eng OCR_test_eng.pdf outputOCR.pdf`. The process completed without errors, but the output PDF was not searchable.
2. I then updated my Tesseract to version 5.3.1+git6228-24da4c71-1ppa1~jammy1, hoping it might resolve the problem. However, the issue persisted.
3. I also attempted using the `--force-ocr` option with OCRmyPDF, but the output PDF remained unsearchable. Interestingly, for a scanned PDF document, OCRmyPDF indicated that the document already had text, even though it was not searchable.
4. To rule out problems with OCRmyPDF, I tried using pdfsandwich for OCR. However, it reported that Tesseract was unable to produce a PDF output file, suggesting that the problem might be with Tesseract itself.
5. I am running these commands on a Linux system (Ubuntu 22.04.2 LTS).
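
In case it helps with reproducing the problem, the steps above can be sketched as a small script. This is only a sketch under some assumptions: `pdftotext` (from poppler-utils) is not a tool I mentioned above and is used here just to test whether any text can be extracted from the output; `has_text`, `reproduce`, and `direct_tesseract` are hypothetical helper names; and `OCR_test_eng.png` is an assumed name for the original image.

```shell
#!/usr/bin/env bash
# Sketch: re-run the steps from the list and check the result for a text layer.
# Assumption: pdftotext (poppler-utils) is installed; it is not part of my original steps.

# Succeeds if the text on stdin contains at least one non-whitespace character.
has_text() {
    grep -q '[^[:space:]]'
}

# Run OCRmyPDF with --force-ocr, then test whether any text can be extracted.
reproduce() {
    local input="$1" output="$2"
    tesseract --version        # confirm which Tesseract is actually on PATH
    tesseract --list-langs     # confirm that "eng" is available
    ocrmypdf -l eng --force-ocr "$input" "$output" || return 1
    if pdftotext "$output" - | has_text; then
        echo "text layer found in $output"
    else
        echo "no text layer found in $output"
    fi
}

# Call Tesseract directly on the image to see whether it can write a PDF at all.
# The second argument is an output base name; Tesseract appends ".pdf" itself.
direct_tesseract() {
    local image="$1" outbase="$2"
    tesseract "$image" "$outbase" -l eng pdf
}

# Only run when the tools and the input file are present.
if command -v ocrmypdf >/dev/null && command -v pdftotext >/dev/null \
        && [ -f OCR_test_eng.pdf ]; then
    reproduce OCR_test_eng.pdf outputOCR.pdf
fi

# OCR_test_eng.png is an assumed file name for the source image.
if command -v tesseract >/dev/null && [ -f OCR_test_eng.png ]; then
    direct_tesseract OCR_test_eng.png tesseract_direct
fi
```

If `pdftotext` extracts text but the viewer's search still finds nothing, the problem is probably in the viewer rather than in the OCR step; if the direct Tesseract call fails too, that would match what pdfsandwich reported.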
I have had no success with previous attempts at using Tesseract for OCR on Linux, and I'm hoping to finally resolve this issue. Any guidance would be greatly appreciated.
Best,
Filippos
---